A new class of AI models is challenging the dominance of GPT-style systems, promising faster, cheaper and potentially more powerful alternatives.
Inception Labs, a startup founded by researchers from Stanford, recently released Mercury, a diffusion-based language model (dLLM) that refines entire phrases at once, rather than predicting words one by one. Unlike traditional large language models (LLMs), which use an autoregressive approach—generating one word at a time, based on the preceding text—diffusion models draft the whole output and then improve it through successive rounds of refinement.
“dLLMs expand the possibility frontier,” Stefano Ermon, a Stanford University computer science professor and co-founder of Inception Labs, tells IBM Think. “Mercury provides unmatched speed and efficiency, and—by leveraging more test-time compute—dLLMs will also set the bar for quality and improve overall customer satisfaction for edge and enterprise applications.”
IBM Research Engineer Benjamin Hoover sees the writing on the wall: “It’s just a matter of two or three years before most people start switching to using diffusion models,” he says. “When I saw Inception Labs’ model, I realized, ‘This is going to happen sooner rather than later.’”
Diffusion models don’t play by the same rules as traditional AI. Autoregressive models like GPT build sentences word by word, predicting one token at a time. If a model is generating the phrase “To whom it may concern,” it predicts “To,” then “whom,” then “it,” and so on—one step at a time. Diffusion models flip the script. Instead of piecing together text sequentially, they start with a rough, noisy version of an entire passage and refine it in multiple steps. Think of it like an artist sketching a rough outline before sharpening the details, rather than drawing each element in order. By considering the whole sentence at once, diffusion models can generate responses faster, often with more coherence and accuracy than traditional LLMs.
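To make the contrast concrete, here is a minimal Python sketch of the two decoding loops. The “models” here are toy stand-ins (a lookup for the next word, and a function that unmasks a few positions per pass), not Mercury’s actual algorithm; the point is only that the autoregressive loop makes one model call per token, while the diffusion-style loop refines every position over a small, fixed number of passes.

```python
import math
import random

TARGET = ["To", "whom", "it", "may", "concern"]
MASK = "[MASK]"

def autoregressive_decode(length, predict_next):
    """Generate one token at a time, each conditioned on the prefix so far."""
    tokens = []
    for _ in range(length):
        tokens.append(predict_next(tokens))      # one model call per token
    return tokens

def diffusion_decode(length, denoise, steps=3):
    """Start from an all-masked ("noisy") sequence and refine the whole
    sequence over a small, fixed number of passes."""
    tokens = [MASK] * length
    per_step = math.ceil(length / steps)
    for _ in range(steps):
        tokens = denoise(tokens, per_step)       # one pass updates many positions
    return tokens

# Toy stand-ins for real models (purely illustrative):
def toy_next_token(prefix):
    """'Predicts' the next word of the target phrase from the prefix length."""
    return TARGET[len(prefix)]

def toy_denoise(tokens, k):
    """'Denoises' by filling in up to k still-masked positions in one pass."""
    out = list(tokens)
    masked = [i for i, t in enumerate(out) if t == MASK]
    for i in random.sample(masked, min(k, len(masked))):
        out[i] = TARGET[i]
    return out

print(autoregressive_decode(5, toy_next_token))   # 5 sequential model calls
print(diffusion_decode(5, toy_denoise, steps=3))  # 3 refinement passes for all 5 words
```

In a real dLLM, the denoising step would be a neural network scoring every position at once, so latency is set by the number of refinement passes rather than by the length of the response.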
Hoover sees the technology as a modern twist on an older concept. “Diffusion models are fundamentally error-correction mechanisms,” he says. “They work by starting with a noisy input, and gradually removing the noise until they arrive at the desired output.”
Diffusion models have been widely used in image generation, with models like DALL·E, Stable Diffusion and Midjourney refining noisy images into high-quality visuals. However, applying this approach to text is more difficult because language requires strict adherence to grammar and syntax.
“Many attempts to apply diffusion models to text generation have struggled in the past,” says Ermon. “What enabled Mercury to succeed where others failed are proprietary innovations in both training and inference algorithms.” Unlike images, which can be gradually cleaned up into recognizable forms, language follows rigid grammatical rules that make iterative refinement trickier.
Hoover points to Inception Labs’ Mercury as a prime example of how diffusion models are closing the gap. “That model proved diffusion could hold its own and is actually faster and more efficient than comparable autoregressive models.”
The efficiency of diffusion-based LLMs could shake up AI deployment, particularly in enterprise applications where cost and speed matter. Traditional LLMs require substantial computing power, making them expensive to run. Diffusion models promise similar or better performance at a fraction of the cost, largely because they refine entire sequences in parallel rather than generating each word step by step, which cuts the number of sequential model passes needed per response.
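As a rough back-of-the-envelope illustration (the response length and step count below are assumptions for the sake of arithmetic, not benchmarks of Mercury or any other model), the number of sequential forward passes scales very differently under the two approaches:

```python
def autoregressive_passes(num_tokens: int) -> int:
    """One forward pass per generated token."""
    return num_tokens

def diffusion_passes(num_refinement_steps: int) -> int:
    """One forward pass per refinement step, regardless of response length."""
    return num_refinement_steps

response_length = 500    # tokens in a longish answer (assumed)
refinement_steps = 20    # denoising passes (assumed)

print(autoregressive_passes(response_length))   # 500 sequential passes
print(diffusion_passes(refinement_steps))       # 20 passes, each covering the whole sequence
```

Each refinement pass still processes the entire sequence, so the per-pass compute is not free, but the pass count no longer grows with the length of the output, which is where the speed and cost savings come from.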
“Our customers and early adopters are developing applications powered by dLLMs in areas including customer support, sales and gaming,” Ermon says. “They’re making their applications more responsive, more intelligent and cheaper.”
Hoover sees an even broader impact. “Right now, AI is constrained by energy consumption,” he says. “Large models use enormous amounts of power. However, diffusion models operate differently, allowing for far greater efficiency. In the long run, we could see diffusion-based AI systems running on analog hardware, reducing energy costs dramatically.”
Analog computing, which processes information using continuous electrical signals rather than binary operations, has long been touted as a potential solution to AI’s energy problem. Hoover believes diffusion models are particularly well-suited to this approach.
“These models are inherently interpretable,” he says. “That means we can map their internal computations directly onto analog circuits, something that’s much harder to do with traditional deep learning architectures.”