As reasoning models like OpenAI’s o1, DeepSeek-R1 and Google’s Gemini 2.5 compete to top AI intelligence benchmarks, enterprises looking to integrate AI are becoming increasingly wary of something called “model bloat”: the phenomenon in which models become unnecessarily large or complex, driving up computational costs and training time while slowing the responses enterprises need.
OpenAI’s o1 and DeepSeek-R1 use chain of thought (CoT) reasoning to break complex problems into steps, achieving unprecedented performance and greater accuracy than prior models. But CoT also demands substantial computational resources during inference, leading to lengthy outputs and higher latency, says Volkmar Uhlig, a VP and AI Infrastructure Portfolio Lead at IBM, in an interview with IBM Think.
Enter a new class of prompting techniques, described in various new papers, ranging from atom of thought (AoT) to chain of draft (CoD), seeking to increase the efficiency and accuracy of CoT by helping models solve problems more quickly—thereby cutting down on costs and latency.
AI scientist and startup founder Lance Elliott sees the new offshoots of chain of thought as variations in a prompt engineer’s toolkit. “Your typical home handiwork toolkit might have a regular hammer—that would be CoT,” he tells IBM Think. “AoT would be akin to using a specialized hammer used for situations involving cutting and adjusting drywall. You could use a regular hammer for drywall work, but it would be advisable to use a drywall hammer if you had one and knew how to use it properly.”
Vyoma Gajjar, an AI Technical Solution Architect at IBM, sees potential in these new CoT cousins, especially for enterprises “looking for more cost-efficient ways to prompt small models to get accurate answers for their specific use cases,” she says.
In contrast to chain of thought, which solves complex problems by breaking them into detailed, sequential steps, AoT uses a divide-and-conquer strategy. Specifically, AoT splits the steps of a problem into “atomic questions” that are processed in parallel, as the authors of one paper from the Hong Kong University of Science and Technology and Renmin University of China explain, then assembles the individual solutions to reach a final answer.
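The decompose-in-parallel-then-assemble pattern can be sketched in a few lines of Python. This is only an illustration of the orchestration idea, not the paper’s actual framework: `ask_llm` is a hypothetical stand-in for a real model call, here returning canned answers so the sketch runs.

```python
from concurrent.futures import ThreadPoolExecutor

def ask_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real model call.
    Returns canned answers so the sketch is runnable."""
    canned = {
        "Who directed Jaws?": "Steven Spielberg",
        "In what year was Jaws released?": "1975",
    }
    return canned.get(prompt, "unknown")

def atom_of_thought(question: str, atomic_questions: list[str]) -> str:
    # 1. Decompose: the atomic sub-questions are assumed to be
    #    independent, so they can be answered in parallel.
    with ThreadPoolExecutor() as pool:
        answers = list(pool.map(ask_llm, atomic_questions))
    # 2. Assemble: fold the partial answers into one final prompt.
    #    A real implementation would send this back to the model.
    context = "; ".join(f"{q} -> {a}" for q, a in zip(atomic_questions, answers))
    return f"{question} given [{context}]"

final_prompt = atom_of_thought(
    "Who directed Jaws, and when was it released?",
    ["Who directed Jaws?", "In what year was Jaws released?"],
)
print(final_prompt)
```

Because the atomic questions carry no dependence on one another, the model calls can overlap in time, which is where the latency savings come from.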
AoT can function as both a standalone framework and as a plug-in enhancement. When the authors used AoT with OpenAI’s GPT-4o mini, it surpassed several reasoning models across six baseline benchmarks, including o3-mini by 3.4% and DeepSeek-R1 by 10.6% on the HotpotQA dataset.
Gajjar sees promise in AoT for enterprise applications that seek to balance performance with maintaining a given cost profile. “The separate tasks run in parallel, and then you let these tasks, or ‘atoms,’ speak to each other, to get the most accurate solution, as an electron speaks to a proton,” she says in an interview with IBM Think.
The paper’s authors confirm that AoT reaches “competitive performance at significantly lower computational costs compared to existing methods,” adding that “this enhanced efficiency can be attributed to our atomic state representation that preserves only necessary information while eliminating redundant computations.”
AoT doesn’t work well for all use cases, however. Elliott, the AI scientist, says that AoT is most likely to be helpful “when using generative AI for deriving mathematical proofs, producing programming code, and for highly structured reasoning tasks.” And it would be less likely to improve efficiency with creative writing tasks and engaging in conversation, he says.
Meanwhile, chain-of-draft prompting tackles the bottleneck that can occur when reasoning models produce verbose, highly detailed steps that increase latency. This phenomenon represents a key difference between reasoning models and humans, who tend to “rely on concise drafts or shorthand notes to capture essential insights without unnecessary elaboration,” write the authors from Zoom Communications in a new paper on CoD.
“The latency issue has often been overlooked,” the paper’s authors write. “However, it is crucial for lots of real-time applications to have low latency while maintaining high-quality responses.”
With CoD prompting, an LLM is encouraged to produce a concise explanation as it reasons its way to an answer. In the paper’s experiments, for example, the CoT control prompt said, “Think step by step to answer the following question. Return the answer at the end of the response after a separator ####.” In contrast, the CoD prompt instructed the model to “Think step by step, but only keep a minimum draft for each thinking step, with 5 words at most. Return the answer at the end of the response after a separator.”
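The two prompt styles quoted above differ only in their instructions, so they are easy to wrap in small helpers. The sketch below assumes hypothetical helper names; the prompt wording itself is taken verbatim from the article.

```python
def cot_prompt(question: str) -> str:
    # Chain-of-thought control prompt: verbose, step-by-step reasoning.
    return ("Think step by step to answer the following question. "
            "Return the answer at the end of the response after a "
            "separator ####.\n" + question)

def cod_prompt(question: str) -> str:
    # Chain-of-draft prompt: each reasoning step capped at five words.
    return ("Think step by step, but only keep a minimum draft for "
            "each thinking step, with 5 words at most. Return the "
            "answer at the end of the response after a separator.\n"
            + question)

def extract_answer(response: str) -> str:
    # Both prompts ask the model to put the final answer after '####'.
    return response.split("####")[-1].strip()
```

Either prompt can then be sent to any chat model, and `extract_answer` pulls the final answer out of the response regardless of how terse or verbose the intermediate reasoning was.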
Using OpenAI’s GPT-4o and Anthropic’s Claude 3.5 Sonnet, the researchers found that CoD matched or surpassed CoT in accuracy while using 92.4% fewer tokens, reducing cost and latency across various reasoning tasks.
“We are in a whole new world of algorithmic exploration,” says IBM’s Uhlig. “If you prompt train differently, you can dramatically reduce the number of tokens. This is a very natural next step.”
While many new prompting techniques continue to appear, one called “skeleton of thought” (SoT) is notable for combining elements of both atom of thought and chain of draft. The authors of a paper proposing the technique say they were motivated by “the writing and thinking process of humans.” SoT prompting guides the LLM to generate the skeleton of an answer, then completes the content of each skeleton point in parallel.
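The two-stage skeleton-then-expand flow can be sketched as below. As with the earlier sketches, `ask_llm` is a hypothetical stub returning canned text so the example runs; a real implementation would issue actual model calls, with the expansion calls proceeding in parallel exactly as shown.

```python
from concurrent.futures import ThreadPoolExecutor

def ask_llm(prompt: str) -> str:
    """Hypothetical model call; returns canned text so the sketch runs."""
    if prompt.startswith("Outline"):
        return "1. Active listening\n2. Mediation\n3. Clear policies"
    return f"[expanded: {prompt}]"

def skeleton_of_thought(question: str) -> str:
    # Stage 1: ask the model for a short skeleton of the answer.
    skeleton = ask_llm(f"Outline a skeleton answer to: {question}")
    points = [line.split(". ", 1)[1] for line in skeleton.splitlines()]
    # Stage 2: expand every skeleton point in parallel, since the
    # expansions do not depend on one another.
    with ThreadPoolExecutor() as pool:
        bodies = list(pool.map(
            lambda p: ask_llm(f"Expand the point '{p}' for: {question}"),
            points))
    return "\n".join(f"{p}: {b}" for p, b in zip(points, bodies))

answer = skeleton_of_thought(
    "What are the most effective strategies for conflict resolution "
    "in the workplace?")
print(answer)
```

The speed-up comes from stage 2: instead of decoding one long sequential answer, the model fills in several short, independent sections at once.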
Using skeleton of thought, the authors from Tsinghua University in China and Microsoft Research were able to speed up the functioning of various LLMs as well as improve the accuracy of answers in several categories. “We show the feasibility of parallel decoding of off-the-shelf LLMs without any changes to their model, system or hardware,” they write.
For instance, the researchers asked the model the question: “What are the most effective strategies for conflict resolution in the workplace?” Using SoT prompting, the authors decreased the latency from 22 seconds to 12 seconds (a 1.83x speed-up) with Claude, and from 43 seconds to 16 seconds (a 2.69x speed-up) with Vicuna 33B V1.3.
None of the prompting techniques will work for every challenge; ultimately, the task at hand will determine the most efficient option in the “prompt engineer’s toolkit,” Elliott says. “Knowing how generative AI works under the hood is highly advantageous,” he explains. “It’s like driving a car. You don’t necessarily need to know the intricate details of how an engine or transmission works, but at least being familiar with some key principles can go a long way toward better handling an automobile. You are better prepared for situations such as icy roads, wet roads, driving on hilly roads and handling tight curves.”