New prompting techniques tackle model bloat

31 March 2025

By Aili McConnon, Tech Reporter, IBM

As reasoning models like OpenAI’s o1, DeepSeek-R1 and Google’s Gemini 2.5 compete to top AI intelligence benchmarks, enterprises looking to integrate AI are becoming increasingly wary of “model bloat”: the phenomenon whereby models become unnecessarily large or complex, driving up computational costs and training time while slowing the responses enterprises need.

OpenAI’s o1 and DeepSeek-R1 use chain of thought (CoT) reasoning to break complex problems into steps, achieving unprecedented performance and greater accuracy than prior models. But CoT also demands substantial computational resources during inference, leading to lengthy outputs and higher latency, says Volkmar Uhlig, a VP and AI Infrastructure Portfolio Lead at IBM, in an interview with IBM Think.

Enter a new class of prompting techniques, described in a series of recent papers, that ranges from atom of thought (AoT) to chain of draft (CoD). These methods aim to improve on the efficiency and accuracy of CoT by helping models solve problems more quickly, cutting down on both costs and latency.

AI scientist and startup founder Lance Elliott sees the new offshoots of chain of thought as variations in a prompt engineer’s toolkit. “Your typical home handiwork toolkit might have a regular hammer—that would be CoT,” he tells IBM Think. “AoT would be akin to using a specialized hammer used for situations involving cutting and adjusting drywall. You could use a regular hammer for drywall work, but it would be advisable to use a drywall hammer if you had one and knew how to use it properly.”

Vyoma Gajjar, an AI Technical Solution Architect at IBM, sees potential in these new CoT cousins, especially for enterprises “looking for more cost-efficient ways to prompt small models to get accurate answers for their specific use cases,” she says.

Atom of thought: Thinking faster by dividing and conquering

In contrast to chain of thought, which solves complex problems by breaking them into detailed, sequential steps, AoT uses a divide-and-conquer strategy. Specifically, AoT splits a problem into “atomic questions” that are processed in parallel, as the authors of one paper from the Hong Kong University of Science and Technology and Renmin University of China explain, then assembles the individual solutions to reach a final answer.
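To make that decompose-solve-assemble loop concrete, here is a minimal Python sketch of an AoT-style pipeline. It is an illustration of the pattern described above, not the authors’ implementation: the prompts, the ask helper and the sample question are all assumptions, and it presumes the OpenAI Python SDK with an API key configured.

```python
# Illustrative atom-of-thought-style sketch: decompose a question into
# independent "atomic" sub-questions, answer them in parallel, then
# assemble a final answer. Prompts and helper names are assumptions,
# not the paper's implementation.
from concurrent.futures import ThreadPoolExecutor

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def ask(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

question = (
    "Which country has the larger population: the one whose capital "
    "is Ottawa or the one whose capital is Canberra?"
)

# 1. Decompose the problem into self-contained atomic questions.
atoms = ask(
    "Split this question into independent, self-contained sub-questions, "
    f"one per line, with no other text:\n{question}"
).strip().splitlines()

# 2. Answer the atoms in parallel, since none depends on another's output.
with ThreadPoolExecutor() as pool:
    answers = list(pool.map(ask, atoms))

# 3. Assemble the atomic answers into a final response.
print(ask(
    "Sub-questions and answers:\n"
    + "\n".join(f"Q: {q}\nA: {a}" for q, a in zip(atoms, answers))
    + f"\n\nUsing these, answer the original question: {question}"
))
```

Because the sub-questions are independent, the middle step costs roughly one round trip of latency rather than one per step.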

AoT can function both as a standalone framework and as a plug-in enhancement. When the authors used AoT with OpenAI’s GPT-4o mini, it surpassed several reasoning models across six benchmarks, outperforming o3-mini by 3.4% and DeepSeek-R1 by 10.6% on the HotpotQA dataset.

Gajjar sees promise in AoT for enterprise applications that seek to balance performance with maintaining a given cost profile. “The separate tasks run in parallel, and then you let these tasks, or ‘atoms,’ speak to each other, to get the most accurate solution, as an electron speaks to a proton,” she says in an interview with IBM Think.

The paper’s authors confirm that AoT reaches “competitive performance at significantly lower computational costs compared to existing methods,” adding that “this enhanced efficiency can be attributed to our atomic state representation that preserves only necessary information while eliminating redundant computations.”

AoT doesn’t work well for all use cases, however. Elliott, the AI scientist, says that AoT is most likely to be helpful “when using generative AI for deriving mathematical proofs, producing programming code, and for highly structured reasoning tasks.” It would be less likely to improve efficiency in creative writing tasks or open-ended conversation, he says.

Chain of draft: Thinking faster by writing less

Meanwhile, chain-of-draft prompting tackles the bottleneck that can occur when reasoning models produce verbose, highly detailed steps that increase latency. This phenomenon represents a key difference between reasoning models and humans, who tend to “rely on concise drafts or shorthand notes to capture essential insights without unnecessary elaboration,” write the authors from Zoom Communications in a new paper on CoD.

“The latency issue has often been overlooked,” the paper’s authors write. “However, it is crucial for lots of real-time applications to have low latency while maintaining high-quality responses.”

With CoD prompting, an LLM is encouraged to produce a concise explanation as it reasons its way to an answer. For example, in the paper’s experiments, the CoT control prompt said, “Think step by step to answer the following question. Return the answer at the end of the response after a separator ####.” In contrast, the CoD prompt instructed the model to “Think step by step, but only keep a minimum draft for each thinking step, with 5 words at most. Return the answer at the end of the response after a separator.”
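The contrast is easy to demonstrate. The sketch below sends the same question under each of the two prompts quoted above and prints the completion token count for each; the wrapper function, model choice and sample question are illustrative assumptions, and it presumes the OpenAI Python SDK with an API key configured.

```python
# Comparing chain-of-thought and chain-of-draft prompting. The two
# instruction strings are the prompts quoted in the article; everything
# else here is an illustrative assumption.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

COT_PROMPT = (
    "Think step by step to answer the following question. Return the "
    "answer at the end of the response after a separator ####."
)
COD_PROMPT = (
    "Think step by step, but only keep a minimum draft for each "
    "thinking step, with 5 words at most. Return the answer at the "
    "end of the response after a separator."
)

def solve(system_prompt: str, question: str) -> tuple[str, int]:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": question},
        ],
    )
    # Return the text plus the completion token count to compare verbosity.
    return response.choices[0].message.content, response.usage.completion_tokens

question = (
    "Jason had 20 lollipops. He gave Denny some lollipops. Now Jason "
    "has 12 lollipops. How many lollipops did Jason give to Denny?"
)
for name, prompt in [("CoT", COT_PROMPT), ("CoD", COD_PROMPT)]:
    answer, tokens = solve(prompt, question)
    print(f"{name}: {tokens} completion tokens\n{answer}\n")
```

The draft-style response typically compresses each reasoning step into a terse note, which is where the token savings come from.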

Using OpenAI’s GPT-4o and Anthropic’s Claude 3.5 Sonnet, the researchers found that CoD matched or surpassed CoT in accuracy while using as much as 92.4% fewer tokens, reducing cost and latency across various reasoning tasks.

“We are in a whole new world of algorithmic exploration,” says IBM’s Uhlig. “If you prompt train differently, you can dramatically reduce the number of tokens. This is a very natural next step.”

The use case will determine which prompting technique is best

While many new prompting techniques continue to appear, one called “skeleton of thought” (SoT) is notable for combining elements of both atom of thought and chain of draft. The authors of the paper proposing the technique say they were motivated by “the writing and thinking process of humans.” SoT prompting guides the LLM to first generate the skeleton of an answer, then fill in the content of each skeleton point in parallel.
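As a rough illustration of that two-stage pattern, the sketch below first requests a short outline and then expands each outline point concurrently. The prompts, helper names and model choice are assumptions rather than the paper’s implementation, and it presumes the OpenAI Python SDK with an API key configured.

```python
# Illustrative skeleton-of-thought sketch: get a short outline first,
# then expand every outline point in parallel. End-to-end latency is
# roughly the skeleton call plus the single slowest expansion, rather
# than the sum of all sequential steps.
from concurrent.futures import ThreadPoolExecutor

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def ask(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

question = ("What are the most effective strategies for conflict "
            "resolution in the workplace?")

# 1. Generate the skeleton: a handful of terse outline points.
skeleton = ask(
    "Answer with only a skeleton: 3-5 outline points of a few words "
    f"each, one per line, no elaboration.\nQuestion: {question}"
).strip().splitlines()

# 2. Expand each skeleton point concurrently.
def expand(point: str) -> str:
    return ask(
        f"Question: {question}\n"
        f"Expand this outline point in 2-3 sentences: {point}"
    )

with ThreadPoolExecutor() as pool:
    sections = list(pool.map(expand, skeleton))

print("\n\n".join(sections))
```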

Using skeleton of thought, the authors from Tsinghua University in China and Microsoft Research were able to speed up the functioning of various LLMs as well as improve the accuracy of answers in several categories. “We show the feasibility of parallel decoding of off-the-shelf LLMs without any changes to their model, system or hardware,” they write.

For instance, the researchers asked the model the question: “What are the most effective strategies for conflict resolution in the workplace?” Using SoT prompting, the authors decreased the latency from 22 seconds to 12 seconds (a 1.83x speed-up) with Claude, and from 43 seconds to 16 seconds (a 2.69x speed-up) with Vicuna 33B V1.3.

None of the prompting techniques will work for every challenge; ultimately, the task at hand will determine the most efficient option in the “prompt engineer’s toolkit,” Elliott says. “Knowing how generative AI works under the hood is highly advantageous,” he explains. “It’s like driving a car. You don’t necessarily need to know the intricate details of how an engine or transmission works, but at least being familiar with some key principles can go a long way toward better handling an automobile. You are better prepared for situations such as icy roads, wet roads, driving on hilly roads and handling tight curves.”
