Bringing reasoning to Granite

Authors

David Cox

VP, AI Models; IBM Director, MIT-IBM Watson AI Lab

Kate Soule

Director, Technical Product Management, Granite, IBM

To help solve the problems that we face in our daily lives, AI needs to be able to reason through things like we do. To that end, we’re excited to announce a preview release of new reasoning capabilities in our Granite family of large language models.

One of the most exciting new frontiers in generative AI is improving model performance by using more inference-time compute, so that the model can reason about a task before answering. Simply allowing a model to spend some time “thinking” at inference time, breaking a problem down into multiple steps and considering an issue from multiple angles, can markedly improve the quality of its final answer.

With this preview release, available in the Granite Experiments section of our Hugging Face page and on IBM watsonx.ai for easy, no-cost inferencing, we’re providing a sneak peek at some of the new reasoning capabilities coming in our next official release, Granite 3.2. Our approach has unique features: trained with trusted and transparent data, our reasoning model preserves general performance and safety characteristics, and users can switch the model’s reasoning capabilities on and off so that reasoning is used only when it makes sense. We believe this pragmatic approach to reasoning models will enable developers to harness the potential of inference-time compute without compromising the efficiency or trustworthiness of the model.

Teaching rocks to reason

At the core of many modern approaches to reasoning is the idea of “chain of thought,” originally uncovered by Google researchers in 2022. They found that simply asking a model to think “step by step” could significantly improve the quality and correctness of its answers. As an added bonus, this approach lets the model effectively show its work, articulating a step-by-step sequence of reasoning that leads from the question to an answer.
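For a sense of what this looks like in practice, here is a minimal sketch of zero-shot chain-of-thought prompting with the Hugging Face transformers library; the model ID and prompt are illustrative, not taken from the original experiments.

```python
# A minimal sketch of zero-shot chain-of-thought prompting: simply asking
# the model to reason step by step. Model ID and prompt are illustrative.
from transformers import pipeline

generator = pipeline("text-generation", model="ibm-granite/granite-3.1-8b-instruct")

prompt = (
    "A train travels 120 km in 90 minutes. What is its average speed in km/h?\n"
    "Let's think step by step."
)

# The generated text walks through intermediate steps before the final
# answer, so the model effectively "shows its work".
print(generator(prompt, max_new_tokens=256)[0]["generated_text"])
```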

More recently, teams at several companies and universities have combined this idea with reinforcement learning, which allows a model to learn by trial and error. DeepSeek recently captured the public imagination by using reinforcement learning to teach reasoning to its 671B-parameter model, DeepSeek-R1. Taught to produce very long chains of thought before answering, DeepSeek-R1 achieves impressive performance in certain domains, such as solving math problems and coding. DeepSeek further demonstrated that smaller, third-party models like Llama-3.1-8B-Instruct and Qwen-7B-Instruct could also be taught to reason through a process called “model distillation,” in which the smaller models are fine-tuned on a large number of reasoning examples produced by DeepSeek-R1.
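To make the distillation recipe concrete, here is a minimal sketch of that process as straightforward supervised fine-tuning with Hugging Face transformers; the trace format, student model, and training settings are illustrative placeholders, not DeepSeek’s actual pipeline.

```python
# A minimal sketch of reasoning distillation: fine-tune a small "student"
# model on chain-of-thought traces produced by a larger "teacher" model.
# Trace format and hyperparameters are illustrative, not DeepSeek's recipe.
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

student_id = "meta-llama/Llama-3.1-8B-Instruct"  # one of the students named above
tokenizer = AutoTokenizer.from_pretrained(student_id)
model = AutoModelForCausalLM.from_pretrained(student_id)

# Hypothetical teacher-generated traces: each record pairs a question with
# the teacher's full chain of thought and its final answer.
traces = Dataset.from_list([
    {"text": "Q: What is 17 * 24?\n<think>17 * 24 = 17 * 20 + 17 * 4 "
             "= 340 + 68 = 408</think>\nA: 408"},
    # ... many more traces sampled from the teacher ...
])

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=4096)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="distilled-student", num_train_epochs=1),
    train_dataset=traces.map(tokenize, batched=True, remove_columns=["text"]),
    # Standard causal-LM objective: labels are the input tokens themselves.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```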

We have been working on similar ideas and developing some of our own reinforcement learning-based techniques for triggering chain-of-thought reasoning across any domain. Our approach does not rely on a large teacher model like DeepSeek-R1 to drive model distillation but instead applies reinforcement learning directly on top of our granite-3.1-8b-instruct model. This approach helps ensure that critical characteristics like the original model’s safety and general performance are preserved.

Progress in AI research has moved at an astonishing pace over the past few years, and a large part of that progress has come from mixing and matching techniques described by others in the open-source community. This is early work that we will continue to evolve, and we expect to keep combining methods from the community with additional ideas that we have cooking at IBM Research.

General-purpose reasoning without compromise

The figure below compares the general performance of the Granite, Llama, and Qwen models with and without reasoning.

We can see that the new reasoning capabilities of Granite lead to a large jump in performance on ArenaHard and AlpacaEval, popular benchmarks that measure complex instruction following, without sacrificing performance in other domains.

Further, while DeepSeek-R1 and its distilled models boast impressive results in narrow domains like mathematics and code, the small distilled models show a loss in performance when evaluated on these generalist tasks. Granite shows how reasoning can boost performance across a wide range of tasks without sacrificing general performance. This is especially important when you consider the kinds of tasks that businesses rely on an LLM to perform, including instruction following, retrieval-augmented generation (RAG), and the key components of agentic workflows, such as function calling.

What’s equally important is that our approach aims to keep safety at the heart of what we do. While techniques such as the ones used by R1 can degrade model safety, our preview release shows that reasoning and safety don’t have to be a trade-off.

Reasoning when you want it

One issue with long chain-of-thought reasoning is that it can be quite costly. “Thinking” for a long time to get an answer can lead to a better answer, but not every task demands this level of deliberation.

For example, when asked “Where is Rome?” in one test using deepinfra.com’s hosted version of DeepSeek-R1, the model took 50.9 seconds to answer, producing multiple paragraphs of thoughts on how to approach the question. When the same prompt was sent to DeepSeek-V3, an earlier version of the model without reasoning, it produced a nearly identical answer in just 11.2 seconds.
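A comparison like this is easy to reproduce in rough form. Here is a sketch that times both models through an OpenAI-compatible chat endpoint; the endpoint URL, model IDs, and environment variable are assumptions based on deepinfra.com’s hosted API and may need adjusting for your provider.

```python
# A rough sketch of the latency comparison above, assuming an
# OpenAI-compatible chat endpoint. The URL, model IDs, and environment
# variable are assumptions and may differ for your provider.
import os
import time

import requests

def timed_answer(model: str, prompt: str) -> float:
    """Send one chat request and return wall-clock seconds to a full answer."""
    start = time.perf_counter()
    resp = requests.post(
        "https://api.deepinfra.com/v1/openai/chat/completions",
        headers={"Authorization": f"Bearer {os.environ['DEEPINFRA_API_KEY']}"},
        json={"model": model, "messages": [{"role": "user", "content": prompt}]},
        timeout=300,
    )
    resp.raise_for_status()
    return time.perf_counter() - start

for model in ("deepseek-ai/DeepSeek-R1", "deepseek-ai/DeepSeek-V3"):
    print(f"{model}: {timed_answer(model, 'Where is Rome?'):.1f}s")
```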

Long, detailed reasoning chains take time to generate, which degrades the interactivity of an experience and ultimately costs more to run. This is especially important in agentic workflows, where long thoughts can extend the time needed to make the many sequential LLM inference calls required to create a plan of action, access information via external APIs, and react to the information received.

In today’s preview Granite release, we’re giving developers the ability to turn reasoning on and off as they want, simply by passing a flag along with the prompt to modulate the model’s behavior, as sketched below. If you want to invoke reasoning to spend more time on a critical question, it’s as easy as setting the flag; without it, the model runs as it normally would. Conditional activation of reasoning is a simple but powerful feature that puts the developer in control of how the model works.
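Here is a minimal sketch of what this looks like with Hugging Face transformers, assuming the preview model’s chat template accepts a boolean thinking flag; the repo ID and flag name reflect the preview release and may change in the final Granite 3.2.

```python
# A minimal sketch of toggling reasoning via the chat template. The repo ID
# and the "thinking" flag reflect the preview release and may change.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ibm-granite/granite-3.2-8b-instruct-preview"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "How many prime numbers are below 50?"}]

# thinking=True switches the chat template into reasoning mode; omit the
# flag (or pass thinking=False) for a normal, direct response.
inputs = tokenizer.apply_chat_template(
    messages, thinking=True, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(output[0, inputs.shape[1]:], skip_special_tokens=True))
```

Because the same code path serves both modes, a single deployment can answer quick questions directly and reserve longer chains of thought for the prompts that warrant them.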

Let us know what you think

Our preview model is available on IBM’s Granite Hugging Face page and on IBM watsonx.ai. As with the mainline releases of all our Granite models, this preview is available under a clear, no-nonsense Apache 2.0 open license, which allows you to use the models however you like. We invite you to try them out, and we look forward to hearing any feedback you have as you use them. In the meantime, we’re busy baking these and other advanced features into Granite 3.2.