Hybrid thinking: Inside the architecture of IBM’s Granite 4.0


By Sascha Brodsky, Staff Writer, IBM

AI has a scale problem.

The largest language models today can generate astonishing text, but they do so at enormous cost. Their transformer-based architecture, a system that compares every word to every other word to capture relationships, devours memory as input length grows. That pushes enterprises into a cycle of buying ever more powerful GPUs and absorbing spiraling compute bills. This tension between performance and practicality has become one of the defining challenges of today’s AI adoption.

A tale of two architectures

Granite 4.0, IBM’s newest family of open-weight models, is built around a different idea. Instead of chasing brute-force scale with trillion-parameter models, IBM is betting on a hybrid architecture that blends transformers with state-space models (SSMs), which store and update information over time to handle longer inputs more efficiently.

A transformer is a neural network that reads an entire text sequence in parallel, using a mechanism called self-attention to determine how words relate to one another in context. By weighing those relationships, it builds a structured understanding of meaning, which allows it to generate coherent, human-like responses rather than disconnected strings of words.
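To make that mechanism concrete, here is a minimal sketch of scaled dot-product self-attention in Python with NumPy. It illustrates the idea only; it omits the learned projections, multiple heads and other machinery of a real transformer.

```python
import numpy as np

def self_attention(Q, K, V):
    """Scaled dot-product attention: every token attends to every token.

    Q, K and V have shape (seq_len, d). The (seq_len, seq_len) score
    matrix is what makes attention's cost grow quadratically with
    input length.
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                   # pairwise token scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # weighted mix of values

# Toy example: 4 tokens with 8-dimensional embeddings, attending to themselves.
x = np.random.randn(4, 8)
print(self_attention(x, x, x).shape)  # (4, 8)
```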

State-space models take a different route. Instead of reading all the words at once, as transformers do, they process one token at a time and update an internal “state” that summarizes everything that came before. This makes them far more memory-efficient, since they don’t need to store every pairwise relationship across the entire sequence.
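The contrast is visible in code, too. Below is a bare-bones linear state-space recurrence in the same NumPy style. The Mamba-2 layers Granite actually uses are gated, input-dependent refinements of this idea, so the matrices here are placeholders.

```python
import numpy as np

def ssm_scan(A, B, C, xs):
    """Process tokens one at a time, carrying a fixed-size state.

    A: (n, n) state transition, B: (n, d) input map, C: (d, n) output map.
    Memory use is constant: only the n-dimensional state h persists,
    no matter how long the input sequence grows.
    """
    h = np.zeros(A.shape[0])        # the running summary of everything seen so far
    ys = []
    for x in xs:                    # one token embedding at a time
        h = A @ h + B @ x           # fold the new token into the state
        ys.append(C @ h)            # emit an output from the current state
    return np.stack(ys)

# Toy example: 1,000 tokens, 8-dim embeddings, 16-dim hidden state.
xs = np.random.randn(1000, 8)
A = 0.9 * np.eye(16)                # a stable, decaying memory for illustration
B = np.random.randn(16, 8) * 0.1
C = np.random.randn(8, 16) * 0.1
print(ssm_scan(A, B, C, xs).shape)  # (1000, 8)
```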

By combining the two, Granite 4.0 aims to reduce the transformer’s quadratic memory burden while maintaining its accuracy and flexibility. The goal is to make long-context workloads run more efficiently on hardware that enterprises can actually afford.

Keep your memory in check

“We see memory reductions as large as 70%,” said Kate Soule, Director of Technical Product Management for IBM’s AI models, in an interview with IBM Think. Soule explained that the new architecture dramatically improves efficiency, reducing the need for clusters of high-end GPUs like the NVIDIA H100. Instead of spreading workloads across several chips, she said, enterprises can now run full production tasks on a single H100.

Early examples like Mamba and Mamba-2 showed that SSMs could handle very long texts that transformers found difficult, although they were weaker on tasks where transformers excel, like few-shot reasoning.

Granite 4.0 uses a Mamba-2 hybrid, consisting primarily of SSM layers interleaved with the occasional transformer layer. The SSM layers manage long-range memory, while the transformer layers add the fine-grained attention needed for reasoning and pattern recognition. The design aims to overcome a key limitation of traditional transformers, which require rapidly increasing memory as inputs grow longer. “By using a hybrid architecture, which is only partially transformer-based, we are able to introduce a whole new type of architecture that does not have that constraint,” Soule said. “Memory scales linearly.”
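The interleaving itself is simple to picture. The sketch below is purely illustrative: the layer names are stand-ins, and the 9-to-1 ratio is one plausible mix rather than Granite’s published configuration. But it shows the shape of the design, including the absence of positional encoding discussed later in this article.

```python
# Illustrative sketch of the interleaving idea; layer names are stand-ins,
# and the 9:1 ratio is one plausible mix, not necessarily Granite's
# published configuration.

def build_hybrid_stack(num_blocks: int = 4, ssm_per_attention: int = 9) -> list[str]:
    """Return a layer plan: mostly SSM layers, with occasional attention.

    SSM layers carry the long-range, constant-size memory; each attention
    layer adds fine-grained token-to-token comparison. Note what is absent:
    no positional encoding anywhere, because the SSM layers preserve token
    order simply by reading the sequence in order.
    """
    stack: list[str] = []
    for _ in range(num_blocks):
        stack.extend(["ssm"] * ssm_per_attention)  # the bulk of the layers
        stack.append("attention")                  # the occasional transformer layer
    return stack

plan = build_hybrid_stack(num_blocks=2)
print(len(plan), plan.count("ssm"), plan.count("attention"))  # 20 18 2
```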

The results are not just theoretical. More than two-thirds of the layers in Granite hybrids are SSMs, Soule said, and in internal IBM studies that composition paid off in real-world deployments, from customer service centers managing thousands of conversations to development teams working with entire code repositories. In tasks where transformer models can sometimes stall or crash, hybrids tend to maintain steadier throughput. Soule put it plainly: “It’s really a memory-saving game.”
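A back-of-envelope calculation makes the memory argument concrete. The dimensions below are invented for illustration, not Granite’s actual configuration, but they show how a transformer’s key-value cache grows with every token while an SSM layer’s state stays fixed.

```python
# Back-of-envelope memory comparison in fp16 (2 bytes per value). All
# dimensions are invented for illustration, not Granite's configuration.

BYTES = 2
layers, heads, head_dim = 40, 32, 128

def kv_cache_bytes(seq_len: int) -> int:
    # An attention layer caches a key and a value vector per head for
    # every past token, so the cache grows linearly with context length.
    return 2 * layers * heads * head_dim * seq_len * BYTES

def ssm_state_bytes(state_dim: int = 128) -> int:
    # An SSM layer keeps one fixed-size state, however long the input grows.
    return layers * heads * head_dim * state_dim * BYTES

for n in (8_000, 128_000):
    print(f"{n:>7,} tokens: KV cache {kv_cache_bytes(n) / 1e9:5.1f} GB, "
          f"SSM state {ssm_state_bytes() / 1e9:5.2f} GB (constant)")
```

In a hybrid, only the occasional attention layers pay the cache cost that grows with context length, which is where savings on the order Soule describes can come from.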


Starting a new trend

Others in the field are exploring the same path. NVIDIA’s Nemotron-H series combines Mamba layers with transformers and reports notable speedups. Research papers have found that hybrids and SSMs outperform transformers at very long contexts in both speed and efficiency. The approach also extends beyond text: for vision and multimodal tasks, hybrids such as MambaVision and MaTVLM combine SSM and transformer components to capture similar efficiency gains.

What sets Granite 4.0 apart is not just the hybrid itself but its packaging for enterprise. The models in the family are open-weight, meaning their parameters are available for anyone to inspect. Granite 4.0 is certified under the new ISO/IEC 42001:2023 standard for responsible AI management systems, cryptographically signed to guarantee provenance and covered by IBM’s indemnity when used on watsonx.ai. IBM has also launched a bug bounty program with HackerOne, offering rewards for vulnerabilities or alignment issues.
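Because the weights are open, experimenting takes only a few lines. The sketch below uses the Hugging Face transformers library; the model identifier is an assumption based on IBM’s ibm-granite organization on the hub, so check the model card for the exact name and requirements.

```python
# Sketch: loading an open-weight Granite model with Hugging Face
# transformers. The model ID is an assumption; check IBM's ibm-granite
# organization on the hub for the exact name.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ibm-granite/granite-4.0-h-small"   # assumed identifier

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [{"role": "user", "content": "Summarize the key risks in this contract: ..."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(inputs, max_new_tokens=200)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```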

The architecture also changes the limits of context length. Granite 4.0 models were trained on sequences of over half a million tokens and validated at up to 128,000. Unlike transformers, which rely on positional encoding to keep track of order and can falter at unseen lengths, SSMs inherently preserve sequence order because they read sequentially. With Granite 4.0, IBM removed positional encoding entirely, eliminating one of the structural limits that has long capped transformer models.

“The architecture is not putting a hard constraint on how many tokens the model can consume at one time,” Soule said. “You could keep pushing that boundary further and further.”

For industries with sprawling data, that freedom matters. Financial firms analyzing years of trades, healthcare providers integrating patient records or law firms handling massive case files all face inputs that can outgrow transformer models. The hybrid approach creates architectural headroom for those tasks.

Performance numbers show Granite hybrids competing at the top of their weight class, with the three-billion-parameter hybrid outpacing IBM’s earlier eight-billion-parameter transformer. Granite-4.0-H Small, with 32 billion parameters and nine billion active, ranks near the top of instruction-following benchmarks, outperforming nearly all open models that are many times its size. It also performs strongly on function calling, where AI systems trigger external tools, and on retrieval-augmented generation, a method of pulling in outside data to improve answers.

Soule said the aim is not to dominate every chart, but to deliver where it matters. “Where we’re focusing with Granite is on the core tasks and strengths we want the model to excel at: retrieval augmented generation, tool calling, instruction following.”
