Granite 4.0 arrives as the latest entrant in a crowded field. OpenAI, Meta and Mistral have each unveiled ever-larger models measured in the hundreds of billions of parameters. IBM is taking a different route. By emphasizing efficiency, the company is betting that enterprises will prioritize cost, governance and reliability over raw scale.

At the center of the release is a hybrid design that blends conventional transformer layers with Mamba-2, an emerging state-space sequence model. Transformers have long powered modern AI, but their compute demands rise quadratically with the length of the input. That scaling makes them expensive and slow for long documents or multiple concurrent sessions. By contrast, Mamba scales linearly with input length, reducing the memory burden. IBM’s approach combines the two.
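To make the scaling contrast concrete, here is a rough back-of-the-envelope sketch in Python. It is not IBM's method or a model of Granite 4.0. The function names and the `state_size` value of 128 are illustrative assumptions, chosen only to show how a quadratic cost and a linear cost diverge as the input gets longer.

```python
def attention_cost(seq_len: int) -> int:
    """Toy cost for full self-attention: every token attends to every token."""
    return seq_len * seq_len


def linear_scan_cost(seq_len: int, state_size: int = 128) -> int:
    """Toy cost for a state-space style scan: one fixed-size update per token.

    state_size is a hypothetical constant, not a published Granite figure.
    """
    return seq_len * state_size


def growth_factor(cost_fn, base_len: int, multiplier: int) -> float:
    """How much the cost grows when the input becomes `multiplier` times longer."""
    return cost_fn(base_len * multiplier) / cost_fn(base_len)


if __name__ == "__main__":
    for n in (1_000, 10_000, 100_000):
        print(f"tokens={n:>7}  quadratic={attention_cost(n):>15,}  "
              f"linear={linear_scan_cost(n):>12,}")
```

Under these toy assumptions, making the input 10 times longer multiplies the quadratic cost by 100 but the linear cost by only 10, which is why the gap matters most for long documents and many concurrent sessions.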

The result is a family of models that can reduce memory use by as much as 70% to 80% compared with transformer-only systems. Soule said the shift is aimed directly at the pain points enterprises face when moving beyond pilots. Many proofs of concept are built on massive models, but those systems often prove too costly to run at scale.

“You do the math and see both the cost of how many GPUs you would need to host and maintain, potentially on-prem, and what real-world users will tolerate from latency perspectives,” she said. “Customers quickly realize they need smaller options to deploy at scale.”