2 May 2025
We’re excited to present IBM Granite 4.0 Tiny Preview, a preliminary version of the smallest model in the upcoming Granite 4.0 family of language models, to the open source community.
Granite 4.0 Tiny Preview is extremely compact and compute-efficient: at FP8 precision, several concurrent sessions performing long-context (128K) tasks can be run on consumer-grade hardware, including GPUs commonly available for under $350 USD.1
Though the model is only partially trained (it has seen only 2.5T of a planned 15T or more training tokens), it already offers performance rivaling that of IBM Granite 3.3 2B Instruct, despite having fewer active parameters and a roughly 72% reduction in memory requirements.2 We anticipate that Granite 4.0 Tiny’s performance will be on par with that of Granite 3.3 8B Instruct by the time it has completed training and post-training.
As its name suggests, Granite 4.0 Tiny will be among the smallest offerings in the Granite 4.0 model family. It will be officially released this summer as part of a model lineup that also includes Granite 4.0 Small and Granite 4.0 Medium. Granite 4.0 continues IBM’s firm commitment to making efficiency and practicality the cornerstone of its enterprise LLM development.
This preliminary version of Granite 4.0 Tiny is now available on Hugging Face under a standard Apache 2.0 license, though we do not yet recommend the preview version for enterprise use. Our intent is to allow even GPU-poor developers to experiment and tinker with the model on consumer-grade GPUs. The model’s novel architecture is pending support in Hugging Face transformers and vLLM, which we anticipate will be completed shortly for both projects. Official support to run this model locally through platform partners including Ollama and LM Studio is expected in time for the full model release later this summer.
LLM memory requirements are often provided, literally and figuratively, without proper context. It’s not enough to know that a model can be successfully loaded into your GPU(s): you need to know that your hardware can handle the model at the context lengths that your use case requires.
Furthermore, many enterprise use cases entail not a lone model deployment, but batch inferencing of multiple concurrent instances. Therefore, IBM endeavors to measure and report memory requirements with long context and concurrent sessions in mind.
Granite 4.0 Tiny is one of the most memory-efficient language models available today. Even at very long contexts, several concurrent instances of Granite 4.0 Tiny can easily run on a modest consumer GPU.
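To make that claim concrete, here is a back-of-the-envelope sketch of why context length and concurrency dominate inference memory: a conventional transformer must keep a KV-cache that grows with every token of every session, while an SSM layer keeps a fixed-size state regardless of context. The layer counts, head dimensions and state sizes below are illustrative placeholders, not Granite 4.0 Tiny’s actual configuration.

```python
# Back-of-envelope comparison of per-session inference memory.
# All dimensions are illustrative placeholders, not Granite 4.0's
# actual configuration.

BYTES_FP8 = 1

def kv_cache_bytes(context_len, n_layers=32, n_kv_heads=8, head_dim=128):
    # keys + values, stored per token, per layer
    return context_len * n_layers * n_kv_heads * head_dim * 2 * BYTES_FP8

def ssm_state_bytes(n_layers=32, d_state=128, d_inner=4096):
    # one fixed-size state per layer, independent of context length
    return n_layers * d_state * d_inner * BYTES_FP8

context, sessions = 128_000, 5
print(f"Transformer KV-cache, {sessions} sessions @ {context} tokens: "
      f"{sessions * kv_cache_bytes(context) / 1e9:.1f} GB")
print(f"SSM state, {sessions} sessions at any context length: "
      f"{sessions * ssm_state_bytes() / 1e9:.3f} GB")
```

Even with these made-up dimensions, the pattern is clear: the cache for a pure transformer grows with every additional token and every additional concurrent session, while the SSM state does not grow with context at all. Granite 4.0 Tiny’s hybrid design, described below, keeps only a small share of attention layers, so its memory footprint sits far closer to the second line than the first.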
Whereas prior generations of Granite LLMs utilized a conventional transformer architecture, all models in the Granite 4.0 family utilize a new hybrid Mamba-2/Transformer architecture, marrying the speed and efficiency of Mamba with the precision of transformer-based self-attention. Granite 4.0 Tiny Preview, specifically, is a fine-grained hybrid mixture of experts (MoE) model, with 7B total parameters and only 1B active parameters at inference time.
Many of the innovations informing the Granite 4.0 architecture arose from IBM Research’s collaboration with the original Mamba creators on Bamba, an experimental open source hybrid model whose successor (Bamba v2) was released earlier this week.
Mamba is a type of state space model (SSM), introduced in 2023—about 6 years after the debut of transformers in 2017.
SSMs are conceptually similar to the recurrent neural networks (RNNs) that dominated natural language processing (NLP) in the pre-transformer era. They were originally designed to predict the next state of a continuous sequence (like an electrical signal) using only information from the current state, previous state, and range of possibilities (the state space). Though they’ve been used across several domains for decades, SSMs share certain shortcomings with RNNs that, until recently, limited their potential for language modeling.
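The core recurrence is simple enough to sketch in a few lines. The version below is the classic discretized linear SSM with fixed matrices (conventionally written A, B and C); the dimensions are arbitrary and chosen purely for illustration.

```python
import numpy as np

# Minimal discretized linear state space model: the hidden state h is
# updated from the previous state and the current input, and an output
# is read out from the state. A, B and C are fixed here (no selection
# mechanism), which is the classic SSM setup described above.
d_state, d_in, d_out = 16, 4, 4
rng = np.random.default_rng(0)
A = rng.normal(scale=0.1, size=(d_state, d_state))
B = rng.normal(scale=0.1, size=(d_state, d_in))
C = rng.normal(scale=0.1, size=(d_out, d_state))

def ssm_scan(inputs):
    h = np.zeros(d_state)
    outputs = []
    for x in inputs:            # one step per element of the sequence
        h = A @ h + B @ x       # next state from current state + input
        outputs.append(C @ h)   # readout from the state
    return np.stack(outputs)

ys = ssm_scan(rng.normal(size=(10, d_in)))  # a sequence of 10 inputs
print(ys.shape)  # (10, 4)
```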
Unlike the self-attention mechanism of transformers, conventional SSMs have no inherent ability to selectively focus on or ignore specific pieces of contextual information. So in 2023, Carnegie Mellon’s Albert Gu and Princeton’s Tri Dao extended structured state space sequence (“S4”) neural networks with a selection mechanism and a scan method (for computational efficiency), yielding what they abbreviated as an “S6” model, and achieved language modeling results competitive with transformers. They nicknamed their model “Mamba” because, among other reasons, all of those S’s sound like a snake’s hiss.
In 2024, Gu and Dao released Mamba-2, a simplified and optimized implementation of the Mamba architecture. Equally importantly, their technical paper fleshed out the compatibility between SSMs and self-attention.
Mamba’s major advantages over transformer-based models center on efficiency and speed.
Transformers have a crucial weakness: the compute requirements of self-attention scale quadratically with context. In other words, each time your context length doubles, the attention mechanism doesn’t just use double the resources—it uses quadruple the resources. This “quadratic bottleneck” increasingly throttles speed and performance as the context window (and corresponding KV-cache) grows.
Conversely, Mamba’s computational needs scale linearly: if you double the length of an input sequence, Mamba uses only double the resources. Whereas self-attention must repeatedly compute the relevance of every previous token to each new token, Mamba simply maintains a condensed, fixed-size “summary” of prior context from prior tokens. As the model “reads” each new token, it determines that token’s relevance, then updates (or doesn't update) the summary accordingly. Essentially, whereas self-attention retains every bit of information and then weights the influence of each based on their relevance, Mamba selectively retains only the relevant information.
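A toy comparison makes the difference in memory behavior easy to see. The attention step below appends every token’s key and value to a growing cache and rescores all of it, while the Mamba-style step folds each token into a fixed-size state through a simple input-dependent gate. Both are conceptual stand-ins, not the actual attention or Mamba-2 kernels used in Granite.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8

# Self-attention: every token's key/value is kept, so per-step work and
# memory grow with the number of tokens seen so far (quadratic overall).
def attention_step(kv_cache, x, Wq, Wk, Wv):
    kv_cache.append((Wk @ x, Wv @ x))              # cache grows by one entry
    q = Wq @ x
    scores = np.array([q @ k for k, _ in kv_cache])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return sum(w * v for w, (_, v) in zip(weights, kv_cache))

# Mamba-style update: a fixed-size state is gated and updated in place,
# so per-step work and memory are constant (linear overall). The sigmoid
# gate is a stand-in for Mamba's input-dependent selection.
def selective_step(state, x, Wg, Wu):
    gate = 1.0 / (1.0 + np.exp(-(Wg @ x)))         # how much to update
    return (1.0 - gate) * state + gate * (Wu @ x)

Wq, Wk, Wv, Wg, Wu = (rng.normal(scale=0.3, size=(d, d)) for _ in range(5))
kv_cache, state = [], np.zeros(d)
for x in rng.normal(size=(1000, d)):
    _ = attention_step(kv_cache, x, Wq, Wk, Wv)
    state = selective_step(state, x, Wg, Wu)

print(len(kv_cache), state.shape)  # 1000 cached entries vs. one (8,) vector
```

After 1,000 tokens, the attention path is holding 1,000 cached entries and rescoring all of them at every step, while the selective path is still holding a single fixed-size vector.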
That being said, transformers’ more memory-intensive and computationally redundant method has its own advantages. For instance, research has shown that transformers still outpace both Mamba and Mamba-2 on tasks requiring in-context learning (such as few-shot prompting), copying, or long-context reasoning.
Fortunately, the respective strengths of transformers and Mamba are not mutually exclusive. In the original Mamba-2 paper itself, authors Dao and Gu suggest that a hybrid model could exceed the performance of a pure transformer or SSM—a notion validated by NVIDIA research from last year. To explore this further, IBM Research collaborated with Dao and Gu themselves, along with the University of Illinois at Urbana-Champaign (UIUC)’s Minjia Zhang, on Bamba and Bamba V2. Bamba, in turn, informed many of the architectural elements of Granite 4.0.
The Granite 4.0 MoE architecture employs 9 Mamba blocks for every 1 transformer block. In essence, the selectivity mechanisms of the Mamba blocks efficiently capture global context, which is then passed to transformer blocks that enable a more nuanced parsing of local context. The result is a dramatic reduction in memory usage and latency with no apparent tradeoff in performance.
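As a rough sketch of what a 9:1 ratio looks like in practice, the snippet below lays out a hypothetical layer pattern. The block names and the total depth of 40 layers are illustrative assumptions, not Granite 4.0’s published layer count.

```python
# Illustrative layer layout for a 9:1 Mamba/transformer hybrid stack.
# "mamba" and "attention" are placeholders for the real modules, and the
# total depth of 40 is an arbitrary example.

def build_hybrid_layout(n_layers=40, mamba_per_attention=9):
    layout = []
    for i in range(n_layers):
        # every (mamba_per_attention + 1)-th block is self-attention
        if (i + 1) % (mamba_per_attention + 1) == 0:
            layout.append("attention")
        else:
            layout.append("mamba")
    return layout

layout = build_hybrid_layout()
print(layout.count("mamba"), layout.count("attention"))  # 36 mamba, 4 attention
print(layout[:10])  # nine mamba blocks followed by one attention block
```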
Granite 4.0 Tiny doubles down on these efficiency gains by implementing them within a compact, fine-grained mixture of experts (MoE) framework, comprising 7B total parameters and 64 experts, yielding 1B active parameters at inference time. Further details are available in Granite 4.0 Tiny Preview’s Hugging Face model card.
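The routing idea behind a fine-grained MoE can also be sketched briefly: each token is scored against all experts but dispatched to only a handful of them, which is how total parameters can be several times larger than the parameters active per token. The expert width and the number of experts routed per token below are assumptions for illustration; the model card has the real configuration.

```python
import numpy as np

# Minimal top-k mixture-of-experts routing: each token is sent to only a
# few of the 64 experts, so only a fraction of the total parameters do
# any work per token. Expert width and top_k are illustrative, not
# Granite 4.0 Tiny's published configuration.
rng = np.random.default_rng(0)
d_model, n_experts, top_k = 64, 64, 4

router = rng.normal(scale=0.1, size=(n_experts, d_model))
experts = [
    (rng.normal(scale=0.1, size=(128, d_model)),   # up projection
     rng.normal(scale=0.1, size=(d_model, 128)))   # down projection
    for _ in range(n_experts)
]

def moe_forward(x):
    logits = router @ x
    chosen = np.argsort(logits)[-top_k:]            # top-k experts only
    gates = np.exp(logits[chosen] - logits[chosen].max())
    gates /= gates.sum()
    out = np.zeros_like(x)
    for g, idx in zip(gates, chosen):
        up, down = experts[idx]
        out += g * (down @ np.maximum(up @ x, 0.0)) # tiny expert MLP
    return out

y = moe_forward(rng.normal(size=d_model))
print(y.shape)  # (64,): only 4 of the 64 experts contributed
```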
One of the more tantalizing aspects of SSM-based language models is the theoretical ability to handle infinitely long sequences. But due to practical constraints, the word “theoretical” typically does a lot of heavy lifting.
One of those constraints, especially for hybrid-SSM models, comes from the positional encoding (PE) used to represent information about the order of words. PE adds computational steps, and research has shown that models using PE techniques such as rotary positional encoding (RoPE) struggle to generalize to sequences longer than what they’ve seen in training.3
The Granite 4.0 architecture uses no positional encoding (NoPE). Our testing demonstrates convincingly that this has had no adverse effect on long-context performance. At present, we have already validated Tiny Preview’s long-context performance for at least 128K tokens, and expect to validate similar performance on significantly longer context lengths by the time the model has completed training and post-training. It’s worth noting that a key challenge in definitively validating performance on tasks in the neighborhood of 1M-token context is the scarcity of suitable datasets.
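For readers unfamiliar with the distinction, the sketch below contrasts an attention score computed with no positional encoding against one computed with a generic rotary embedding; with NoPE, the rotation step is simply skipped. The RoPE shown is a textbook version, not Granite’s or any particular library’s implementation.

```python
import numpy as np

# With NoPE, attention scores come directly from content (q . k).
# With RoPE, queries and keys are first rotated by a position-dependent
# angle, so the score depends on the tokens' relative positions.
def rope(vec, pos, base=10000.0):
    half = vec.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)
    angles = pos * freqs
    x1, x2 = vec[:half], vec[half:]
    return np.concatenate([x1 * np.cos(angles) - x2 * np.sin(angles),
                           x1 * np.sin(angles) + x2 * np.cos(angles)])

rng = np.random.default_rng(0)
q, k = rng.normal(size=64), rng.normal(size=64)

score_nope = q @ k                            # position plays no role
score_rope = rope(q, pos=5) @ rope(k, pos=2)  # depends on relative position
print(score_nope, score_rope)
```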
The other practical constraint on Mamba context length is compute. Linear scaling is better than quadratic scaling, but it still adds up eventually. Here again, Granite 4.0 Tiny has two key advantages: its hybrid architecture keeps memory usage and latency low even as context grows, and its MoE design activates only 1B of its 7B parameters per token, keeping the per-token compute cost small.
Put simply, the Granite 4.0 MoE architecture itself places no constraints on context length. It can go as far as your hardware will take you.
We’re excited to continue pre-training Granite 4.0 Tiny, given such promising results so early in the process. We’re also excited to apply our learnings from post-training Granite 3.3, particularly with regard to reasoning capabilities and complex instruction following, to the new models. Like its predecessors in Granite 3.2 and Granite 3.3, Granite 4.0 Tiny Preview offers toggleable reasoning, letting developers switch the model’s extended “thinking” on or off depending on the task at hand.
More information about new developments in the Granite Series will be presented at IBM Think 2025, as well as in the weeks and months to follow.
1. For instance, the theoretical RAM consumption for 5 concurrent sessions at up to 128K context length is suitable for an NVIDIA GeForce RTX 3060 GPU with 12GB of RAM, which—as of 29 April 2025—starts at $329. (Source: NVIDIA).
2. Memory reduction calculated at 128K context length and 16 concurrent sessions.
3. "The Impact of Positional Encoding on Length Generalization in Transformers," arXiv, 6 November 2023