Date: 2025-05-02

One of the more tantalizing aspects of SSM-based language models is the theoretical ability to handle infinitely long sequences. But due to practical constraints, the word “theoretical” typically does a lot of heavy lifting.

One of those constraints, especially for hybrid-SSM models, comes from the positional encoding (PE) used to represent information about the order of words. PE adds computational steps, and research has shown that models using PE techniques such as rotary positional encoding (RoPE) struggle to generalize to sequences longer than what they’ve seen in training.[3]
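To make “PE adds computational steps” concrete, here is a minimal sketch of a RoPE-style rotation next to a score computed with no positional encoding at all. It is a generic, textbook-style illustration in plain Python, not Granite’s implementation: the function names, the toy vectors, and the base of 10000 are all assumptions chosen for the example.

```python
import math

def rope_rotate(vec, position, base=10000.0):
    """Rotate consecutive pairs of `vec` by position-dependent angles (RoPE-style)."""
    d = len(vec)
    if d % 2:
        raise ValueError("this sketch needs an even embedding dimension")
    out = []
    for i in range(0, d, 2):
        # Each pair gets its own frequency; the angle grows with the position.
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def score_rope(q, k, q_pos, k_pos):
    """Attention score with RoPE: rotate query and key first, then take the dot product."""
    return dot(rope_rotate(q, q_pos), rope_rotate(k, k_pos))

def score_nope(q, k):
    """Attention score with no positional encoding: just the dot product."""
    return dot(q, k)

# Hypothetical toy vectors, for illustration only.
q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]
print(score_rope(q, k, q_pos=5, k_pos=2))
print(score_rope(q, k, q_pos=103, k_pos=100))  # same offset of 3, so the same score up to rounding
print(score_nope(q, k))
```

The RoPE path does extra trigonometric work on every query and key, and its scores depend on the offset between positions. Offsets larger than anything present in the training data produce rotation angles the model has never had to interpret, which is one plausible reading of the generalization problem described above. The NoPE path skips that step entirely.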

The Granite 4.0 architecture uses no positional encoding (NoPE). Our testing demonstrates convincingly that this has had no adverse effect on long-context performance. We have already validated Tiny Preview’s long-context performance for at least 128K tokens, and we expect to validate similar performance on significantly longer context lengths by the time the model has completed training and post-training. It’s worth noting that a key challenge in definitively validating performance on tasks in the neighborhood of 1M-token context is the scarcity of suitable datasets.

The other practical constraint on Mamba context length is compute. Linear scaling is better than quadratic scaling, but it still adds up eventually. Here again, Granite 4.0 Tiny has two key advantages:

Unlike PE, NoPE doesn’t add any additional computational burden to the attention mechanism in the model’s transformer layers.

Granite 4.0 Tiny is extremely compact and efficient, leaving plenty of hardware headroom to absorb the linearly growing cost of longer contexts.
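The difference between linear and quadratic scaling is easy to see with a back-of-the-envelope count. The sketch below is a deliberately simplified cost model, not a measurement of Granite 4.0: it ignores constant factors, hidden sizes, and hardware details, and the function names are hypothetical. It treats full self-attention as one interaction per pair of tokens and an SSM scan as one state update per token.

```python
def quadratic_cost(n_tokens):
    """Pairwise token interactions in full self-attention (constants ignored)."""
    return n_tokens * n_tokens

def linear_cost(n_tokens):
    """One recurrent state update per token, as in an SSM scan (constants ignored)."""
    return n_tokens

# Compare how each count grows as the context length increases.
for n in (128_000, 256_000, 1_024_000):
    ratio = quadratic_cost(n) // linear_cost(n)
    print(f"{n:>9} tokens: linear={linear_cost(n):,} quadratic={quadratic_cost(n):,} ratio={ratio:,}")
```

Under this model, doubling the context doubles the linear count but quadruples the quadratic one, so the gap between the two widens in direct proportion to the context length. Linear cost still adds up eventually, which is why spare hardware capacity matters.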

Put simply, the Granite 4.0 MoE architecture itself places no constraints on context length. It can go as far as your hardware will take you.