Mamba is a neural network architecture, derived from state space models (SSMs), used for language modeling and other sequence modeling tasks. The Mamba architecture’s fast inference speed and computational efficiency, particularly for long sequences, make it the first competitive alternative to the transformer architecture for autoregressive large language models (LLMs).
Mamba models are perhaps the first deep learning architecture to rival the efficacy of transformer models on the task for which transformers originally won their fame: language modeling. Most notably, the Mamba architecture has demonstrated the capacity to match equivalently sized transformers on prominent LLM benchmark evaluations while often being significantly more efficient in terms of latency and memory requirements.
The Mamba architecture was first introduced by Tri Dao and Albert Gu in the 2023 paper, “Mamba: Linear-Time Sequence Modeling with Selective State Spaces.” A year later, they followed up the original Mamba paper with another paper that both explored the connections between SSMs and transformers and presented a refined, significantly faster version of the Mamba architecture, which they dubbed Mamba-2.
Though transformers have remained the dominant LLM architecture in the 2 years following the release of the original Mamba paper, Mamba has been incorporated into a growing number of open source models. Some, such as Mistral AI’s Codestral Mamba, are pure Mamba models. Many more, including AI21 Labs’ Jamba series and IBM Granite 4.0, are hybrid models incorporating both attention (transformer) layers and SSM (Mamba) layers. In addition to their performance-based benefits, the proliferation of Mamba-based models promises to democratize AI access by virtue of running smoothly on comparatively inexpensive hardware.
SSMs were originally designed to predict the next state of a continuous sequence, like an electrical signal, a weather pattern or the trajectory of a moving object, based on some input. Conceptually and mathematically, they’re related to the recurrent neural networks (RNNs) that dominated natural language processing (NLP) prior to the introduction of transformers in 2017, as well as to other machine learning algorithms including convolutional neural networks (CNNs) and hidden Markov models (HMMs).
As their name suggests, SSMs make predictions about the next state in a dynamic system by modeling the state space: a mathematical representation of all the state variables that describe the state of a system and the range of possibilities for each of those variables in tandem with one another.
An SSM takes an input sequence x(t) and maps it to a latent state representation h(t), analogous to the hidden state of an RNN, in order to predict an output sequence y(t). At the core of any SSM are 2 equations: the state equation, h′(t) = A·h(t) + B·x(t), and the output equation, y(t) = C·h(t) + D·x(t).
The key parameters of the model are A, B, C and D, each of which typically takes the form of a matrix of weights. In the fields where SSMs are conventionally used, such as control theory, these matrices are often assumed to be fixed: they represent the dynamics of an established system, and the SSM is used to find the inputs x that lead to desirable outputs y. In more modern conceptions of SSMs, those matrices are themselves parameters to be optimized through machine learning. In deep learning models, those matrices are represented by the learnable weights of a neural network.
The state equation describes how the state changes. The values in matrix A determine how each state variable evolves over time if left to itself. The values in matrix B determine how the input—such as the next token in a text sequence—influences each state variable.
In language modeling, the current state represents the context of a text sequence, updated after each token. Its role is equivalent to that of the KV cache in a transformer model.
The output equation describes how the current state influences the output (as mediated by matrix C), as well as how the input influences the output directly (as mediated by matrix D). Because matrix D is essentially external to the modeling of h(t) itself, it’s often omitted from diagrams and discussions of SSMs in favor of focusing on the core matrices A, B and C.
In a Mamba LLM, the output equation is used to generate the next token.
Traditional SSMs are designed to model continuous inputs, but text sequences (and most other data modalities processed by modern deep learning models) are discrete inputs. Using SSMs to model a discrete sequence requires a means to represent its distinct, specific timesteps as part of a continuous signal.
Conceptually, discretization amounts to sampling the value of a continuous function at specific moments. This entails the introduction of a new parameter—the step size, written as ∆—that determines how long that value is sampled or “held” at each discrete time step t. Adjustments to ∆ are akin to changes to qualities such as the data’s resolution (for time series data) or frame rate (for video data). There exist multiple “discretization” methods, but most modern SSM variants (including Mamba) use the simple zero order hold (ZOH) method.
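As a rough illustration of zero-order hold, the sketch below discretizes a single scalar entry of A and B, which is also how a diagonal A can be handled entry by entry. The function name `zoh_discretize` and the sample values are assumptions made for this example, not code from the Mamba or S4 repositories.

```python
import math

def zoh_discretize(a: float, b: float, delta: float) -> tuple[float, float]:
    """Zero-order hold for one scalar (or one diagonal entry of a) continuous SSM.

    a_bar = exp(delta * a)
    b_bar = (exp(delta * a) - 1) / a * b
    """
    a_bar = math.exp(delta * a)
    if a == 0.0:
        b_bar = delta * b  # limit of the formula above as a approaches 0
    else:
        b_bar = (a_bar - 1.0) / a * b
    return a_bar, b_bar

# Illustrative values only: with a negative `a`, a smaller step size keeps
# more of the previous state (a_bar near 1) and admits less of the new input.
for delta in (0.01, 0.1, 1.0):
    a_bar, b_bar = zoh_discretize(a=-1.0, b=1.0, delta=delta)
    print(f"delta={delta:<5} a_bar={a_bar:.4f} b_bar={b_bar:.4f}")
```

The printed values show the trade-off that ∆ controls: a small step barely changes the state, while a large step mostly overwrites it with the current input.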
Discretizing an SSM enables it to be used like an RNN for sequence-to-sequence tasks. The parameters and equations of a discretized SSM are usually rewritten to distinguish them from their continuous-time equivalents, using the subscript notation typically employed for RNNs: h_t = Ā·h_{t-1} + B̄·x_t and y_t = C·h_t, where Ā and B̄ are the discretized versions of A and B. In this notation, h_t represents the updated state the model will generate and h_{t-1} represents the state before it (that is, the current state).
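The RNN-like behavior of a discretized SSM can be sketched in a few lines. The example assumes a diagonal state transition (one scalar per state variable) and a single scalar input channel. The names `ssm_step` and `run_ssm` and the parameter values are hypothetical, chosen only to make the update rule concrete.

```python
def ssm_step(h_prev, x_t, a_bar, b_bar, c):
    """One recurrent update: h_t = a_bar * h_{t-1} + b_bar * x_t, y_t = c . h_t."""
    h_t = [a * h + b * x_t for a, h, b in zip(a_bar, h_prev, b_bar)]
    y_t = sum(c_n * h_n for c_n, h_n in zip(c, h_t))
    return h_t, y_t

def run_ssm(xs, a_bar, b_bar, c):
    """Process a sequence one step at a time, carrying only the state forward."""
    h = [0.0] * len(a_bar)
    ys = []
    for x_t in xs:
        h, y_t = ssm_step(h, x_t, a_bar, b_bar, c)
        ys.append(y_t)
    return ys, h

# Illustrative parameters: two state variables that decay at different rates.
a_bar = [0.9, 0.5]
b_bar = [0.1, 0.5]
c = [1.0, 1.0]
ys, h_final = run_ssm([1.0, 0.0, 0.0], a_bar, b_bar, c)
print(ys)
```

Only the current state is carried between steps, so the cost of producing each new output does not depend on how long the sequence already is.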
Modeling text data using standard discrete SSMs is impractical due to a number of shortcomings that they share with RNNs. Two of those shortcomings were addressed by the introduction of structured state space sequence models (or “S4 models”) by Albert Gu et al. in 2021: the inefficiency of their training and their inability to model long sequences.
The success of S4 models and their many derivatives, such as diagonal SSMs (DSS), diagonal S4 (S4D) and H3 models, directly paved the way for what became Mamba.
The benefit of discretized SSMs being the equivalent of a specific instance of an RNN is that RNNs are extremely fast at inference. The downside, however, is that RNNs are extremely slow to train.
Fortunately, discretized SSMs have one important property distinguishing them from other RNNs: they exclusively model linear dependencies. In other words, they use only simple, straightforward multiplication and addition operations. As the S4 paper demonstrates, these simple, repeated and interdependent linear recurrences can be unrolled into a 1-dimensional convolution kernel, K̄ = (C·B̄, C·Ā·B̄, …, C·Ā^(L−1)·B̄) for a sequence of length L, that directly maps input x to output y in a single step: y = x ∗ K̄. This can be computed very efficiently using the fast Fourier transform.
The only “catch” is that this is only possible when every step of the entire input sequence is known. This isn’t possible during inference, but it is the case during training. A structured SSM therefore enjoys the best of both worlds: during training, it can be operated very efficiently as a CNN; during inference, it can be operated very efficiently as an RNN.
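The equivalence between the two modes can be checked numerically. The sketch below assumes a scalar state for brevity and evaluates the convolution directly rather than with a fast Fourier transform, so it demonstrates that the two views agree, not the speedup. All function names and values are illustrative.

```python
def ssm_kernel(a_bar, b_bar, c, length):
    """Unrolled kernel: K[k] = c * a_bar**k * b_bar for k = 0 .. length - 1."""
    return [c * (a_bar ** k) * b_bar for k in range(length)]

def ssm_as_convolution(xs, a_bar, b_bar, c):
    """Training-style view: a causal convolution over the whole known sequence."""
    kernel = ssm_kernel(a_bar, b_bar, c, len(xs))
    return [
        sum(kernel[k] * xs[t - k] for k in range(t + 1))
        for t in range(len(xs))
    ]

def ssm_as_recurrence(xs, a_bar, b_bar, c):
    """Inference-style view: one state update per incoming step."""
    h, ys = 0.0, []
    for x_t in xs:
        h = a_bar * h + b_bar * x_t
        ys.append(c * h)
    return ys

xs = [1.0, 2.0, -1.0, 0.5]
conv_out = ssm_as_convolution(xs, a_bar=0.9, b_bar=0.1, c=2.0)
rec_out = ssm_as_recurrence(xs, a_bar=0.9, b_bar=0.1, c=2.0)
print(all(abs(u - v) < 1e-12 for u, v in zip(conv_out, rec_out)))
```

Every output of the convolutional view depends only on inputs that are already known, which is why it can be evaluated for all positions at once during training.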
Like most RNNs, standard SSMs are inherently weak at modeling long-distance dependencies. In other words, they aren’t good at understanding the relationship between steps in a sequence that are far apart, such as words at the beginning and end of a paragraph—which makes them weak at modeling long sequences altogether.
To solve this, Gu and his co-authors used a technique called HiPPO (short for High-order Polynomial Projection Operators, and introduced in an earlier paper that Gu co-authored with Tri Dao, among others) to define the way the A and B matrices behave by structuring their initial values using a formula derived from orthogonal polynomials. This is in contrast to standard machine learning practice, in which model weights are randomly initialized at the onset of model training. For S4, the authors proposed initialization schemes derived from Legendre polynomials. Gu and colleagues explored additional formulae in a follow-up paper, titled “How to Train Your HiPPO” [1].
The S4 paper notes that “simply modifying an SSM from a random matrix A to [the HiPPO Matrix] improved its performance on the sequential MNIST benchmark from 60% to 98%,” effectively solving SSMs’ long-term memory problem. Later variations of structured SSMs, such as DSS, S5 and Mamba, use different (often simpler) initialization schemes for A and B that nevertheless retain the core HiPPO principles: implementing a diagonal structure that imposes stable updates and some degree of independence between each value in the matrix.
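For a concrete sense of what a Legendre-derived initialization looks like, the sketch below builds the HiPPO-LegS matrix following the formula given in the S4 paper: entries below the diagonal are -sqrt(2n+1)·sqrt(2k+1), diagonal entries are -(n+1), and entries above the diagonal are zero. The function name `hippo_legs` is an assumption for this example.

```python
import math

def hippo_legs(n_state):
    """Build the HiPPO-LegS A matrix and B vector for a state of size n_state."""
    a = [[0.0] * n_state for _ in range(n_state)]
    for n in range(n_state):
        for k in range(n_state):
            if n > k:
                a[n][k] = -math.sqrt(2 * n + 1) * math.sqrt(2 * k + 1)
            elif n == k:
                a[n][k] = -(n + 1.0)
            # entries with n < k stay at zero (lower-triangular structure)
    b = [math.sqrt(2 * n + 1) for n in range(n_state)]
    return a, b

a, b = hippo_legs(4)
for row in a:
    print(["{:7.3f}".format(value) for value in row])
```

The strictly negative diagonal is what keeps the state updates stable. The simpler diagonal initializations used by later models keep that property while dropping the off-diagonal entries.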
At the core of the Mamba architecture are two innovations. The first is the selective state space model, which provides Mamba with a crucial capability previously possessed only by transformer models: the ability to selectively focus on or ignore specific parts of past input history based on their present relevance. The other is the hardware-aware parallel scan, an algorithm that optimizes the way a graphics processing unit (GPU) handles the model’s computations in its memory hierarchy to maximize speed and computational efficiency.
In transformers, this ability is provided by the attention mechanism, whose attention weights emphasize or de-emphasize the influence of each previous token based on its relevance to the current input token. Ordinary SSMs, by contrast, are explicitly designed to map input to output using the entire input history. This is acceptable or even desirable for some sequence modeling tasks, but a significant handicap for most advanced language modeling tasks.
To remedy this inability to dynamically omit or emphasize specific parts of their input history, Dao and Gu proposed a new class of state space models with a “selective scan.” In the Mamba paper, the authors remark that they “sometimes abbreviate selective SSMs as S6 models, because they are S4 models with a selection mechanism and computed with a scan.” They nicknamed their S6-based architecture “Mamba” because, among other reasons, all of those S’s sound like a snake’s hiss.
Mamba can best be understood as a neural network architecture that contains the selective state space model at its core. For a simple analogy, Mamba is to selective SSMs as the transformer model is to the attention mechanism.
A traditional SSM has fixed dynamics: the rules governing how the hidden state evolves from one step to the next—the model parameters—are the same for every input and at every step in the sequence. This property is known as linear time invariance (LTI). To provide SSMs with the ability to selectively prioritize or deprioritize specific past information based on present context, Dao and Gu reconfigured their SSM such that the values of key model parameters will be different for different inputs.
More specifically, selective SSMs make the step size ∆_t and the matrices B_t and C_t direct functions of the current input token x_t. This is achieved by first passing the vector embedding of x_t through three parallel linear projection layers, which are standard feedforward neural network layers (or MLP layers). This is analogous to how the parallel query, key and value heads generate an input’s respective Q, K and V vectors in a transformer model.
Multiplying x_t’s vector embedding by the weight terms of those linear projection layers, and adding their bias terms, yields the resulting values of ∆_t, B_t and C_t. The weight and bias terms of the linear projection layers themselves are learned during model pretraining on massive datasets of text samples, then (optionally) refined through subsequent fine-tuning.
Notably, no such input-based adjustments are made to the A matrix. Its role remains the same as in S4 models: to efficiently memorize the entire history of past inputs. The role of determining which parts of that history to utilize at a given moment is handled by the B and C matrices.
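A single selective update can be sketched as follows. This is a simplified illustration under stated assumptions, not the reference implementation:

- ∆ is passed through a softplus to keep it positive.
- A is a fixed table with one diagonal set of entries per embedding channel.
- B is discretized with the simplified rule ∆·B rather than full zero-order hold.
- The projection for ∆ is a plain square matrix.

All names (`selective_step`, `w_delta`, `w_b`, `w_c`) and all values are hypothetical.

```python
import math

def matvec(w, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in w]

def softplus(v):
    return math.log1p(math.exp(v))

def selective_step(h_prev, x_t, a, w_delta, bias_delta, w_b, w_c):
    """One selective-SSM update for an embedding of size D and state size N.

    h_prev and a are D x N tables; a is fixed, while delta, B and C
    are recomputed from x_t at every step.
    """
    delta = [softplus(v + bias) for v, bias in zip(matvec(w_delta, x_t), bias_delta)]
    b_t = matvec(w_b, x_t)  # length N, depends on the current input
    c_t = matvec(w_c, x_t)  # length N, depends on the current input

    h_t, y_t = [], []
    for d, x_d in enumerate(x_t):
        row = []
        for n in range(len(b_t)):
            a_bar = math.exp(delta[d] * a[d][n])  # input-dependent via delta only
            b_bar = delta[d] * b_t[n]             # simplified discretization (assumption)
            row.append(a_bar * h_prev[d][n] + b_bar * x_d)
        h_t.append(row)
        y_t.append(sum(c_n * h_n for c_n, h_n in zip(c_t, row)))
    return h_t, y_t

# Tiny illustrative setup: D = 2 channels, N = 2 state variables per channel.
a = [[-1.0, -2.0], [-1.0, -2.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
h0 = [[0.0, 0.0], [0.0, 0.0]]
h1, y1 = selective_step(h0, [1.0, -1.0], a, zeros, [0.0, 0.0], identity, identity)
print(y1)
```

Although A itself is never recomputed, its discretized counterpart still varies with the input through ∆: a large ∆ shrinks the carried-over state and lets the current token dominate, while a small ∆ leaves the state nearly untouched.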
But once the model is no longer time-invariant, it can no longer use the convolution shortcut during training because the transition kernel is no longer constant: the crux of the selectivity mechanism is that the transition from h_{t-1} to h_t is now dependent on context.
Instead, Mamba uses a clever workaround to achieve similar parallelization benefits. Because the SSM uses only multiplication and addition, its computations are subject to the familiar associative property of math: they can be grouped in different ways without changing the final outcome. This allows the many sequential calculations to be broken down into small, independent chunks that can be processed in parallel by a GPU through a parallel prefix sum scan.
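The associativity argument can be made concrete. Each step of the recurrence h_t = a_t·h_{t-1} + b_t is a pair (a_t, b_t), and two consecutive steps compose into a single pair of the same form. The sketch below uses a Hillis-Steele style scan, chosen here only for brevity. It is an assumption for illustration and not the fused GPU kernel that Mamba uses.

```python
def combine(left, right):
    """Compose two steps h -> a*h + b, applying `left` first and `right` second."""
    a1, b1 = left
    a2, b2 = right
    return (a1 * a2, a2 * b1 + b2)

def sequential_scan(steps):
    """Reference: apply the recurrence one step at a time, starting from h = 0."""
    h, out = 0.0, []
    for a_t, b_t in steps:
        h = a_t * h + b_t
        out.append(h)
    return out

def parallel_style_scan(steps):
    """Inclusive prefix scan in about log2(L) rounds.

    Within a round, every position is updated independently of the others,
    which is what would let a GPU process them simultaneously.
    """
    prefix = list(steps)
    offset = 1
    while offset < len(prefix):
        prefix = [
            combine(prefix[i - offset], prefix[i]) if i >= offset else prefix[i]
            for i in range(len(prefix))
        ]
        offset *= 2
    # Starting from h = 0, the state after step t is the b-part of the prefix.
    return [b for _, b in prefix]

# Illustrative, input-dependent (a_t, b_t) pairs.
steps = [(0.9, 1.0), (0.5, 2.0), (0.8, -1.0), (0.1, 0.5), (0.7, 3.0)]
seq = sequential_scan(steps)
par = parallel_style_scan(steps)
print(all(abs(u - v) < 1e-12 for u, v in zip(seq, par)))
```

Because the pairs can differ at every position, this approach works even when the dynamics depend on the input, which is exactly the case the convolution shortcut cannot handle.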
Furthermore, the results are combined in a specific hierarchical manner that makes optimally efficient use of the different kinds of hardware memory on a GPU, using principles similar to the FlashAttention techniques—which were also developed by Tri Dao—that are now ubiquitous in modern LLMs.
Within the Mamba architecture, the S6 model serves as a module of the larger “Mamba block,” similarly to how the attention mechanism serves as a module within the larger “attention block.” It combines the S6 module with a gated neural network architecture. Mamba models typically comprise multiple Mamba blocks—that is, a series of consecutive Mamba layers in a neural network—prior to the output layer that makes the model’s final output prediction.
Before the input enters the Mamba block, a copy of it is sent directly to the end as a residual connection. The purpose of the Mamba block’s inner workings is not only to determine which parts of the greater context are relevant to that input, but also to determine how much that contextual information should modify the input’s original meaning.
Within the Mamba block, the original input vector is processed as follows:
A year after the original Mamba paper, Dao and Gu followed it up with “Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality.” This follow-up paper offered three major contributions:
The Mamba-2 algorithm is significantly faster and easier to implement than the original Mamba: the authors provided a “minimal SSD” code base that implements the selective SSM in about 25 lines of code [2]. This efficiency enables Mamba-2 to use much larger hidden state dimensions without slowing the model down, allowing for larger, more powerful, more expressive models built with the architecture. In testing, Mamba-2 models definitively matched or outperformed correspondingly sized Mamba and transformer models on a series of downstream tasks.
As the paper’s introduction states, Dao and Gu’s “main goal [was] to develop a rich body of theoretical connections between structured SSMs and variants of attention.” This yielded a new conceptual framework uniting the two, which they called “state space duality” (SSD) [3]. In doing so, they opened the door for Mamba to benefit from several years’ worth of exploration and optimization of the transformer architecture.
One notable benefit was the development of a Mamba equivalent of multi-head attention (MHA), in which a Mamba block can be split into multiple “Mamba heads” akin to the multiple “attention heads” in transformers. One variant of this approach, which they deemed analogous to grouped query attention, enables even more efficiency through tensor parallelism in GPUs.
In the Mamba-2 block, which they call the parallel Mamba block (as opposed to the original “sequential” Mamba block), the input-dependent parameters ∆, B and C are generated in parallel at the initial projection layer. B and C, specifically, are derived by simply copying portions of xproj, rather than by multiplying xproj through dedicated linear layers. In addition to simplifying the model and reducing its total parameter count, this parallelism enables significantly more efficient large-scale training [4].
Both Mamba and transformers have their own respective strengths, but Mamba-based models generally hold the advantage in memory usage and speed: per the Mamba paper, Mamba offers 5 times greater inference throughput than equivalent transformers.
Transformers are incredibly precise and versatile, but also incredibly demanding on computational resources. During pre-training (and fine-tuning), the memory requirements of self-attention scale quadratically with sequence length: if you double the context length of a sequence, the attention mechanism uses quadruple the resources. This “quadratic bottleneck” increasingly throttles speed and memory availability as the context window grows. During inference, transformers’ memory needs scale linearly with context length, because the KV cache grows with every token processed.
During training, the memory usage of a Mamba model scales only linearly with sequence length. More importantly, its memory usage during inference is constant: regardless of how many tokens the model has seen, the SSM maintains a fixed-size representation of its input history. This allows theoretically unlimited context length, constrained only by hardware limitations.
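The scaling behaviors described above can be compared with some back-of-the-envelope arithmetic. The layer counts and dimensions below are hypothetical values chosen for illustration, not the configuration of any particular model, and the SSM estimate counts only the recurrent state.

```python
def attention_score_entries(seq_len):
    """Entries in one full self-attention score matrix: quadratic in length."""
    return seq_len * seq_len

def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Keys and values (the factor of 2) stored for every token at every layer."""
    return 2 * seq_len * n_layers * n_kv_heads * head_dim * bytes_per_value

def ssm_state_bytes(n_layers, d_inner, d_state, bytes_per_value=2):
    """One fixed-size state per layer, regardless of how many tokens were seen."""
    return n_layers * d_inner * d_state * bytes_per_value

# Hypothetical model dimensions, for illustration only.
config = dict(n_layers=32, n_kv_heads=32, head_dim=128)
for seq_len in (4_096, 8_192, 16_384):
    kv_mib = kv_cache_bytes(seq_len, **config) / 2**20
    ssm_mib = ssm_state_bytes(n_layers=32, d_inner=8_192, d_state=16) / 2**20
    print(f"{seq_len:>6} tokens: attention scores={attention_score_entries(seq_len):>12,} "
          f"KV cache={kv_mib:,.0f} MiB  SSM state={ssm_mib:,.0f} MiB")
```

Doubling the sequence length quadruples the attention score count and doubles the KV cache, while the SSM state column stays the same on every row.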
That being said, transformers’ more memory-intensive and computationally redundant method has its own advantages. For instance, research has shown that transformers still outpace both Mamba and Mamba-2 on tasks requiring in-context learning (such as few-shot prompting), copying, or long-context reasoning.
Fortunately, the respective strengths of transformers and Mamba are not mutually exclusive. The Mamba-2 paper suggests that a hybrid model could outperform either pure transformers or pure SSMs, a notion formally validated by NVIDIA research later in 2024 [5]. Broadly speaking, hybrid models seem to combine the efficiency benefits of Mamba with the nuance and in-context learning performance provided by transformers’ more resource-intensive attention mechanism.
To explore this further, IBM Research collaborated with Dao and Gu, along with Minjia Zhang of the University of Illinois at Urbana-Champaign (UIUC), on Bamba and Bamba V2. Bamba, in turn, has informed many of the architectural elements of IBM Granite 4.0.
Hybrid models remain an area of active research, particularly within the open source community.
1. “How to Train Your HiPPO: State Space Models with Generalized Orthogonal Basis Projections,” arXiv, 5 August 2022
2. “State Space Duality (Mamba-2) Part III – The Algorithm,” Goomba Lab, 31 May 2024
3. “Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality,” arXiv, 31 May 2024
4. Ibid.
5. “An Empirical Study of Mamba-based Language Models,” arXiv, 12 June 2024