Both Mamba and transformers have their own respective strengths, but Mamba-based models generally hold the advantage in memory usage and speed: per the Mamba paper, Mamba offers 5 times greater inference throughput than transformers of similar size.
Transformers are incredibly precise and versatile, but also incredibly demanding of computational resources. During pre-training (and fine-tuning), the compute and memory requirements of self-attention scale quadratically with sequence length: double the context length of a sequence and the attention mechanism needs roughly quadruple the resources. This “quadratic bottleneck” increasingly throttles speed and memory availability as the context window grows. During inference, a transformer’s memory needs still scale linearly, because the key-value cache grows with every token processed.
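To make that scaling concrete, here is a minimal sketch (assuming single-head attention, random weights, and no batching, so it is an illustration rather than an optimized implementation) showing that the attention score matrix grows quadratically with sequence length while the inference-time key-value cache grows only linearly:

```python
import numpy as np

def attention_memory_sketch(seq_len: int, d_model: int = 64):
    """Toy illustration of how self-attention's intermediate tensors
    grow with sequence length (random projections, single head)."""
    rng = np.random.default_rng(0)
    Q = rng.standard_normal((seq_len, d_model))
    K = rng.standard_normal((seq_len, d_model))
    V = rng.standard_normal((seq_len, d_model))

    scores = Q @ K.T / np.sqrt(d_model)   # (seq_len, seq_len): quadratic in seq_len
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    out = weights @ V                     # (seq_len, d_model): the attention output

    kv_cache_elems = K.size + V.size      # linear in seq_len (inference-time cache)
    return scores.size, kv_cache_elems

for n in (1_024, 2_048, 4_096):
    score_elems, cache_elems = attention_memory_sketch(n)
    print(f"seq_len={n:5d}  score-matrix elems={score_elems:>12,}  KV-cache elems={cache_elems:>10,}")
```

Doubling `seq_len` quadruples the number of elements in the score matrix but only doubles the key-value cache, mirroring the training-time and inference-time scaling described above.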
A Mamba model’s memory usage, by contrast, scales only linearly with sequence length during training. More importantly, its memory usage during inference is constant: regardless of how many tokens the model has seen, the SSM maintains a fixed-size representation of its input history. This allows theoretically unlimited context length, constrained only by hardware limitations.
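The sketch below shows the recurrent view of an SSM at inference time, with made-up dimensions and untrained parameters rather than Mamba’s actual selective-scan implementation. The point is that the only memory carried from step to step is a fixed-size state vector, no matter how many tokens have been processed:

```python
import numpy as np

def ssm_inference_sketch(num_tokens: int, d_state: int = 16, d_model: int = 64):
    """Toy recurrent SSM step: h_t = A h_{t-1} + B x_t, y_t = C h_t.
    The state h has a fixed size regardless of how many tokens have been seen."""
    rng = np.random.default_rng(0)
    A = rng.standard_normal((d_state, d_state)) * 0.01   # illustrative, untrained parameters
    B = rng.standard_normal((d_state, d_model)) * 0.01
    C = rng.standard_normal((d_model, d_state)) * 0.01

    h = np.zeros(d_state)                    # fixed-size summary of the entire input history
    for _ in range(num_tokens):
        x_t = rng.standard_normal(d_model)   # stand-in for the current token's embedding
        h = A @ h + B @ x_t                  # constant-cost state update
        y_t = C @ h                          # output for this token
    return h.size                            # memory carried between steps never grows

for n in (1_024, 8_192, 65_536):
    print(f"tokens processed={n:>7,}  carried state size={ssm_inference_sketch(n)}")
```

Each additional token costs the same fixed amount of work and leaves the carried state the same size, which is why context length is bounded only by hardware rather than by the architecture itself.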
That said, transformers’ more memory-intensive and computationally redundant approach has advantages of its own. For instance, research has shown that transformers still outperform both Mamba and Mamba-2 on tasks requiring in-context learning (such as few-shot prompting), copying, or long-context reasoning.