Mamba-2 vs Griffin vs RWKV-6: SSM Architecture Benchmark
The quadratic complexity of attention — $O(n^2)$ for sequence length $n$ — stopped being theoretical the moment context windows hit 128k tokens. State Space Models (SSMs) promise $O(n)$ complexity without sacrificing quality, and three architectures dominate 2026: Mamba-2, Griffin, and RWKV-6.
I benchmarked all three on the same 1.3B-parameter budget. The results challenged what I thought I knew about attention alternatives.

What Makes SSMs Different From Transformers
Transformers compute attention scores between every token pair. For a 10k-token sequence, that's 100M comparisons. SSMs instead maintain a fixed-size hidden state that gets updated sequentially:
$$ h_t = \bar{A}\,h_{t-1} + \bar{B}\,x_t $$
$$ y_t = C\,h_t $$
The matrices $\bar{A}, \bar{B}, C$ are learned, but crucially $h_t$ doesn't grow with sequence length. You process 10 tokens or 100k tokens with the same memory footprint.
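To make the recurrence concrete, here's a minimal NumPy sketch of that update loop. It's deliberately simplified: a single input channel, fixed random matrices, and none of the input-dependent ("selective") parameterization that Mamba-2, Griffin, and RWKV-6 actually layer on top — the point is just that the state $h_t$ stays the same size no matter how long the sequence gets.

```python
# Minimal sketch of the SSM recurrence above (illustrative, not any specific architecture).
# Assumptions: one input channel, time-invariant A_bar/B_bar/C chosen at random.
import numpy as np

def ssm_scan(x, A_bar, B_bar, C):
    """Run h_t = A_bar @ h_{t-1} + B_bar * x_t, y_t = C @ h_t over a 1-D sequence."""
    d_state = A_bar.shape[0]
    h = np.zeros(d_state)               # hidden state: fixed size, independent of sequence length
    ys = []
    for x_t in x:                        # one sequential update per token
        h = A_bar @ h + B_bar * x_t      # state update
        ys.append(C @ h)                 # readout
    return np.array(ys)

# Toy usage: 16-dim state, scalar inputs. Memory stays constant whether
# the sequence has 10 tokens or 100k tokens.
rng = np.random.default_rng(0)
d_state = 16
A_bar = np.eye(d_state) * 0.9            # stable decay so the state doesn't blow up
B_bar = rng.normal(size=d_state)
C = rng.normal(size=d_state)

y = ssm_scan(rng.normal(size=1000), A_bar, B_bar, C)
print(y.shape)  # (1000,)
```

In practice the three architectures differ in how they compute these matrices per token and how they parallelize the scan on GPU, but the constant-memory recurrence is the shared core.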