State Space Models & Hybrid Architectures: Mamba-2, Jamba, Griffin
Canonical Papers
Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality
Jamba: A Hybrid Transformer-Mamba Language Model
RecurrentGemma: Moving Past Transformers for Efficient Open Language Models
Core Mathematics
SSMs replace global attention with recurrences/structured kernels, or mix both (local attention + recurrence) for long-context efficiency. Key: linear-time sequence modeling.
SSM recurrence:

$$h_t = A\,h_{t-1} + B\,x_t, \qquad y_t = C\,h_t$$

The state update is linear, so memory and per-token compute stay constant regardless of sequence length.
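A minimal NumPy sketch of the recurrence above, run as a sequential scan over a scalar input sequence (the function name, shapes, and toy parameters are illustrative choices, not taken from any of the papers):

```python
import numpy as np

def ssm_scan(A, B, C, x):
    """Run the linear SSM h_t = A h_{t-1} + B x_t, y_t = C h_t over scalar inputs.

    A: (d, d) state matrix, B: (d, 1) input map, C: (1, d) readout, x: (T,) inputs.
    Memory stays O(d) no matter how long the sequence is.
    """
    d = A.shape[0]
    h = np.zeros(d)
    y = np.empty(len(x))
    for t, x_t in enumerate(x):      # one constant-cost update per token
        h = A @ h + B[:, 0] * x_t
        y[t] = C[0] @ h
    return y

# Toy usage: 4-dimensional state, 1000-token sequence.
rng = np.random.default_rng(0)
A = 0.9 * np.eye(4)                  # stable, decaying dynamics
B, C = rng.normal(size=(4, 1)), rng.normal(size=(1, 4))
x = rng.normal(size=1000)
print(ssm_scan(A, B, C, x).shape)    # (1000,)
```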
Equivalent convolution/kernel view:

$$y_t = \sum_{s=0}^{t} C A^{\,t-s} B\, x_s, \qquad \text{i.e. } y = \bar{K} * x \text{ with } \bar{K} = (CB,\; CAB,\; CA^{2}B,\; \dots)$$

SSMs can thus be viewed as attention with a structured kernel: both compute weighted sums over the past, the SSM just parameterizes those weights through $A, B, C$.
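A short numerical check, under the same toy scalar setup as above (all names are illustrative), that the recurrent and convolutional views produce identical outputs:

```python
import numpy as np

def ssm_kernel(A, B, C, T):
    """Unrolled kernel K_k = C A^k B for k = 0..T-1: the convolution view of the SSM."""
    K, M = np.empty(T), np.eye(A.shape[0])
    for k in range(T):
        K[k] = (C @ M @ B)[0, 0]
        M = A @ M
    return K

rng = np.random.default_rng(1)
T = 256
A, B, C = 0.95 * np.eye(3), rng.normal(size=(3, 1)), rng.normal(size=(1, 3))
x = rng.normal(size=T)

# Recurrent view: h_t = A h_{t-1} + B x_t, y_t = C h_t.
h, y_scan = np.zeros(3), np.empty(T)
for t in range(T):
    h = A @ h + B[:, 0] * x[t]
    y_scan[t] = C[0] @ h

# Convolution view: y = K * x (causal), with K_k = C A^k B.
y_conv = np.convolve(x, ssm_kernel(A, B, C, T))[:T]

assert np.allclose(y_scan, y_conv)   # both views give the same outputs
```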
Hybrid gating intuition (generic template):

$$y_t = g_t \odot \mathrm{LocalAttn}(x)_t + (1 - g_t) \odot \mathrm{Recurrence}(x)_t, \qquad g_t = \sigma(W_g x_t)$$

This captures the spirit of Griffin-style "recurrence + local attention" hybrids: local attention handles short-range precision, the recurrence carries long-range state. The exact placement of gates and blocks differs across architectures.
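A PyTorch sketch of the generic template above: a per-channel gate mixes a sliding-window causal attention branch with a simple diagonal linear recurrence. This illustrates the pattern only; it is not Griffin's or Jamba's actual block (those interleave and parameterize the components differently), and `HybridMixer`, `window`, and the slow Python recurrence loop are assumptions made for clarity.

```python
import torch
import torch.nn as nn

class HybridMixer(nn.Module):
    """Generic gating template: y = g * LocalAttn(x) + (1 - g) * Recurrence(x)."""

    def __init__(self, d_model: int, window: int = 64):
        super().__init__()
        self.window = window
        self.gate = nn.Linear(d_model, d_model)
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.in_proj = nn.Linear(d_model, d_model)
        # Per-channel decay for the recurrence; sigmoid keeps it in (0, 1).
        self.decay_logit = nn.Parameter(torch.zeros(d_model))

    def local_attention(self, x):
        # Single-head causal attention restricted to a sliding window.
        B, T, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        scores = q @ k.transpose(-2, -1) / D ** 0.5
        idx = torch.arange(T, device=x.device)
        causal = idx[:, None] >= idx[None, :]
        local = (idx[:, None] - idx[None, :]) < self.window
        scores = scores.masked_fill(~(causal & local), float("-inf"))
        return scores.softmax(dim=-1) @ v

    def recurrence(self, x):
        # Diagonal linear recurrence h_t = a * h_{t-1} + u_t (slow reference loop).
        a = torch.sigmoid(self.decay_logit)
        u = self.in_proj(x)
        h, out = torch.zeros_like(u[:, 0]), []
        for t in range(u.shape[1]):
            h = a * h + u[:, t]
            out.append(h)
        return torch.stack(out, dim=1)

    def forward(self, x):
        g = torch.sigmoid(self.gate(x))
        return g * self.local_attention(x) + (1 - g) * self.recurrence(x)

# Usage: short-range syntax via windowed attention, long-range memory via the recurrence.
x = torch.randn(2, 128, 32)
print(HybridMixer(d_model=32)(x).shape)   # torch.Size([2, 128, 32])
```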
Why It Matters for Modern Models
- Long context (#30) exposes the transformer's Achilles heel (quadratic attention cost plus a KV cache that grows with sequence length); SSMs are the architectural escape hatch
- Mamba-2/SSD formalizes *Structured State-Space Duality*: SSMs and attention are dual views of the same computation, both compute weighted sums over the past, but SSMs do it via a linear recurrence (see the sketch after this list)
- Jamba: hybrid Transformer-Mamba layers plus MoE for capacity, reporting strong performance at context lengths up to 256K tokens; evidence that hybrids, not pure SSMs, currently dominate
- RecurrentGemma/Griffin: mix linear recurrences with local attention for efficiency + long-sequence suitability
- The reason SSMs now work for language is **selectivity** (input-dependent behavior), not just O(T) complexity; without it you get a bland smoothing kernel
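To make the duality point concrete, here is a toy NumPy sketch (scalar channel, not the papers' batched formulation): the SSM's output equals a multiply by a causally masked, structured matrix with entries M[t, s] = C A^(t-s) B, which plays the role of an attention score matrix.

```python
import numpy as np

rng = np.random.default_rng(2)
T, d = 64, 3
A = 0.9 * np.eye(d)
B, C = rng.normal(size=(d, 1)), rng.normal(size=(1, d))
x = rng.normal(size=T)

# Build the "attention-like" matrix M[t, s] = C A^(t-s) B for s <= t (0 above the diagonal).
M = np.zeros((T, T))
for t in range(T):
    for s in range(t + 1):
        M[t, s] = (C @ np.linalg.matrix_power(A, t - s) @ B)[0, 0]

# Duality check: applying the SSM is a causally masked matrix multiply, y = M x,
# structurally the same shape of computation as (unnormalized) attention.
h, y_scan = np.zeros(d), np.empty(T)
for t in range(T):
    h = A @ h + B[:, 0] * x[t]
    y_scan[t] = C[0] @ h
assert np.allclose(M @ x, y_scan)
```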
Missing Intuition
What is still poorly explained in textbooks and papers:
- SSMs can be taught as "attention with structured kernel"—both compute weighted sums over past, SSMs do it via recurrence/scan
- Reason "SSMs work for language now" is selectivity (input-dependent behavior)—otherwise you get smoothing that can't do sharp retrieval
- Hybrids exist because you want: local attention for short-range syntax + recurrence/SSM for long-range memory—neither alone is optimal
- Constant state memory is the key advantage: the KV cache grows with T, while the SSM state stays a fixed size, enabling in principle unbounded context
- Hardware friendliness is critical: a linear recurrence maps onto efficient parallel scans/cumsums, while attention needs custom kernels (FlashAttention) to be fast
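A toy sketch of the selectivity point: a hard threshold gate stands in for the learned, continuous input-dependent parameters of a real selective SSM (the threshold, token values, and positions are made up purely for illustration). A fixed decay smears the key token into an average, while the input-dependent gate latches onto it and holds it.

```python
import numpy as np

rng = np.random.default_rng(3)
T = 200
x = np.clip(rng.normal(size=T), -2.0, 2.0)   # background tokens bounded in [-2, 2]
key_pos = 42
x[key_pos] = 5.0                              # the one token we want to "retrieve" later

# Non-selective SSM: a fixed decay makes the state a smoothing kernel over the past,
# so by the end of the sequence the key token has been averaged away.
a = 0.9
h = 0.0
for t in range(T):
    h = a * h + (1 - a) * x[t]
print(f"fixed-decay state: {h:+.3f}")

# Selective (input-dependent) SSM, toy version: the write gate depends on the input,
# so the state can latch onto the key token and hold it indefinitely.
h = 0.0
for t in range(T):
    write = 1.0 if abs(x[t]) > 3.0 else 0.0   # only the key token opens the gate
    h = (1.0 - write) * h + write * x[t]
print(f"selective state:   {h:+.3f}  (key token was {x[key_pos]:+.1f})")
```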