#31 Core Training

🔀 State Space Models & Hybrid Architectures: Mamba-2, Jamba, Griffin

Canonical Papers

Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality

Dao & Gu, 2024, ICML

Jamba: A Hybrid Transformer-Mamba Language Model

AI21 Labs, 2024, arXiv

RecurrentGemma: Moving Past Transformers for Efficient Open Language Models

Google DeepMind, 2024, Technical Report

Core Mathematics

SSMs replace global attention with recurrences/structured kernels, or mix both (local attention + recurrence) for long-context efficiency. Key: linear-time sequence modeling.

SSM recurrence:

$$h_t = A h_{t-1} + B x_t, \qquad y_t = C h_t$$

The state update is linear: constant memory, $O(T)$ time.
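
A minimal NumPy sketch of this recurrence, assuming a diagonal $A$ stored as a vector (as Mamba-style models do); all names and shapes here (`ssm_recurrence`, `d_state`, the sizes) are illustrative choices, not taken from any of the papers above.

```python
import numpy as np

def ssm_recurrence(x, A, B, C):
    """Run the linear state-space recurrence h_t = A*h_{t-1} + B x_t, y_t = C h_t.

    x: (T, d_in)        input sequence
    A: (d_state,)       diagonal state matrix, stored as a vector
    B: (d_state, d_in)  input projection
    C: (d_out, d_state) readout
    Returns y: (T, d_out). Memory is O(d_state), independent of T.
    """
    T, _ = x.shape
    h = np.zeros(A.shape[0])           # fixed-size state, reused every step
    ys = np.empty((T, C.shape[0]))
    for t in range(T):                 # single pass: O(T) time
        h = A * h + B @ x[t]           # elementwise A*h because A is diagonal
        ys[t] = C @ h
    return ys

rng = np.random.default_rng(0)
T, d_in, d_state, d_out = 16, 4, 8, 4
x = rng.normal(size=(T, d_in))
A = rng.uniform(0.5, 0.95, size=d_state)    # stable per-channel decay rates
B = rng.normal(size=(d_state, d_in))
C = rng.normal(size=(d_out, d_state))
print(ssm_recurrence(x, A, B, C).shape)     # (16, 4)
```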

Equivalent convolution/kernel view:

$$y_t = \sum_{k=0}^{t} K_k\, x_{t-k}, \qquad K_k = C A^{k} B$$

SSMs can be viewed as attention with a structured kernel: both compute weighted sums over the past.
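
A self-contained check that the recurrent and convolutional views compute the same outputs, again assuming a diagonal $A$ for convenience; the sizes and random seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
T, d_in, d_state, d_out = 12, 3, 6, 2
x = rng.normal(size=(T, d_in))
A = rng.uniform(0.5, 0.95, size=d_state)   # diagonal A as a vector
B = rng.normal(size=(d_state, d_in))
C = rng.normal(size=(d_out, d_state))

# Recurrent view: one O(T) pass with a fixed-size state.
h, y_rec = np.zeros(d_state), np.empty((T, d_out))
for t in range(T):
    h = A * h + B @ x[t]
    y_rec[t] = C @ h

# Convolutional view: materialize K_k = C A^k B and take causal weighted sums.
K = np.stack([C @ np.diag(A**k) @ B for k in range(T)])   # (T, d_out, d_in)
y_conv = np.stack([sum(K[k] @ x[t - k] for k in range(t + 1)) for t in range(T)])

print(np.allclose(y_rec, y_conv))   # True: same map, two computation orders
```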

Hybrid gating intuition (generic template):

$$y_t = g_t \odot y_t^{\text{SSM}} + (1 - g_t) \odot y_t^{\text{Attn(local)}}$$

This captures Griffin-style "recurrence + local attention" hybrids: the best of both worlds.
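
A hypothetical sketch of this gating template. Real hybrids such as Griffin and Jamba interleave recurrent and attention blocks rather than literally gating two branch outputs at every position, so treat the sigmoid gate and the `W_g` projection below purely as an illustration of the equation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_mix(x, y_ssm, y_attn, W_g):
    """Blend SSM and local-attention branch outputs with an input-dependent gate.

    x:      (T, d)  layer input
    y_ssm:  (T, d)  output of the recurrent/SSM branch
    y_attn: (T, d)  output of the local-attention branch
    W_g:    (d, d)  gate projection (hypothetical parameter)
    """
    g = sigmoid(x @ W_g)                      # per-token, per-channel gate in (0, 1)
    return g * y_ssm + (1.0 - g) * y_attn     # elementwise convex combination

rng = np.random.default_rng(2)
T, d = 8, 16
x = rng.normal(size=(T, d))
y = gated_mix(x, rng.normal(size=(T, d)), rng.normal(size=(T, d)),
              0.1 * rng.normal(size=(d, d)))
print(y.shape)   # (8, 16)
```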

Key Equation
$$h_t = A h_{t-1} + B x_t, \qquad y_t = C h_t$$

Why It Matters for Modern Models

  • Long context (#30) exposes the transformer's Achilles heel (quadratic attention plus a KV cache that grows with sequence length); SSMs are the architectural escape hatch
  • Mamba-2/SSD frames *Structured State-Space Duality*: SSMs and attention are dual views; both compute weighted sums over the past, but SSMs do so via a linear recurrence
  • Jamba: a hybrid Transformer-Mamba stack plus MoE for capacity; it reports strong performance up to 256K tokens, showing that hybrids, not pure SSMs, dominate in practice
  • RecurrentGemma/Griffin: mix linear recurrences with local attention for efficiency and long-sequence suitability
  • The reason SSMs work for language now is **selectivity** (input-dependent behavior), not just $O(T)$ complexity; without it you get a bland smoothing kernel (see the sketch after this list)
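
To make the selectivity point concrete, here is a toy input-dependent recurrence: the forget gate `a_t` and write gate `b_t` are computed from the current token, so the state can sharply retain or discard information. The sigmoid parameterization and the names `W_a`, `W_b` are illustrative simplifications, not the exact Mamba-2 form.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def selective_scan(x, W_a, W_b, B, C):
    """h_t = a_t * h_{t-1} + b_t * (B x_t),  y_t = C h_t,  with a_t, b_t = f(x_t)."""
    T = x.shape[0]
    d_state = B.shape[0]
    h = np.zeros(d_state)
    ys = np.empty((T, C.shape[0]))
    for t in range(T):
        a_t = sigmoid(W_a @ x[t])     # per-channel forget gate: near 1 = remember
        b_t = sigmoid(W_b @ x[t])     # per-channel write gate: near 0 = ignore token
        h = a_t * h + b_t * (B @ x[t])
        ys[t] = C @ h
    return ys

rng = np.random.default_rng(3)
T, d_in, d_state, d_out = 16, 4, 8, 4
x = rng.normal(size=(T, d_in))
y = selective_scan(
    x,
    W_a=rng.normal(size=(d_state, d_in)),
    W_b=rng.normal(size=(d_state, d_in)),
    B=rng.normal(size=(d_state, d_in)),
    C=rng.normal(size=(d_out, d_state)),
)
print(y.shape)   # (16, 4)
```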

Missing Intuition

What is still poorly explained in textbooks and papers:

  • SSMs can be taught as "attention with a structured kernel": both compute weighted sums over the past; SSMs just do it via a recurrence/scan
  • The reason "SSMs work for language now" is selectivity (input-dependent behavior); without it you get smoothing that cannot do sharp retrieval
  • Hybrids exist because you want both local attention for short-range syntax and a recurrence/SSM for long-range memory; neither alone is optimal
  • Constant state memory is the key advantage: the KV cache grows with T, while the SSM state stays a fixed size, enabling effectively unbounded context
  • Hardware-friendliness is critical: the linear recurrence maps onto efficient parallel scans/cumsum, while attention needs custom kernels (FlashAttention) to be fast (see the sketch after this list)
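
To see why the recurrence is scan-friendly: each step h_t = a_t·h_{t-1} + u_t is an affine map, and composing affine maps is associative, so a parallel prefix scan (e.g. jax.lax.associative_scan or a CUDA scan kernel) can process the sequence in logarithmic depth instead of a strictly serial loop. A scalar-state sketch of the combine operator (an illustrative simplification):

```python
import numpy as np

def combine(f, g):
    """Compose affine updates: apply f = (a_f, u_f) first, then g = (a_g, u_g)."""
    a_f, u_f = f
    a_g, u_g = g
    return (a_g * a_f, a_g * u_f + u_g)

rng = np.random.default_rng(4)
f, g, k = [(rng.uniform(0.5, 1.0), rng.normal()) for _ in range(3)]

# Associativity: (f then g) then k == f then (g then k),
# so a scan may group steps however it likes across the sequence.
print(np.allclose(combine(combine(f, g), k), combine(f, combine(g, k))))   # True

# Folding the combine over T steps reproduces the serial recurrence exactly,
# while the carried value stays O(1) in size (unlike a KV cache that grows with T).
steps = [(rng.uniform(0.5, 1.0), rng.normal()) for _ in range(32)]
h = 0.0
for a_t, u_t in steps:          # serial recurrence
    h = a_t * h + u_t
acc = steps[0]
for s in steps[1:]:             # fold of the associative combine
    acc = combine(acc, s)
print(np.isclose(h, acc[0] * 0.0 + acc[1]))   # True: same final state from h_0 = 0
```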
