#31 Core Training

🔀 State Space Models & Hybrid Architectures: Mamba-2, Jamba, Griffin

Canonical Papers

Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality

Dao & Gu, 2024, ICML

Jamba: A Hybrid Transformer-Mamba Language Model

AI21 Labs, 2024, arXiv

RecurrentGemma: Moving Past Transformers for Efficient Open Language Models

Google DeepMind, 2024, Technical Report

Core Mathematics

SSMs replace global attention with recurrences/structured kernels, or mix both (local attention + recurrence) for long-context efficiency. Key: linear-time sequence modeling.

SSM recurrence:

$$h_t = A h_{t-1} + B x_t, \qquad y_t = C h_t$$

The state update is linear: constant memory, $O(T)$ time.
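
A minimal NumPy sketch of this recurrence, assuming a diagonal $A$ stored as a vector (as Mamba-style models do); all names and shapes here (`ssm_recurrence`, `d_state`, the sizes) are illustrative choices, not taken from any of the papers above.

```python
import numpy as np

def ssm_recurrence(x, A, B, C):
    """Run the linear state-space recurrence h_t = A*h_{t-1} + B x_t, y_t = C h_t.

    x: (T, d_in)        input sequence
    A: (d_state,)       diagonal state matrix, stored as a vector
    B: (d_state, d_in)  input projection
    C: (d_out, d_state) readout
    Returns y: (T, d_out). Memory is O(d_state), independent of T.
    """
    T, _ = x.shape
    h = np.zeros(A.shape[0])           # fixed-size state, reused every step
    ys = np.empty((T, C.shape[0]))
    for t in range(T):                 # single pass: O(T) time
        h = A * h + B @ x[t]           # elementwise A*h because A is diagonal
        ys[t] = C @ h
    return ys

rng = np.random.default_rng(0)
T, d_in, d_state, d_out = 16, 4, 8, 4
x = rng.normal(size=(T, d_in))
A = rng.uniform(0.5, 0.95, size=d_state)    # stable per-channel decay rates
B = rng.normal(size=(d_state, d_in))
C = rng.normal(size=(d_out, d_state))
print(ssm_recurrence(x, A, B, C).shape)     # (16, 4)
```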

Equivalent convolution/kernel view:

$$y_t = \sum_{k=0}^{t} K_k\, x_{t-k}, \qquad K_k = C A^{k} B$$

SSMs can be viewed as attention with a structured kernel: both compute weighted sums over the past.
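
A self-contained check that the recurrent and convolutional views compute the same outputs, again assuming a diagonal $A$ for convenience; the sizes and random seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
T, d_in, d_state, d_out = 12, 3, 6, 2
x = rng.normal(size=(T, d_in))
A = rng.uniform(0.5, 0.95, size=d_state)   # diagonal A as a vector
B = rng.normal(size=(d_state, d_in))
C = rng.normal(size=(d_out, d_state))

# Recurrent view: one O(T) pass with a fixed-size state.
h, y_rec = np.zeros(d_state), np.empty((T, d_out))
for t in range(T):
    h = A * h + B @ x[t]
    y_rec[t] = C @ h

# Convolutional view: materialize K_k = C A^k B and take causal weighted sums.
K = np.stack([C @ np.diag(A**k) @ B for k in range(T)])   # (T, d_out, d_in)
y_conv = np.stack([sum(K[k] @ x[t - k] for k in range(t + 1)) for t in range(T)])

print(np.allclose(y_rec, y_conv))   # True: same map, two computation orders
```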

Hybrid gating intuition (generic template):

$$y_t = g_t \odot y_t^{\text{SSM}} + (1 - g_t) \odot y_t^{\text{Attn(local)}}$$

This captures Griffin-style "recurrence + local attention" hybrids: the best of both worlds.
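
A hypothetical sketch of this gating template. Real hybrids such as Griffin and Jamba interleave recurrent and attention blocks rather than literally gating two branch outputs at every position, so treat the sigmoid gate and the `W_g` projection below purely as an illustration of the equation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_mix(x, y_ssm, y_attn, W_g):
    """Blend SSM and local-attention branch outputs with an input-dependent gate.

    x:      (T, d)  layer input
    y_ssm:  (T, d)  output of the recurrent/SSM branch
    y_attn: (T, d)  output of the local-attention branch
    W_g:    (d, d)  gate projection (hypothetical parameter)
    """
    g = sigmoid(x @ W_g)                      # per-token, per-channel gate in (0, 1)
    return g * y_ssm + (1.0 - g) * y_attn     # elementwise convex combination

rng = np.random.default_rng(2)
T, d = 8, 16
x = rng.normal(size=(T, d))
y = gated_mix(x, rng.normal(size=(T, d)), rng.normal(size=(T, d)),
              0.1 * rng.normal(size=(d, d)))
print(y.shape)   # (8, 16)
```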

Key Equation
$$h_t = A h_{t-1} + B x_t, \qquad y_t = C h_t$$

Why It Matters for Modern Models

  • Long context (#30) exposes the transformer's Achilles heel (quadratic attention plus a KV cache that grows with sequence length); SSMs are the architectural escape hatch
  • Mamba-2/SSD frames *Structured State-Space Duality*: SSMs and attention are dual views; both compute weighted sums over the past, but SSMs do so via a linear recurrence
  • Jamba: a hybrid Transformer-Mamba stack plus MoE for capacity; it reports strong performance up to 256K tokens, showing that hybrids, not pure SSMs, dominate in practice
  • RecurrentGemma/Griffin: mix linear recurrences with local attention for efficiency and long-sequence suitability
  • The reason SSMs work for language now is **selectivity** (input-dependent behavior), not just $O(T)$ complexity; without it you get a bland smoothing kernel (see the sketch after this list)
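
To make the selectivity point concrete, here is a toy input-dependent recurrence: the forget gate `a_t` and write gate `b_t` are computed from the current token, so the state can sharply retain or discard information. The sigmoid parameterization and the names `W_a`, `W_b` are illustrative simplifications, not the exact Mamba-2 form.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def selective_scan(x, W_a, W_b, B, C):
    """h_t = a_t * h_{t-1} + b_t * (B x_t),  y_t = C h_t,  with a_t, b_t = f(x_t)."""
    T = x.shape[0]
    d_state = B.shape[0]
    h = np.zeros(d_state)
    ys = np.empty((T, C.shape[0]))
    for t in range(T):
        a_t = sigmoid(W_a @ x[t])     # per-channel forget gate: near 1 = remember
        b_t = sigmoid(W_b @ x[t])     # per-channel write gate: near 0 = ignore token
        h = a_t * h + b_t * (B @ x[t])
        ys[t] = C @ h
    return ys

rng = np.random.default_rng(3)
T, d_in, d_state, d_out = 16, 4, 8, 4
x = rng.normal(size=(T, d_in))
y = selective_scan(
    x,
    W_a=rng.normal(size=(d_state, d_in)),
    W_b=rng.normal(size=(d_state, d_in)),
    B=rng.normal(size=(d_state, d_in)),
    C=rng.normal(size=(d_out, d_state)),
)
print(y.shape)   # (16, 4)
```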

Missing Intuition

What is still poorly explained in textbooks and papers:

  • SSMs can be taught as "attention with a structured kernel": both compute weighted sums over the past; SSMs just do it via a recurrence/scan
  • The reason "SSMs work for language now" is selectivity (input-dependent behavior); without it you get smoothing that cannot do sharp retrieval
  • Hybrids exist because you want both local attention for short-range syntax and a recurrence/SSM for long-range memory; neither alone is optimal
  • Constant state memory is the key advantage: the KV cache grows with T, while the SSM state stays a fixed size, enabling effectively unbounded context
  • Hardware-friendliness is critical: the linear recurrence maps onto efficient parallel scans/cumsum, while attention needs custom kernels (FlashAttention) to be fast (see the sketch after this list)
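
To see why the recurrence is scan-friendly: each step h_t = a_t·h_{t-1} + u_t is an affine map, and composing affine maps is associative, so a parallel prefix scan (e.g. jax.lax.associative_scan or a CUDA scan kernel) can process the sequence in logarithmic depth instead of a strictly serial loop. A scalar-state sketch of the combine operator (an illustrative simplification):

```python
import numpy as np

def combine(f, g):
    """Compose affine updates: apply f = (a_f, u_f) first, then g = (a_g, u_g)."""
    a_f, u_f = f
    a_g, u_g = g
    return (a_g * a_f, a_g * u_f + u_g)

rng = np.random.default_rng(4)
f, g, k = [(rng.uniform(0.5, 1.0), rng.normal()) for _ in range(3)]

# Associativity: (f then g) then k == f then (g then k),
# so a scan may group steps however it likes across the sequence.
print(np.allclose(combine(combine(f, g), k), combine(f, combine(g, k))))   # True

# Folding the combine over T steps reproduces the serial recurrence exactly,
# while the carried value stays O(1) in size (unlike a KV cache that grows with T).
steps = [(rng.uniform(0.5, 1.0), rng.normal()) for _ in range(32)]
h = 0.0
for a_t, u_t in steps:          # serial recurrence
    h = a_t * h + u_t
acc = steps[0]
for s in steps[1:]:             # fold of the associative combine
    acc = combine(acc, s)
print(np.isclose(h, acc[0] * 0.0 + acc[1]))   # True: same final state from h_0 = 0
```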
