Legacy Concept Lab

Mixture-of-Depths

Adaptive compute: "think harder" only when needed

Concept 95 of 100 (MoD), Efficiency
Phase 13: Cutting-edge 2024-2025 research

Key equation:
$$h^{\ell+1}_t = \text{Block}_\ell(h^\ell_t) \cdot \mathbf{1}_{t \in S_\ell} + h^\ell_t \cdot \mathbf{1}_{t \notin S_\ell}$$

Why It Matters for Modern Models

  • Adaptive compute: "think harder" only when needed
  • Like MoE but routing tokens to layers, not experts
  • Predictable FLOPs budget enables efficient deployment
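
To make the "predictable FLOPs budget" point concrete, here is a rough back-of-the-envelope sketch. It rests on assumptions that are not stated in this page: a standard attention block with a 4x-wide MLP, matmul-only FLOP counting (router, softmax, and normalization ignored), and the selected tokens attending only among themselves. The function names `block_flops` and `mod_cost_ratio` are hypothetical.

```python
def block_flops(tokens: int, d_model: int) -> int:
    """Rough matmul FLOPs for one transformer block over `tokens` tokens.

    Assumes standard multi-head attention and a 4x-wide MLP; ignores the
    router, softmax, normalization, and biases.
    """
    projections = 8 * tokens * d_model**2   # Q, K, V, and output projections
    attention = 4 * tokens**2 * d_model     # QK^T scores and weighted sum of V
    mlp = 16 * tokens * d_model**2          # two d_model <-> 4*d_model matmuls
    return projections + attention + mlp


def mod_cost_ratio(n: int, k: int, d_model: int) -> float:
    """Cost of a block that processes only k of n tokens, relative to dense."""
    return block_flops(k, d_model) / block_flops(n, d_model)


print(mod_cost_ratio(n=1024, k=512, d_model=512))  # 0.4375
```

Because k is fixed ahead of time, this ratio is known before any input arrives. Under these assumptions the linear terms shrink by k/n and the attention term by (k/n)^2, which is why halving the capacity cuts the block's cost by more than half.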

What Tutorials Skip

What is still poorly explained in textbooks and papers:

  • Some tokens need deep processing, others can skip layers
  • Router learns which tokens are "important"
  • Complement to MoE: sparse width (MoE) + sparse depth (MoD)

Core Math (Optional Deep Dive)

If you want intuition first, start with the key equation. Come back here for the full walkthrough.

Key Equation
$$h^{\ell+1}_t = \text{Block}_\ell(h^\ell_t) \cdot \mathbf{1}_{t \in S_\ell} + h^\ell_t \cdot \mathbf{1}_{t \notin S_\ell}$$

Route tokens to different depths. At layer $\ell$, a router assigns each token $t$ a score $g_\ell(t)$. The $k$ highest-scoring tokens form the set $S_\ell$, and only those tokens pass through the block:

$$S_\ell = \text{TopK}(\{g_\ell(t)\}_{t=1}^n, k)$$
$$h^{\ell+1}_t = \begin{cases} \text{Block}_\ell(h^\ell_t) & t \in S_\ell \\ h^\ell_t & \text{otherwise} \end{cases}$$

Train with an explicit compute constraint $k$, the number of tokens each routed layer processes, so the FLOPs per forward pass are predictable.
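
The routing rule above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the linear router `router_w`, the function name `mod_layer`, and the toy block are all assumptions made for the example, and the block is assumed to include its own residual connection.

```python
import numpy as np


def mod_layer(h, router_w, block, k):
    """Apply `block` to the k highest-scoring tokens; copy the rest through.

    h:        (n, d) hidden states entering the layer
    router_w: (d,) weights of a hypothetical linear router, g(t) = h_t . router_w
    block:    callable mapping (m, d) -> (m, d)
    k:        capacity, the number of tokens the block may process
    """
    scores = h @ router_w                                # g(t) for every token
    selected = np.argsort(-scores, kind="stable")[:k]    # indices of the top-k scores
    selected = np.sort(selected)                         # restore sequence order
    out = h.copy()                                       # tokens not in S pass through
    out[selected] = block(h[selected])                   # tokens in S get Block(h_t)
    return out, selected


# Toy usage: 6 tokens of width 4, capacity k = 2.
rng = np.random.default_rng(0)
h = rng.normal(size=(6, 4))
router_w = rng.normal(size=4)
toy_block = lambda x: x + np.tanh(x)   # stand-in for a real transformer block
out, selected = mod_layer(h, router_w, toy_block, k=2)
print("processed tokens:", selected)
```

The block only ever sees a (k, d) input, which is where the fixed compute budget comes from. The sketch covers the hard selection only; how gradients reach the router during training is outside what the equations here describe.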

Canonical Papers

Mixture-of-Depths: Dynamically allocating compute in transformer-based language models

Raposo et al., 2024, arXiv
