Legacy Concept Lab

Mixture-of-Depths

Adaptive compute: "think harder" only when needed

Concept 95 of 100 · Efficiency · Phase 13



Phase 13: Cutting-edge 2024-2025 research

Why It Matters for Modern Models

What is still poorly explained in textbooks and papers:

If you want intuition first, start with the key equation and the visualization. Come back here for the full walkthrough.

Key Equation

$$h^{\ell+1}_t = \text{Block}_\ell(h^\ell_t) \cdot \mathbf{1}_{t \in S_\ell} + h^\ell_t \cdot \mathbf{1}_{t \notin S_\ell}$$

Route tokens to different depths. At layer $\ell$, compute a score $g_\ell(t)$ for each token $t$, then keep the $k$ highest-scoring of the $n$ tokens as the set $S_\ell$:

$$S_\ell = \text{TopK}(\{g_\ell(t)\}_{t=1}^n, k)$$

$$h^{\ell+1}_t = \begin{cases} \text{Block}_\ell(h^\ell_t) & t \in S_\ell \\ h^\ell_t & \text{otherwise} \end{cases}$$
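The routing rule above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the implementation from the paper: the function name `mod_layer`, the linear router `router_w` (so that $g_\ell(t) = h^\ell_t \cdot w$), and the toy `block` are all hypothetical. It follows the equation literally, so the scores only decide which tokens are selected; a trainable router would also need the scores to enter the output so that gradients can reach it, which is omitted here.

```python
import numpy as np

def mod_layer(h, block, router_w, k):
    """Apply `block` to the k highest-scoring tokens; pass the others through.

    h        : (n, d) array of token states h^l_t
    block    : callable (m, d) -> (m, d), a stand-in for Block_l
    router_w : (d,) weights of a hypothetical linear router, g_l(t) = h_t . router_w
    k        : number of tokens allowed through the block, 0 <= k <= n
    """
    n = h.shape[0]
    if not 0 <= k <= n:
        raise ValueError("k must satisfy 0 <= k <= n")
    scores = h @ router_w                  # g_l(t) for every token, shape (n,)
    order = np.argsort(scores)             # token indices, ascending by score
    selected = np.sort(order[n - k:])      # S_l = indices of the k largest scores
    out = h.copy()                         # identity path for t not in S_l
    out[selected] = block(h[selected])     # Block_l(h_t) for t in S_l
    return out, selected

# Toy usage: 4 tokens of width 3, a block that doubles its input, k = 2.
h = np.arange(12, dtype=float).reshape(4, 3)
router_w = np.array([1.0, 0.0, 0.0])       # score = first feature of each token
out, selected = mod_layer(h, lambda x: 2.0 * x, router_w, k=2)
```

Because `block` only ever sees an array with `k` rows, its cost depends on `k` rather than on the sequence length `n`.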

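Train with an explicit compute constraint: $k$ is fixed in advance, so exactly $k$ of the $n$ tokens pass through each block and the FLOPs per layer are predictable, whichever tokens the scores select.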
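As a worked example of the budget, take $n = 1024$ tokens and $k = 128$ (illustrative numbers, not from the source):

$$\frac{k}{n} = \frac{128}{1024} = 0.125, \qquad \left(\frac{k}{n}\right)^2 = \frac{1}{64} \approx 0.0156$$

Work done once per token inside the block (for example the MLP) therefore falls to 12.5% of the dense cost. If self-attention inside the block is restricted to the selected tokens, which is an assumption about the block and not something the equations above state, the pairwise work falls to roughly 1.6%. Both ratios depend only on $k$ and $n$, not on which tokens are routed.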

Raposo et al., 2024, arXiv
