Legacy Concept Lab
Mixture-of-Depths
Adaptive compute: "think harder" only when needed
#95 · MoD · Efficiency
key equation
h^{\ell+1}_t = \text{Block}_\ell(h^\ell_t) \cdot \mathbf{1}_{t \in S_\ell} + h^\ell_t \cdot \mathbf{1}_{t \notin S_\ell}
Phase 13: Cutting-edge 2024-2025 research · Concept 95 of 100
Why It Matters for Modern Models
- Adaptive compute: "think harder" only when needed
- Like MoE, but the router decides whether a token passes through a layer or skips it, not which expert processes it
- Predictable FLOPs budget enables efficient deployment
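To make the "predictable FLOPs budget" point concrete, here is a rough back-of-the-envelope sketch. The cost formula is a common approximation, not something taken from this page: it assumes a standard block with four attention projections, a 4x-wide MLP, and attention computed only among the routed tokens. The names `block_flops` and `routed_fraction` are hypothetical, and the small cost of the router itself is ignored.

```python
def block_flops(n_tokens: int, d_model: int) -> int:
    """Rough forward-pass FLOPs for one transformer block (assumed estimate)."""
    projections = 8 * n_tokens * d_model**2   # Q, K, V, and output projections
    attention = 4 * n_tokens**2 * d_model     # QK^T scores plus weighted sum of V
    mlp = 16 * n_tokens * d_model**2          # two matmuls with hidden size 4 * d_model
    return projections + attention + mlp


def routed_fraction(seq_len: int, capacity: int, d_model: int) -> float:
    """Cost of running the block on `capacity` tokens, relative to all `seq_len` tokens."""
    return block_flops(capacity, d_model) / block_flops(seq_len, d_model)


if __name__ == "__main__":
    # Capacity of 256 out of 2048 tokens (12.5%) at d_model = 1024.
    print(routed_fraction(seq_len=2048, capacity=256, d_model=1024))
```

Because the capacity is fixed before training, this fraction is known in advance and does not depend on which tokens the router happens to pick. Under these assumptions the routed block costs less than 12.5% of the dense block, since the attention term shrinks quadratically with the number of tokens.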
What Tutorials Skip
What is still poorly explained in textbooks and papers:
- Some tokens need deep processing, others can skip layers
- Router learns which tokens are "important"
- Complement to MoE: sparse width (MoE) + sparse depth (MoD)
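The points above can be illustrated with a toy calculation of each token's effective depth, that is, how many blocks it actually passes through. This is only a sketch: the score table is made up, `top_k_indices` and `effective_depth` are hypothetical helper names, and the scores are given directly rather than produced by a trained router.

```python
def top_k_indices(scores, k):
    """Indices of the k largest scores (ties broken in favour of the lower index)."""
    order = sorted(range(len(scores)), key=lambda t: (-scores[t], t))
    return set(order[:k])


def effective_depth(router_scores, capacity):
    """router_scores[l][t] is the (made-up) score of token t at layer l.

    Returns, for each token, the number of blocks it was routed through.
    """
    n_tokens = len(router_scores[0])
    depth = [0] * n_tokens
    for layer_scores in router_scores:
        for t in top_k_indices(layer_scores, capacity):
            depth[t] += 1
    return depth


# Three layers, four tokens, capacity of two tokens per layer.
scores = [
    [0.9, 0.1, 0.5, 0.2],
    [0.8, 0.3, 0.1, 0.7],
    [0.6, 0.2, 0.9, 0.1],
]
print(effective_depth(scores, capacity=2))  # [3, 0, 2, 1]
```

Token 0 is processed by every block while token 1 skips all of them, yet the total work is always layers times capacity (here 3 x 2 = 6 block applications).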
Interactive Visualization
Core Math (Optional Deep Dive)
If you want intuition first, start with the key equation and the visualization. Come back here for the full walkthrough.
Key Equation
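Route tokens to different depths. At layer ℓ, the router assigns each token t a scalar score (a linear router, as in the original Mixture-of-Depths formulation):
r^\ell_t = w_\ell^\top h^\ell_t
The k highest-scoring tokens form the set of tokens that pass through the block:
S_\ell = \{ t : r^\ell_t \text{ is among the } k \text{ largest scores at layer } \ell \}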
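Train with an explicit compute constraint: fix the number of tokens k that each block may process ahead of time, so the FLOPs per forward pass are predictable.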
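A minimal sketch of one routed layer, following the key equation: tokens selected by the router go through the block, and all others are copied through unchanged. Everything here is illustrative. `mod_layer` and `toy_block` are hypothetical names, the router is assumed to be a single dot product per token, the "block" is a stand-in that doubles each vector, and how the router is trained is left out entirely.

```python
from typing import Callable, List

Vector = List[float]


def mod_layer(
    hidden: List[Vector],
    router_w: Vector,
    block: Callable[[List[Vector]], List[Vector]],
    capacity: int,
) -> List[Vector]:
    """Apply `block` to the `capacity` highest-scoring tokens; pass the rest through."""
    # Router score per token: dot product of the hidden state with router weights.
    scores = [sum(x * w for x, w in zip(h, router_w)) for h in hidden]
    # Selected set: top-`capacity` token indices, restored to sequence order.
    ranked = sorted(range(len(hidden)), key=lambda t: (-scores[t], t))
    selected = sorted(ranked[:capacity])
    # The block only ever sees the selected tokens, so its cost depends on capacity.
    processed = block([hidden[t] for t in selected])
    out = [list(h) for h in hidden]   # identity path for tokens outside the set
    for t, new_h in zip(selected, processed):
        out[t] = new_h                # block output for tokens inside the set
    return out


def toy_block(tokens: List[Vector]) -> List[Vector]:
    """Stand-in for a transformer block: doubles every value."""
    return [[2.0 * x for x in h] for h in tokens]


hidden = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [0.5, 0.5]]
result = mod_layer(hidden, router_w=[1.0, -1.0], block=toy_block, capacity=2)
print(result)
```

With these inputs the scores are 1, -1, 0, and 0, so tokens 0 and 2 are processed (the tie goes to the lower index) and tokens 1 and 3 are returned unchanged. The block is called on exactly `capacity` tokens regardless of the input, which is what keeps the compute budget fixed.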