Sparse Mixture of Experts: Routing, Load Balancing & Expert Parallelism
Canonical Papers
- Mixtral of Experts
- DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models
- DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
Core Mathematics
MoE changes the economics of inference: you can scale total parameters dramatically while keeping activated compute per token constant.
Router probabilities (per token, per layer):
Given token hidden state $x_t$, a linear router with weights $W_r$ produces probabilities over the $N$ experts:
$$p_i(x_t) = \frac{\exp\!\big((W_r x_t)_i\big)}{\sum_{j=1}^{N} \exp\!\big((W_r x_t)_j\big)}$$
Top-k gating (sparse activation):
Let $\mathcal{T} = \operatorname{TopK}\!\big(p(x_t), k\big)$ be the indices of the $k$ highest-probability experts. Only those experts run, and their outputs are mixed:
$$y_t = \sum_{i \in \mathcal{T}} g_i(x_t)\, E_i(x_t), \qquad g_i(x_t) = \frac{p_i(x_t)}{\sum_{j \in \mathcal{T}} p_j(x_t)}$$
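A minimal sketch of this routing in PyTorch, assuming a Mixtral-style layer with renormalized top-k gating; the class name, dimensions, and expert MLP shape are illustrative, not taken from any specific codebase:

```python
# Minimal sketch of per-token top-k routing (reference implementation, not a
# production kernel). Names and sizes are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKRouterMoE(nn.Module):
    def __init__(self, d_model: int, n_experts: int, k: int, d_ff: int):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts, bias=False)  # W_r
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model), batch and sequence dims flattened upstream
        probs = F.softmax(self.router(x), dim=-1)          # p_i(x_t), dense
        topk_p, topk_idx = probs.topk(self.k, dim=-1)      # sparse selection
        gates = topk_p / topk_p.sum(dim=-1, keepdim=True)  # renormalize over top-k
        out = torch.zeros_like(x)
        for slot in range(self.k):                         # each token's k-th choice
            idx = topk_idx[:, slot]
            for e, expert in enumerate(self.experts):
                mask = idx == e
                if mask.any():                             # only matching tokens run expert e
                    out[mask] += gates[mask, slot, None] * expert(x[mask])
        return out

# Usage: 16 tokens, each activates only 2 of 8 experts
moe = TopKRouterMoE(d_model=64, n_experts=8, k=2, d_ff=256)
y = moe(torch.randn(16, 64))
```

The double loop is deliberately naive; real implementations group tokens by expert and dispatch each group in one batched call, which is exactly the permutation discussed under expert parallelism below.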
Load-balancing loss (prevents expert collapse):
Uses the frequency of expert selection ($f_i$, the fraction of routed tokens sent to expert $i$) and the average gating score ($P_i$, the mean router probability assigned to expert $i$ over the batch):
$$\mathcal{L}_{\text{balance}} = \alpha \, N \sum_{i=1}^{N} f_i \, P_i$$
This regularizer prevents routers from collapsing into a small subset of experts—surprisingly easy to implement wrong in distributed training.
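A hedged sketch of this Switch-Transformer-style auxiliary loss built from those two statistics; here `probs` are the dense router probabilities and `topk_idx` the chosen expert indices (the intermediates a router like the sketch above computes), and `alpha` is an assumed coefficient:

```python
# Sketch of the auxiliary load-balancing loss described above.
import torch

def load_balancing_loss(probs: torch.Tensor, topk_idx: torch.Tensor,
                        n_experts: int, alpha: float = 0.01) -> torch.Tensor:
    # f_i: fraction of routed (token, slot) assignments that went to expert i
    one_hot = torch.nn.functional.one_hot(topk_idx, n_experts).float()  # (tokens, k, E)
    f = one_hot.sum(dim=(0, 1)) / one_hot.sum()
    # P_i: mean router probability mass assigned to expert i over the batch
    P = probs.mean(dim=0)
    # Minimized when both f and P are uniform (1/N per expert); only P carries gradient
    return alpha * n_experts * torch.sum(f * P)
```

Note the granularity trap the next section mentions: computing `f` and `P` over too small a micro-batch balances within sequences instead of across the data distribution.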
Why It Matters for Modern Models
- Mixtral (8×7B) activates only 2 experts per token—"lots of total params, few active params" is the MoE bargain
- DeepSeek-V2: 236B total params / 21B activated per token with long context—shows MoE is co-designed with serving constraints
- Grok-1 (314B MoE), Qwen MoE variants—MoE is a real design choice in production frontier models, not theoretical
- MoE trades FLOPs for memory footprint + communication—routing tokens dynamically makes serving harder (token batches fragment by expert)
- After #21 teaches serving efficiency, #22 shows how frontier labs change the model itself to keep serving economically viable at scale
Missing Intuition
What is still poorly explained in textbooks and papers:
- MoE is "sparse compute, dense memory"—you compute only k experts but need all expert weights available (or sharded), trading FLOPs for memory + communication
- The router is just a classifier trained by backprop—it learns a partition of token space, and without regularization happily collapses to a few experts
- Load balancing is subtle: balance at wrong granularity destroys specialization (micro-batch balancing pushes toward within-sequence uniformity)
- MoE is not an ensemble—it's conditional computation where different tokens see different subnetworks, changing training dynamics and failure modes
- Distributed MoE ≈ all-to-all communication disguised as an MLP—tokens permute across devices twice per layer (dispatch/compute/combine), as sketched below
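A single-device simulation of that dispatch/compute/combine pattern, assuming top-1 routing with a unit gate for simplicity; in real expert parallelism the sort and unsort below become the two all-to-all collectives, with each expert's contiguous slice living on a different device:

```python
# Single-device simulation of dispatch/compute/combine (illustrative sketch).
import torch

def dispatch_compute_combine(x, expert_idx, experts):
    # x: (tokens, d_model); expert_idx: (tokens,) top-1 assignment; experts: list of MLPs
    order = torch.argsort(expert_idx)                # dispatch: group tokens by expert
    x_sorted, idx_sorted = x[order], expert_idx[order]
    counts = torch.bincount(idx_sorted, minlength=len(experts))
    out_sorted = torch.empty_like(x_sorted)
    start = 0
    for e, expert in enumerate(experts):             # compute: each expert sees one contiguous slice
        end = start + counts[e].item()
        if end > start:
            out_sorted[start:end] = expert(x_sorted[start:end])
        start = end
    out = torch.empty_like(x)
    out[order] = out_sorted                          # combine: restore original token order
    return out
```

The per-expert `counts` also show why serving gets harder: a batch that arrives as one dense matrix fragments into unevenly sized per-expert chunks, and the slowest (most loaded) expert sets the step time.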