#22 · Efficiency

🔀 Sparse Mixture of Experts: Routing, Load Balancing & Expert Parallelism

Canonical Papers

Mixtral of Experts

Jiang et al., 2024, arXiv

DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models

Dai et al., 2024, ACL

DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

DeepSeek-AI, 2024, arXiv

Core Mathematics

MoE changes the economics of inference: you can scale total parameters dramatically while keeping activated compute per token constant.
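
As a rough illustration (hypothetical numbers, not any specific model): with $N_E = 8$ experts of about 7B parameters each and top-$2$ routing, the expert layers store roughly 56B parameters, yet each token's forward pass runs only its 2 selected experts, about 14B parameters' worth, so activated FFN compute per token is roughly a quarter of a dense layer with the same total capacity.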

Router probabilities (per token, per layer):

Given token hidden state $h \in \mathbb{R}^d$, a linear router produces expert probabilities:

$$z = W_r h, \qquad p(e \mid h) = \mathrm{softmax}(z)_e$$

Top-k gating (sparse activation):

Let $S = \mathrm{TopK}(p(\cdot \mid h), k)$. Only those experts run, and outputs are mixed:

$$\mathrm{MoE}(h) = \sum_{e \in S} \tilde{p}_e \cdot f_e(h), \qquad \tilde{p}_e = \frac{p(e \mid h)}{\sum_{j \in S} p(j \mid h)}$$
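
A minimal PyTorch sketch of the router and top-$k$ gating equations above. The class and names (`TopKMoE`, `d_model`, `d_ff`, `n_experts`, `top_k`) are illustrative assumptions, not code from the cited papers:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Sparse MoE layer: linear router -> softmax -> top-k experts, mixed by renormalized gates."""
    def __init__(self, d_model: int = 512, d_ff: int = 2048,
                 n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)   # W_r
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (n_tokens, d_model)
        probs = F.softmax(self.router(h), dim=-1)         # p(e | h)
        top_p, top_e = probs.topk(self.top_k, dim=-1)     # S = TopK(p, k)
        top_p = top_p / top_p.sum(dim=-1, keepdim=True)   # renormalize to get p~_e
        out = torch.zeros_like(h)
        for slot in range(self.top_k):
            for e in range(len(self.experts)):
                mask = top_e[:, slot] == e                # tokens whose slot-th choice is expert e
                if mask.any():
                    out[mask] += top_p[mask, slot].unsqueeze(-1) * self.experts[e](h[mask])
        return out
```

Calling `TopKMoE()(torch.randn(16, 512))` runs only 2 of the 8 expert MLPs for each of the 16 tokens and mixes their outputs with the renormalized gate weights.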

Load-balancing loss (prevents expert collapse):

Uses the fraction of tokens routed to expert $i$ ($f_i$) and the expert's average gating probability ($P_i$):

$$\mathcal{L}_{\text{LB}} = N_E \sum_{i=1}^{N_E} f_i \cdot P_i$$

This regularizer keeps the router from collapsing onto a small subset of experts; it is also surprisingly easy to implement incorrectly in distributed training.
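
A sketch of this loss in the same notation, assuming `probs` holds the router's softmax outputs and `top_e` the selected expert indices; papers differ in the exact normalization and in the coefficient used to weight the term, so treat this as one reasonable variant:

```python
import torch
import torch.nn.functional as F

def load_balance_loss(probs: torch.Tensor, top_e: torch.Tensor) -> torch.Tensor:
    # probs: (n_tokens, n_experts) router probabilities p(e | h)
    # top_e: (n_tokens, top_k)     indices of the selected experts per token
    n_experts = probs.shape[-1]
    one_hot = F.one_hot(top_e, n_experts).float()        # (n_tokens, top_k, n_experts)
    f = one_hot.sum(dim=1).mean(dim=0) / top_e.shape[1]  # f_i: fraction of routing slots assigned to expert i
    P = probs.mean(dim=0)                                # P_i: mean gating probability for expert i
    return n_experts * torch.sum(f * P)                  # N_E * sum_i f_i * P_i
```

Because `f` comes from a hard top-k choice, gradients flow only through `P`: the term pushes gate probability away from experts that already receive many tokens, which is exactly the pressure that counteracts collapse.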

Key Equation
$$\mathrm{MoE}(h) = \sum_{e \in S} \tilde{p}_e \cdot f_e(h)$$


Why It Matters for Modern Models

  • Mixtral (8×7B) activates only 2 experts per token—"lots of total params, few active params" is the MoE bargain
  • DeepSeek-V2: 236B total params / 21B activated per token with long context—shows MoE is co-designed with serving constraints
  • Grok-1 (314B MoE), Qwen MoE variants—MoE is a real design choice in production frontier models, not theoretical
  • MoE trades FLOPs for memory footprint + communication—routing tokens dynamically makes serving harder (token batches fragment by expert)
  • After #21 teaches serving efficiency, #22 shows how frontier labs change the model itself to keep serving economically viable at scale

Missing Intuition

What is still poorly explained in textbooks and papers:

  • MoE is "sparse compute, dense memory"—you compute only k experts but need all expert weights available (or sharded), trading FLOPs for memory + communication
  • The router is just a classifier trained by backprop—it learns a partition of token space, and without regularization happily collapses to a few experts
  • Load balancing is subtle: balancing at the wrong granularity destroys specialization (micro-batch balancing pushes toward within-sequence uniformity)
  • MoE is not an ensemble—it's conditional computation where different tokens see different subnetworks, changing training dynamics and failure modes
  • Distributed MoE ≈ all-to-all communication disguised as an MLP: tokens permute across devices twice per layer (dispatch/compute/combine); see the sketch after this list
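
The last bullet can be made concrete with a single-process simulation of the dispatch/compute/combine pattern, here with top-1 routing for clarity. In a real expert-parallel setup the grouping and un-grouping steps are all-to-all collectives across devices; the function name and shapes below are illustrative, not any framework's API:

```python
import torch

def dispatch_compute_combine(h, expert_ids, experts):
    # h: (n_tokens, d_model); expert_ids: (n_tokens,) top-1 routing decisions
    order = torch.argsort(expert_ids)                 # dispatch: group tokens by expert
    grouped = h[order]
    counts = torch.bincount(expert_ids, minlength=len(experts)).tolist()
    outputs, start = [], 0
    for e, n in enumerate(counts):                    # compute: each expert sees one contiguous chunk
        outputs.append(experts[e](grouped[start:start + n]))
        start += n
    out = torch.cat(outputs, dim=0)
    inverse = torch.empty_like(order)
    inverse[order] = torch.arange(len(order))         # combine: invert the permutation
    return out[inverse]                               # tokens back in their original order
```

With `experts = [torch.nn.Linear(64, 64) for _ in range(4)]`, `expert_ids = torch.randint(0, 4, (32,))`, and `h = torch.randn(32, 64)`, the result equals applying each token's assigned expert directly; the two permutations are where the communication cost of distributed MoE hides.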
