#22 · Efficiency

🔀 Sparse Mixture of Experts: Routing, Load Balancing & Expert Parallelism

Canonical Papers

Mixtral of Experts

Jiang et al., 2024, arXiv

DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models

Dai et al., 2024, ACL

DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

DeepSeek-AI, 2024, arXiv

Core Mathematics

MoE changes the economics of inference: you can scale total parameters dramatically while keeping activated compute per token constant.
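
As a rough illustration (hypothetical numbers, not any specific model): with $N_E = 8$ experts of about 7B parameters each and top-$2$ routing, the expert layers store roughly 56B parameters, yet each token's forward pass runs only its 2 selected experts, about 14B parameters' worth, so activated FFN compute per token is roughly a quarter of a dense layer with the same total capacity.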

Router probabilities (per token, per layer):

Given token hidden state $h \in \mathbb{R}^d$, a linear router produces expert probabilities:

$$z = W_r h, \qquad p(e \mid h) = \mathrm{softmax}(z)_e$$

Top-k gating (sparse activation):

Let $S = \mathrm{TopK}(p(\cdot \mid h), k)$. Only those experts run, and outputs are mixed:

$$\mathrm{MoE}(h) = \sum_{e \in S} \tilde{p}_e \cdot f_e(h), \qquad \tilde{p}_e = \frac{p(e \mid h)}{\sum_{j \in S} p(j \mid h)}$$
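
A minimal PyTorch sketch of the router and top-$k$ gating equations above. The class and names (`TopKMoE`, `d_model`, `d_ff`, `n_experts`, `top_k`) are illustrative assumptions, not code from the cited papers:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Sparse MoE layer: linear router -> softmax -> top-k experts, mixed by renormalized gates."""
    def __init__(self, d_model: int = 512, d_ff: int = 2048,
                 n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)   # W_r
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (n_tokens, d_model)
        probs = F.softmax(self.router(h), dim=-1)         # p(e | h)
        top_p, top_e = probs.topk(self.top_k, dim=-1)     # S = TopK(p, k)
        top_p = top_p / top_p.sum(dim=-1, keepdim=True)   # renormalize to get p~_e
        out = torch.zeros_like(h)
        for slot in range(self.top_k):
            for e in range(len(self.experts)):
                mask = top_e[:, slot] == e                # tokens whose slot-th choice is expert e
                if mask.any():
                    out[mask] += top_p[mask, slot].unsqueeze(-1) * self.experts[e](h[mask])
        return out
```

Calling `TopKMoE()(torch.randn(16, 512))` runs only 2 of the 8 expert MLPs for each of the 16 tokens and mixes their outputs with the renormalized gate weights.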

Load-balancing loss (prevents expert collapse):

Uses the fraction of tokens routed to expert $i$ ($f_i$) and the expert's average gating probability ($P_i$):

$$\mathcal{L}_{\text{LB}} = N_E \sum_{i=1}^{N_E} f_i \cdot P_i$$

This regularizer keeps the router from collapsing onto a small subset of experts; it is also surprisingly easy to implement incorrectly in distributed training.
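
A sketch of this loss in the same notation, assuming `probs` holds the router's softmax outputs and `top_e` the selected expert indices; papers differ in the exact normalization and in the coefficient used to weight the term, so treat this as one reasonable variant:

```python
import torch
import torch.nn.functional as F

def load_balance_loss(probs: torch.Tensor, top_e: torch.Tensor) -> torch.Tensor:
    # probs: (n_tokens, n_experts) router probabilities p(e | h)
    # top_e: (n_tokens, top_k)     indices of the selected experts per token
    n_experts = probs.shape[-1]
    one_hot = F.one_hot(top_e, n_experts).float()        # (n_tokens, top_k, n_experts)
    f = one_hot.sum(dim=1).mean(dim=0) / top_e.shape[1]  # f_i: fraction of routing slots assigned to expert i
    P = probs.mean(dim=0)                                # P_i: mean gating probability for expert i
    return n_experts * torch.sum(f * P)                  # N_E * sum_i f_i * P_i
```

Because `f` comes from a hard top-k choice, gradients flow only through `P`: the term pushes gate probability away from experts that already receive many tokens, which is exactly the pressure that counteracts collapse.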

Key Equation
$$\mathrm{MoE}(h) = \sum_{e \in S} \tilde{p}_e \cdot f_e(h)$$


Why It Matters for Modern Models

  • Mixtral (8×7B) activates only 2 experts per token—"lots of total params, few active params" is the MoE bargain
  • DeepSeek-V2: 236B total params / 21B activated per token with long context—shows MoE is co-designed with serving constraints
  • Grok-1 (314B MoE), Qwen MoE variants—MoE is a real design choice in production frontier models, not theoretical
  • MoE trades FLOPs for memory footprint + communication—routing tokens dynamically makes serving harder (token batches fragment by expert)
  • After #21 teaches serving efficiency, #22 shows how frontier labs change the model itself to keep serving economically viable at scale

Missing Intuition

What is still poorly explained in textbooks and papers:

  • MoE is "sparse compute, dense memory"—you compute only k experts but need all expert weights available (or sharded), trading FLOPs for memory + communication
  • The router is just a classifier trained by backprop—it learns a partition of token space, and without regularization happily collapses to a few experts
  • Load balancing is subtle: balancing at the wrong granularity destroys specialization (micro-batch balancing pushes toward within-sequence uniformity)
  • MoE is not an ensemble—it's conditional computation where different tokens see different subnetworks, changing training dynamics and failure modes
  • Distributed MoE ≈ all-to-all communication disguised as an MLP: tokens permute across devices twice per layer (dispatch/compute/combine); see the sketch after this list
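
The last bullet can be made concrete with a single-process simulation of the dispatch/compute/combine pattern, here with top-1 routing for clarity. In a real expert-parallel setup the grouping and un-grouping steps are all-to-all collectives across devices; the function name and shapes below are illustrative, not any framework's API:

```python
import torch

def dispatch_compute_combine(h, expert_ids, experts):
    # h: (n_tokens, d_model); expert_ids: (n_tokens,) top-1 routing decisions
    order = torch.argsort(expert_ids)                 # dispatch: group tokens by expert
    grouped = h[order]
    counts = torch.bincount(expert_ids, minlength=len(experts)).tolist()
    outputs, start = [], 0
    for e, n in enumerate(counts):                    # compute: each expert sees one contiguous chunk
        outputs.append(experts[e](grouped[start:start + n]))
        start += n
    out = torch.cat(outputs, dim=0)
    inverse = torch.empty_like(order)
    inverse[order] = torch.arange(len(order))         # combine: invert the permutation
    return out[inverse]                               # tokens back in their original order
```

With `experts = [torch.nn.Linear(64, 64) for _ in range(4)]`, `expert_ids = torch.randint(0, 4, (32,))`, and `h = torch.randn(32, 64)`, the result equals applying each token's assigned expert directly; the two permutations are where the communication cost of distributed MoE hides.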
