#23: Efficiency

MoE Serving & Scheduling: Token Dispatch, All-to-All, Disaggregated Parallelism

Canonical Papers

MegaScale-Infer: Serving Mixture-of-Experts at Scale with Disaggregated Expert Parallelism

Zhu et al., 2025, arXiv

Optimizing Mixture-of-Experts Inference Time Combining Model Deployment and Communication Scheduling

Li et al., 2024, arXiv

Mixtral of Experts

Jiang et al., 2024, arXiv

Core Mathematics

MoE serving transforms "sparse compute" into a systems scheduling problem—routing creates skew, fragmentation, and communication overhead.

Token dispatch (all-to-all pattern):

Per MoE layer, tokens are routed to experts, creating a permutation of the batch. For a batch of $T$ tokens routed to $E$ experts:

\text{Bytes}_{\text{comm}} \approx 2 \cdot T \cdot k \cdot d_{\text{model}} \cdot b

where $k$ is the top-$k$ value, $d_{\text{model}}$ is the hidden dimension, and $b$ is bytes per element; the factor of 2 accounts for dispatch plus combine communication.
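To make the formula concrete, here is a minimal sketch that plugs in illustrative Mixtral-8x7B-scale numbers; the token count, top-k, hidden dimension, and byte width are assumptions chosen for the example, not measurements.

```python
# Back-of-the-envelope dispatch + combine traffic for one MoE layer.
# All numbers below are illustrative assumptions (roughly Mixtral-8x7B-scale).

def moe_all_to_all_bytes(tokens: int, top_k: int, d_model: int,
                         bytes_per_elem: int = 2) -> int:
    """Bytes moved per MoE layer: dispatch activations to experts,
    then combine expert outputs back (hence the factor of 2)."""
    return 2 * tokens * top_k * d_model * bytes_per_elem

# Example: 8192 tokens, top-2 routing, d_model = 4096, bf16 activations.
traffic = moe_all_to_all_bytes(tokens=8192, top_k=2, d_model=4096)
print(f"{traffic / 1e9:.2f} GB of all-to-all traffic per MoE layer")
# ~0.27 GB per layer; multiply by the number of MoE layers per forward pass.
```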

Straggler latency (skew problem):

With routing skew, per-expert load varies. Layer time is dominated by the busiest expert:

t_{\text{layer}} = \max_{e \in [1,E]} t_e = \max_e \left(\frac{n_e \cdot d_{\text{model}} \cdot d_{\text{ffn}}}{\text{FLOPS}_e}\right)

where $n_e$ is the number of tokens routed to expert $e$. Even with low average load, tail latency kills throughput.
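A small simulation makes the straggler effect visible. Only the max-over-experts structure comes from the equation above; the skewed routing distribution, expert count, layer dimensions, and sustained FLOP rate are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
T, E, top_k = 8192, 8, 2            # tokens, experts, top-k (illustrative)
d_model, d_ffn = 4096, 14336        # assumed hidden / FFN dimensions
flops_per_expert = 300e12           # assumed sustained FLOP/s per expert GPU

# Skewed routing: a Dirichlet draw instead of the ideal uniform 1/E split.
probs = rng.dirichlet(np.full(E, 0.5))
n_e = rng.multinomial(T * top_k, probs)   # tokens landing on each expert

# Per-expert time from the equation above (constant factors dropped).
t_e = n_e * d_model * d_ffn / flops_per_expert
print("tokens per expert:", n_e)
print(f"mean expert time {t_e.mean()*1e3:.2f} ms, "
      f"layer time {t_e.max()*1e3:.2f} ms")
# The layer waits for max(t_e): the busiest expert sets the latency floor.
```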

Disaggregation tradeoff:

MegaScale-Infer separates attention from expert FFNs on different GPU pools:

\text{Utilization}_{\text{total}} = f(\text{attention-pool}, \text{expert-pool}, \text{pipeline-depth})

This trades extra cross-pool communication for specialization and better resource allocation, with micro-batch pipelining keeping both pools busy.
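As a rough intuition (a toy model, not MegaScale-Infer's actual scheduler or cost model), imagine the attention pool and the expert pool exchanging micro-batches in a ping-pong pipeline: the slowest of attention, expert FFN, and all-to-all transfer sets the stage time, and each pool sits idle for the rest of it.

```python
# Toy model of disaggregated serving: each pool is busy only for its own
# share of the pipeline stage time. Timings below are made up for illustration.

def pool_utilization(t_attn_ms: float, t_expert_ms: float, t_comm_ms: float):
    """Stage time is the slowest of {attention, expert FFN, all-to-all};
    utilization of each pool is its own work divided by that stage time."""
    stage = max(t_attn_ms, t_expert_ms, t_comm_ms)
    return t_attn_ms / stage, t_expert_ms / stage

u_attn, u_exp = pool_utilization(t_attn_ms=1.2, t_expert_ms=0.9, t_comm_ms=0.4)
print(f"attention pool {u_attn:.0%} busy, expert pool {u_exp:.0%} busy")
# Sizing the two pools independently (more attention GPUs or more expert GPUs)
# pushes the two times toward each other and raises overall utilization.
```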

Key Equation
t_{\text{layer}} = \max_{e} \frac{n_e \cdot d_{\text{model}} \cdot d_{\text{ffn}}}{\text{FLOPS}_e}

Why It Matters for Modern Models

  • MoE inference is not just "fewer FLOPs"—routing creates skew, and the busiest expert/GPU determines latency (the straggler problem)
  • Every MoE layer does all-to-all communication (dispatch tokens to experts, combine results)—this is memory/network bound, not compute bound
  • MegaScale-Infer reports up to 1.90× higher per-GPU throughput by disaggregating attention from expert FFNs—separating the workloads enables specialization
  • Mixtral, DeepSeek-V2/V3, DBRX all face these serving constraints—understanding MoE serving explains real production deployment decisions
  • After #21 (serving) and #22 (MoE routing), #23 explains what actually breaks when you combine them at scale

Missing Intuition

What is still poorly explained in textbooks and papers:

  • Sparsity buys you FLOP savings but sells you a scheduling problem—load becomes bursty and skewed, and the busiest expert dictates latency
  • MoE inference adds two collective communications per layer (dispatch/combine)—not just matmuls, but all-to-all patterns plus synchronization
  • Batch size is not "free" in MoE decoding—you need enough tokens per expert for GEMM efficiency, but you are constrained by the KV cache and latency SLOs (see the sketch after this list)
  • Expert parallelism changes what scales—you're scaling placement, routing-induced traffic, and microbatching strategy, not just tensor parallelism
  • Disaggregation is the new architecture knob—attention and expert FFNs can be scaled/deployed differently with pipelining to keep both busy
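A minimal sketch of the batch-size tension from the third bullet above; the expert count, top-k, and decode batch sizes are illustrative assumptions, and perfect load balance is assumed (routing skew only makes the smallest expert GEMMs smaller).

```python
# Expected GEMM rows per expert during decoding, under balanced routing.

def expected_tokens_per_expert(batch_tokens: int, top_k: int, num_experts: int) -> float:
    return batch_tokens * top_k / num_experts

for batch in (32, 256, 2048):
    rows = expected_tokens_per_expert(batch, top_k=2, num_experts=64)
    print(f"decode batch {batch:>4} -> ~{rows:6.1f} rows per expert GEMM")
# With many fine-grained experts, a small decode batch leaves each expert GEMM
# with only a handful of rows (memory-bound); growing the batch restores GEMM
# efficiency but is capped by KV-cache memory and latency SLOs.
```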
