MoE Serving & Scheduling: Token Dispatch, All-to-All, Disaggregated Parallelism
Canonical Papers
- MegaScale-Infer: Serving Mixture-of-Experts at Scale with Disaggregated Expert Parallelism
- Optimizing Mixture-of-Experts Inference Time Combining Model Deployment and Communication Scheduling
- Mixtral of Experts
Core Mathematics
MoE serving transforms "sparse compute" into a systems scheduling problem—routing creates skew, fragmentation, and communication overhead.
Token dispatch (all-to-all pattern):
Per MoE layer, tokens are permuted to their assigned experts. For a batch of $B$ tokens routed across $E$ experts, the per-layer traffic is roughly

$$V_{\text{a2a}} \approx 2 \cdot B \cdot k \cdot d \cdot s \ \text{bytes}$$

where $k$ is top-k, $d$ is the hidden dimension, and $s$ is bytes/element. The factor of 2 covers dispatch + combine communication.
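A minimal sketch of this volume estimate, assuming the formula above; the function name and the Mixtral-like numbers are illustrative, not measured values:

```python
def moe_all_to_all_bytes(batch_tokens: int, top_k: int, hidden_dim: int,
                         bytes_per_elem: int = 2) -> int:
    """Approximate all-to-all traffic per MoE layer (dispatch + combine).

    Assumes every routed copy of a token moves its full hidden vector across
    the interconnect; ignores locality (tokens whose expert lives on the same
    GPU) and router metadata.
    """
    one_way = batch_tokens * top_k * hidden_dim * bytes_per_elem
    return 2 * one_way  # dispatch + combine

# Example: Mixtral-like shape, 4096 tokens in flight, top-2 routing,
# hidden dim 4096, fp16 activations.
vol = moe_all_to_all_bytes(batch_tokens=4096, top_k=2, hidden_dim=4096)
print(f"{vol / 1e9:.2f} GB per MoE layer")  # ~0.13 GB, paid at every MoE layer
```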
Straggler latency (skew problem):
With routing skew, per-expert load varies. Layer time is dominated by the busiest expert:

$$T_{\text{layer}} \propto \max_{e} \, n_e$$

where $n_e$ is the number of tokens routed to expert $e$. Even with low average load, this tail latency kills throughput.
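A toy routing simulation that makes the max-vs-mean gap concrete; the biased sampler and all parameters are assumptions for illustration, not a real gating network:

```python
import random
from collections import Counter

def simulate_expert_load(batch_tokens: int, num_experts: int, top_k: int,
                         hot_expert_bias: float = 0.0, seed: int = 0):
    """Toy top-k router: sample expert assignments per token and compare the
    mean per-expert load with the busiest expert's load. hot_expert_bias
    skews probability mass toward expert 0 to mimic routing skew.
    """
    rng = random.Random(seed)
    weights = [1.0 + (hot_expert_bias if e == 0 else 0.0) for e in range(num_experts)]
    load = Counter()
    for _ in range(batch_tokens):
        chosen = set()
        while len(chosen) < top_k:  # k distinct experts per token
            chosen.add(rng.choices(range(num_experts), weights=weights)[0])
        load.update(chosen)
    mean_load = batch_tokens * top_k / num_experts
    return mean_load, max(load.values())

for bias in (0.0, 2.0, 8.0):
    mean, worst = simulate_expert_load(4096, num_experts=8, top_k=2, hot_expert_bias=bias)
    print(f"bias={bias}: mean {mean:.0f} tokens/expert, busiest {worst} "
          f"({worst / mean:.2f}x -> layer time scales with the max, not the mean)")
```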
Disaggregation tradeoff:
MegaScale-Infer places attention and expert FFNs on separate GPU pools that can be sized and parallelized independently, with microbatches pipelined between them. The added cross-pool communication is traded for specialization and better resource allocation.
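A crude model of why the trade can pay off, assuming a two-stage ping-pong pipeline with invented timings; this is a sketch, not MegaScale-Infer's actual scheduler or numbers:

```python
def pingpong_step_time(attn_ms: float, ffn_ms: float, comm_ms: float,
                       num_microbatches: int) -> float:
    """Crude two-stage pipeline model for disaggregated MoE decoding.

    A batch is split into microbatches that alternate between an attention
    pool and an expert-FFN pool, so each pool works on one microbatch while
    the other handles the next. All timings are illustrative assumptions.
    """
    stage = max(attn_ms, ffn_ms) + comm_ms  # the slower stage bounds throughput
    fill = min(attn_ms, ffn_ms) + comm_ms   # pipeline fill/drain, paid once
    return num_microbatches * stage + fill

# Naive colocated baseline: same per-microbatch timings, attention and FFN
# serialized on the same GPUs with no cross-pool transfer.
colocated = 8 * (3.0 + 2.0)
disagg = pingpong_step_time(attn_ms=3.0, ffn_ms=2.0, comm_ms=0.5, num_microbatches=8)
print(f"colocated ~{colocated:.1f} ms/step, disaggregated ~{disagg:.1f} ms/step")
```

The gain in this toy model comes from overlapping the two stages; it shrinks as the stages become unbalanced or as cross-pool communication stops being hidden.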
Why It Matters for Modern Models
- MoE inference is not just "fewer FLOPs": routing creates skew, and the busiest expert/GPU determines latency (the straggler problem)
- Every MoE layer does all-to-all communication (dispatch tokens to experts, combine results)—this is memory/network bound, not compute bound
- MegaScale-Infer shows 1.90× per-GPU throughput by disaggregating attention vs expert FFNs—separating workloads enables specialization
- Mixtral, DeepSeek-V2/V3, DBRX all face these serving constraints—understanding MoE serving explains real production deployment decisions
- After #21 (serving) and #22 (MoE routing), #23 explains what actually breaks when you combine them at scale
Missing Intuition
What is still poorly explained in textbooks and papers:
- Sparsity buys FLOPs but sells you a scheduling problem—load becomes bursty and skewed, busiest expert dictates latency
- MoE inference is two collective communications per layer (dispatch/combine)—not just matmuls, but all-to-all patterns + synchronization
- Batch size is not "free" in MoE decoding: you need enough tokens per expert for GEMM efficiency, but you're constrained by KV cache + SLOs (see the sketch after this list)
- Expert parallelism changes what scales—you're scaling placement, routing-induced traffic, and microbatching strategy, not just tensor parallelism
- Disaggregation is the new architecture knob—attention and expert FFNs can be scaled/deployed differently with pipelining to keep both busy
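A back-of-envelope sketch for the batch-size point above; the GEMM-efficiency threshold is a made-up stand-in for the real roofline, and only the expert count and top-k mirror a DeepSeek-V3-like layout:

```python
def tokens_per_expert(decode_batch: int, top_k: int, num_experts: int) -> float:
    """Expected tokens each expert sees per decode step under uniform routing."""
    return decode_batch * top_k / num_experts

# Illustrative: 256 routed experts, top-8 per token. Assume the expert GEMM
# only reaches good utilization above ~128 rows (assumed threshold).
MIN_ROWS_FOR_EFFICIENT_GEMM = 128
for batch in (256, 1024, 4096, 16384):
    rows = tokens_per_expert(batch, top_k=8, num_experts=256)
    status = "ok" if rows >= MIN_ROWS_FOR_EFFICIENT_GEMM else "GEMM-starved"
    print(f"decode batch {batch:>6}: ~{rows:>5.0f} tokens/expert -> {status}")
```

Under these assumptions, only decode batches in the thousands of tokens keep each expert's GEMM busy, and batches that large are exactly what KV-cache capacity and latency SLOs push back against.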