#23: Efficiency

MoE Serving & Scheduling: Token Dispatch, All-to-All, Disaggregated Parallelism

Canonical Papers

MegaScale-Infer: Serving Mixture-of-Experts at Scale with Disaggregated Expert Parallelism

Zhu et al., 2025, arXiv

Optimizing Mixture-of-Experts Inference Time Combining Model Deployment and Communication Scheduling

Li et al., 2024, arXiv

Mixtral of Experts

Jiang et al., 2024, arXiv

Core Mathematics

MoE serving transforms "sparse compute" into a systems scheduling problem—routing creates skew, fragmentation, and communication overhead.

Token dispatch (all-to-all pattern):

Per MoE layer, tokens are routed to experts, creating a permutation of the batch. For a batch of $T$ tokens routed to $E$ experts:

\text{Bytes}_{\text{comm}} \approx 2 \cdot T \cdot k \cdot d_{\text{model}} \cdot b

where $k$ is the top-$k$ value, $d_{\text{model}}$ is the hidden dimension, and $b$ is bytes per element; the factor of 2 accounts for dispatch plus combine communication.
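To make the formula concrete, here is a minimal sketch that plugs in illustrative Mixtral-8x7B-scale numbers; the token count, top-k, hidden dimension, and byte width are assumptions chosen for the example, not measurements.

```python
# Back-of-the-envelope dispatch + combine traffic for one MoE layer.
# All numbers below are illustrative assumptions (roughly Mixtral-8x7B-scale).

def moe_all_to_all_bytes(tokens: int, top_k: int, d_model: int,
                         bytes_per_elem: int = 2) -> int:
    """Bytes moved per MoE layer: dispatch activations to experts,
    then combine expert outputs back (hence the factor of 2)."""
    return 2 * tokens * top_k * d_model * bytes_per_elem

# Example: 8192 tokens, top-2 routing, d_model = 4096, bf16 activations.
traffic = moe_all_to_all_bytes(tokens=8192, top_k=2, d_model=4096)
print(f"{traffic / 1e9:.2f} GB of all-to-all traffic per MoE layer")
# ~0.27 GB per layer; multiply by the number of MoE layers per forward pass.
```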

Straggler latency (skew problem):

With routing skew, per-expert load varies. Layer time is dominated by the busiest expert:

t_{\text{layer}} = \max_{e \in [1,E]} t_e = \max_e \left(\frac{n_e \cdot d_{\text{model}} \cdot d_{\text{ffn}}}{\text{FLOPS}_e}\right)

where $n_e$ is the number of tokens routed to expert $e$. Even with low average load, tail latency kills throughput.
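A small simulation makes the straggler effect visible. Only the max-over-experts structure comes from the equation above; the skewed routing distribution, expert count, layer dimensions, and sustained FLOP rate are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
T, E, top_k = 8192, 8, 2            # tokens, experts, top-k (illustrative)
d_model, d_ffn = 4096, 14336        # assumed hidden / FFN dimensions
flops_per_expert = 300e12           # assumed sustained FLOP/s per expert GPU

# Skewed routing: a Dirichlet draw instead of the ideal uniform 1/E split.
probs = rng.dirichlet(np.full(E, 0.5))
n_e = rng.multinomial(T * top_k, probs)   # tokens landing on each expert

# Per-expert time from the equation above (constant factors dropped).
t_e = n_e * d_model * d_ffn / flops_per_expert
print("tokens per expert:", n_e)
print(f"mean expert time {t_e.mean()*1e3:.2f} ms, "
      f"layer time {t_e.max()*1e3:.2f} ms")
# The layer waits for max(t_e): the busiest expert sets the latency floor.
```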

Disaggregation tradeoff:

MegaScale-Infer separates attention from expert FFNs on different GPU pools:

\text{Utilization}_{\text{total}} = f(\text{attention-pool}, \text{expert-pool}, \text{pipeline-depth})

This trades extra cross-pool communication for specialization and better resource allocation, with micro-batch pipelining keeping both pools busy.
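As a rough intuition (a toy model, not MegaScale-Infer's actual scheduler or cost model), imagine the attention pool and the expert pool exchanging micro-batches in a ping-pong pipeline: the slowest of attention, expert FFN, and all-to-all transfer sets the stage time, and each pool sits idle for the rest of it.

```python
# Toy model of disaggregated serving: each pool is busy only for its own
# share of the pipeline stage time. Timings below are made up for illustration.

def pool_utilization(t_attn_ms: float, t_expert_ms: float, t_comm_ms: float):
    """Stage time is the slowest of {attention, expert FFN, all-to-all};
    utilization of each pool is its own work divided by that stage time."""
    stage = max(t_attn_ms, t_expert_ms, t_comm_ms)
    return t_attn_ms / stage, t_expert_ms / stage

u_attn, u_exp = pool_utilization(t_attn_ms=1.2, t_expert_ms=0.9, t_comm_ms=0.4)
print(f"attention pool {u_attn:.0%} busy, expert pool {u_exp:.0%} busy")
# Sizing the two pools independently (more attention GPUs or more expert GPUs)
# pushes the two times toward each other and raises overall utilization.
```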

Key Equation
t_{\text{layer}} = \max_{e} \frac{n_e \cdot d_{\text{model}} \cdot d_{\text{ffn}}}{\text{FLOPS}_e}

Why It Matters for Modern Models

  • MoE inference is not just "fewer FLOPs"—routing creates skew, and the busiest expert/GPU determines latency (the straggler problem)
  • Every MoE layer does all-to-all communication (dispatch tokens to experts, combine results)—this is memory/network bound, not compute bound
  • MegaScale-Infer reports up to 1.90× higher per-GPU throughput by disaggregating attention from expert FFNs—separating the workloads enables specialization
  • Mixtral, DeepSeek-V2/V3, DBRX all face these serving constraints—understanding MoE serving explains real production deployment decisions
  • After #21 (serving) and #22 (MoE routing), #23 explains what actually breaks when you combine them at scale

Missing Intuition

What is still poorly explained in textbooks and papers:

  • Sparsity buys you FLOP savings but sells you a scheduling problem—load becomes bursty and skewed, and the busiest expert dictates latency
  • MoE inference adds two collective communications per layer (dispatch/combine)—not just matmuls, but all-to-all patterns plus synchronization
  • Batch size is not "free" in MoE decoding—you need enough tokens per expert for GEMM efficiency, but you are constrained by the KV cache and latency SLOs (see the sketch after this list)
  • Expert parallelism changes what scales—you're scaling placement, routing-induced traffic, and microbatching strategy, not just tensor parallelism
  • Disaggregation is the new architecture knob—attention and expert FFNs can be scaled/deployed differently with pipelining to keep both busy
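A minimal sketch of the batch-size tension from the third bullet above; the expert count, top-k, and decode batch sizes are illustrative assumptions, and perfect load balance is assumed (routing skew only makes the smallest expert GEMMs smaller).

```python
# Expected GEMM rows per expert during decoding, under balanced routing.

def expected_tokens_per_expert(batch_tokens: int, top_k: int, num_experts: int) -> float:
    return batch_tokens * top_k / num_experts

for batch in (32, 256, 2048):
    rows = expected_tokens_per_expert(batch, top_k=2, num_experts=64)
    print(f"decode batch {batch:>4} -> ~{rows:6.1f} rows per expert GEMM")
# With many fine-grained experts, a small decode batch leaves each expert GEMM
# with only a handful of rows (memory-bound); growing the batch restores GEMM
# efficiency but is capped by KV-cache memory and latency SLOs.
```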
