LLM Serving at Scale: Prefill, Decode & Continuous Batching
Canonical Papers
- Efficient Memory Management for Large Language Model Serving with PagedAttention
- DistServe: Disaggregating Prefill and Decoding for Goodput-optimized LLM Serving
- vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention
Core Mathematics
Production LLM inference is not just "run attention"—it's multi-user scheduling under latency constraints with KV-cache memory as the bottleneck.
Latency decomposition (the serving mental model): for a request that emits $N_{\text{out}}$ tokens,

$$T_{\text{request}} = T_{\text{TTFT}} + (N_{\text{out}} - 1) \cdot T_{\text{TPOT}}$$

TTFT (time to first token) is dominated by prefill, the parallel processing of the prompt; TPOT (time per output token) is dominated by decode, the sequential generation that reads the KV cache at every step.
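A quick worked instance of the decomposition; the numbers below (ttft_s, tpot_s, n_out) are illustrative assumptions, not measurements:

```python
# Illustrative latency decomposition for one request (made-up numbers).
ttft_s = 0.25          # time to first token: one parallel prefill pass over the prompt
tpot_s = 0.03          # time per output token: one sequential decode step
n_out = 200            # total output tokens generated

total_s = ttft_s + (n_out - 1) * tpot_s
print(f"total latency ≈ {total_s:.2f} s")   # 0.25 + 199 * 0.03 ≈ 6.22 s
```

Note how TPOT dominates for long outputs, which is why decode efficiency governs perceived streaming speed.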
Goodput (what systems optimize): throughput counted only over requests that meet their latency SLOs, typically normalized per GPU:

$$\text{goodput} = \frac{\#\{\text{requests completed within both the TTFT and TPOT SLOs}\}}{\text{elapsed time} \times \#\text{GPUs}}$$

Production systems maximize goodput under service-level objectives (SLOs), not raw token throughput.
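A minimal sketch of how goodput differs from raw throughput, assuming per-request TTFT/TPOT measurements; RequestStats, the 0.5 s / 0.05 s SLO limits, and the traffic below are all made up:

```python
from dataclasses import dataclass

@dataclass
class RequestStats:
    ttft_s: float   # measured time to first token
    tpot_s: float   # measured average time per output token

def goodput(stats, elapsed_s, n_gpus, ttft_slo_s=0.5, tpot_slo_s=0.05):
    """Requests per second per GPU that met BOTH latency SLOs."""
    ok = sum(1 for r in stats if r.ttft_s <= ttft_slo_s and r.tpot_s <= tpot_slo_s)
    return ok / (elapsed_s * n_gpus)

# Three requests finish in 10 s on one GPU; one of them blows its TTFT SLO.
reqs = [RequestStats(0.3, 0.04), RequestStats(0.4, 0.045), RequestStats(0.9, 0.03)]
print(goodput(reqs, elapsed_s=10.0, n_gpus=1))   # 0.2 req/s/GPU, while raw throughput is 0.3
```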
KV cache paging/fragmentation cost: let $B$ be the page/block size in tokens. A sequence of current length $L$ occupies $\lceil L/B \rceil$ blocks, so its internal fragmentation is

$$B \cdot \lceil L/B \rceil - L < B \text{ tokens,}$$

i.e. less than one block, whereas contiguous preallocation to the maximum sequence length can strand the entire unused tail. PagedAttention allocates KV in fixed-size blocks to eliminate this fragmentation, enable dynamic memory management, and allow block-level sharing across sequences.
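A minimal sketch of the block arithmetic, assuming a hypothetical block size of 16 tokens and, for comparison, a contiguous allocator that reserves a 4096-token maximum per request:

```python
import math

def kv_blocks(seq_len, block_size=16):
    """Blocks needed for a sequence and its internal fragmentation in token slots."""
    n_blocks = math.ceil(seq_len / block_size)
    waste = n_blocks * block_size - seq_len      # always < block_size
    return n_blocks, waste

# Paged: waste is bounded by one block per sequence.
print(kv_blocks(1000))        # (63, 8) -> 63 blocks, 8 wasted token slots

# Contiguous preallocation to max_seq_len strands the whole unused tail instead.
max_seq_len = 4096
print(max_seq_len - 1000)     # 3096 wasted token slots for the same 1000-token request
```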
Key Equation
The quantity all of this bookkeeping manages is KV-cache memory:

$$\text{KV bytes per token} = 2 \times n_{\text{layers}} \times n_{\text{kv-heads}} \times d_{\text{head}} \times \text{bytes per element}$$

where the factor 2 accounts for keys and values; total KV memory is this times sequence length times batch size, which is what ultimately caps the batch.
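A worked instance of the formula for an assumed 7B-class configuration (32 layers, 32 KV heads, head dimension 128, fp16 cache); the config and batch numbers are illustrative:

```python
# KV-cache footprint for an assumed 7B-class model with an fp16 cache (illustrative).
n_layers, n_kv_heads, d_head, bytes_per_elem = 32, 32, 128, 2

kv_bytes_per_token = 2 * n_layers * n_kv_heads * d_head * bytes_per_elem
print(kv_bytes_per_token)                       # 524288 bytes = 0.5 MiB per token

seq_len, batch = 4096, 16
total_gib = kv_bytes_per_token * seq_len * batch / 2**30
print(f"{total_gib:.0f} GiB of KV cache")       # 32 GiB before counting weights or activations
```

At half a mebibyte per token, the cache, not the weights, is what decides how many sequences fit in a batch.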
Why It Matters for Modern Models
- vLLM/PagedAttention is the production standard for open-source LLM serving—near-zero memory waste, dynamic batching
- Prefill and decode are fundamentally different workloads (compute-bound and parallel vs. memory-bandwidth-bound and sequential); DistServe reports up to 4.48× higher goodput from separating them (see the roofline sketch after this list)
- Continuous batching keeps GPUs busy under variable request arrivals—static batching wastes resources waiting for all requests to finish
- KV cache memory grows with context length and limits batch size—paging makes this predictable and efficient
- Disaggregation (separate prefill/decode clusters) is the 2024-2025 frontier for production serving architecture
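A back-of-the-envelope roofline argument for why the two phases behave so differently; the hardware figures (300 TFLOP/s peak, 2 TB/s HBM) and fp16 weights are assumptions, and activation/KV traffic is ignored:

```python
# Roofline sketch for the weight matmuls only (assumed generic accelerator).
peak_flops, hbm_bw = 300e12, 2e12
machine_balance = peak_flops / hbm_bw            # ~150 FLOPs/byte to stay compute-bound

def arithmetic_intensity(tokens_in_flight, bytes_per_param=2):
    # 2 FLOPs (multiply + add) per parameter per token vs. bytes read per parameter.
    return 2 * tokens_in_flight / bytes_per_param

print(arithmetic_intensity(2048), machine_balance)   # prefill, 2048-token prompt: 2048 >> 150, compute-bound
print(arithmetic_intensity(1), machine_balance)      # decode, one token per sequence per step: 1 << 150, memory-bound
```

Batching decode sequences raises the intensity for the weight matmuls, but the per-sequence KV-cache reads do not amortize across the batch, which is what keeps decode memory-bound in practice.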
Missing Intuition
What is still poorly explained in textbooks and papers:
- LLM inference comprises two different workloads: prefill (big parallel matmuls) and decode (tiny matmuls + huge KV reads)—mixing them creates interference
- Continuous batching is not "bigger batches": it means maintaining a rolling set of active sequences as requests arrive and complete (a toy scheduler loop is sketched after this list)
- KV cache is not just memory usage—it's a scheduler constraint that determines max batch size
- Paging solves fragmentation: requests grow/shrink dynamically, contiguous allocation wastes memory, blocks/pages enable efficient sharing
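To make the last three points concrete, here is a toy continuous-batching loop that uses the KV-block budget as its admission constraint; the Request class, block sizes, and scheduling policy are hypothetical simplifications, not any particular engine's API:

```python
import math
from collections import deque

BLOCK_SIZE = 16        # KV tokens per block (assumed)
TOTAL_BLOCKS = 4096    # KV memory budget expressed in blocks (assumed)

class Request:
    def __init__(self, rid, prompt_len, max_new_tokens):
        self.rid, self.len, self.remaining = rid, prompt_len, max_new_tokens

    def blocks_needed(self):
        return math.ceil(self.len / BLOCK_SIZE)

def continuous_batching_step(waiting: deque, running: list, free_blocks: int) -> int:
    """One scheduler iteration: admit while KV blocks allow, then decode the whole batch."""
    # Admission: the KV-block budget, not a fixed batch size, decides what runs.
    while waiting and waiting[0].blocks_needed() <= free_blocks:
        req = waiting.popleft()
        free_blocks -= req.blocks_needed()       # reserve blocks for the prompt (prefill)
        running.append(req)

    # Decode: every running sequence emits one token; grow by a block only on overflow.
    for req in list(running):
        req.len += 1
        req.remaining -= 1
        if req.len % BLOCK_SIZE == 1:            # crossed into a new block
            free_blocks -= 1
        if req.remaining == 0:                   # finished: free blocks immediately so the
            running.remove(req)                  # vacated slot can be refilled next iteration
            free_blocks += math.ceil(req.len / BLOCK_SIZE)
    return free_blocks

# Usage: start with the full budget and call the step in a loop as traffic arrives, e.g.
#   free = TOTAL_BLOCKS
#   while True: free = continuous_batching_step(waiting, running, free)
```

A production engine additionally has to preempt or swap sequences when decode growth exhausts the block pool, and to decide how aggressively to interleave new prefills with ongoing decodes, which is exactly the interference that prefill/decode disaggregation removes.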