21 · Efficiency

LLM Serving at Scale: Prefill, Decode & Continuous Batching

Canonical Papers

Efficient Memory Management for Large Language Model Serving with PagedAttention

Kwon et al., 2023, SOSP
Read paper →

DistServe: Disaggregating Prefill and Decoding for Goodput-optimized LLM Serving

Zhong et al., 2024, OSDI
Read paper →

vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention

Prabhu et al., 2024, arXiv
Read paper →

Core Mathematics

Production LLM inference is not just "run attention"—it's multi-user scheduling under latency constraints with KV-cache memory as the bottleneck.

Latency decomposition (the serving mental model):

$$\text{Latency} \approx \underbrace{\text{TTFT}}_{\text{time to first token}} + (T_{\text{out}}-1) \cdot \underbrace{\text{TPOT}}_{\text{time per output token}}$$

TTFT is dominated by prefill (parallel processing of the prompt); TPOT is dominated by decode (sequential generation bound by KV-cache reads).
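
To make the decomposition concrete, here is a minimal Python sketch; the TTFT/TPOT values are hypothetical placeholders, not measurements from any system.

```python
def estimate_latency(ttft_s: float, tpot_s: float, n_out: int) -> float:
    """End-to-end latency under the TTFT + (T_out - 1) * TPOT model."""
    return ttft_s + (n_out - 1) * tpot_s

# Hypothetical numbers: 200 ms prefill, 30 ms per decoded token, 256 output tokens.
ttft, tpot, n_out = 0.200, 0.030, 256
total = estimate_latency(ttft, tpot, n_out)
print(f"total = {total:.2f} s, "
      f"TTFT share: {ttft / total:.1%}, decode share: {(total - ttft) / total:.1%}")
# For long generations the decode phase dominates, which is why TPOT
# (memory-bound KV reads) matters more than raw prefill FLOPs.
```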

Goodput (what systems optimize):

$$\text{Goodput} = \text{Throughput} \times \Pr\!\left(\text{TTFT} \le S_{\text{TTFT}} \,\wedge\, \text{TPOT} \le S_{\text{TPOT}}\right)$$

Production systems maximize goodput under service-level objectives (SLOs).
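
A minimal sketch of how goodput could be measured over a window of completed requests; the `RequestStats` fields and the SLO thresholds are illustrative assumptions, not taken from DistServe or any particular serving stack.

```python
from dataclasses import dataclass

@dataclass
class RequestStats:
    ttft_s: float   # time to first token
    tpot_s: float   # average time per output token
    n_out: int      # tokens generated

def goodput(requests: list[RequestStats], window_s: float,
            slo_ttft_s: float = 0.5, slo_tpot_s: float = 0.05) -> float:
    """Requests per second that met BOTH latency SLOs (hypothetical SLO values)."""
    ok = sum(1 for r in requests
             if r.ttft_s <= slo_ttft_s and r.tpot_s <= slo_tpot_s)
    return ok / window_s

# Toy data: three requests observed over a 10 s window.
reqs = [RequestStats(0.3, 0.04, 128),   # meets both SLOs
        RequestStats(0.9, 0.04, 128),   # TTFT violation
        RequestStats(0.2, 0.08, 128)]   # TPOT violation
print(f"goodput = {goodput(reqs, window_s=10.0):.2f} req/s")  # 0.10 req/s
```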

KV-cache paging/fragmentation cost, with P the page/block size:

$$\text{KV-mem}(T) \propto \left\lceil \frac{T}{P} \right\rceil \cdot P \quad \Rightarrow \quad \text{waste}(T) \propto \left\lceil \frac{T}{P} \right\rceil \cdot P - T$$

PagedAttention allocates KV cache in fixed-size blocks, which eliminates external fragmentation, bounds internal waste to less than one block per sequence, and enables dynamic memory management and sharing.
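
The bookkeeping behind block-wise allocation can be sketched in a few lines; the block size, request length, and reserved maximum below are arbitrary example values, and this mimics the accounting rather than the actual vLLM allocator.

```python
import math

def paged_kv_blocks(n_tokens: int, block_size: int) -> tuple[int, int]:
    """Blocks allocated and token slots wasted when KV is stored in fixed-size blocks."""
    blocks = math.ceil(n_tokens / block_size)
    waste = blocks * block_size - n_tokens   # always < block_size per sequence
    return blocks, waste

def contiguous_waste(n_tokens: int, max_len: int) -> int:
    """Waste if KV space is pre-reserved for the maximum sequence length."""
    return max_len - n_tokens

T, P, MAX_LEN = 1_000, 16, 4_096   # hypothetical request and limits
blocks, wasted = paged_kv_blocks(T, P)
print(f"paged: {blocks} blocks, {wasted} wasted token slots")            # 63 blocks, 8 wasted
print(f"contiguous reservation wastes {contiguous_waste(T, MAX_LEN)}")   # 3096 wasted
```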

Key Equation
$$\text{Latency} \approx \text{TTFT} + (T_{\text{out}}-1) \cdot \text{TPOT}$$

Why It Matters for Modern Models

  • vLLM/PagedAttention is the production standard for open-source LLM serving—near-zero memory waste, dynamic batching
  • Prefill and decode are fundamentally different workloads (compute-bound parallel vs memory-bound sequential)—DistServe shows 4.48× speedup by separating them
  • Continuous batching keeps GPUs busy under variable request arrivals—static batching wastes resources waiting for all requests to finish
  • KV cache memory grows with context length and limits batch size; paging makes this predictable and efficient (a back-of-the-envelope sizing sketch follows this list)
  • Disaggregation (separate prefill/decode clusters) is the 2024-2025 frontier for production serving architecture
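
The batch-size constraint above can be made concrete with a back-of-the-envelope calculation; the model shape (roughly a 7B-class dense model in fp16) and the 40 GiB KV budget are assumptions for illustration.

```python
def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 2) -> int:
    """K and V tensors for one token across all layers."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem

# Illustrative 7B-class shape: 32 layers, 32 KV heads, head_dim 128, fp16.
per_token = kv_bytes_per_token(32, 32, 128)   # 0.5 MiB per token
ctx_len = 2048                                # tokens per sequence (assumed)
kv_budget = 40 * 1024**3                      # 40 GiB left after weights (assumed)

per_seq = per_token * ctx_len
max_batch = kv_budget // per_seq
print(f"{per_token / 2**20:.2f} MiB/token, {per_seq / 2**30:.2f} GiB/seq, "
      f"max batch = {max_batch}")
# Longer contexts shrink max_batch linearly, which is why the scheduler must
# treat KV memory, not compute, as the binding constraint.
```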

Missing Intuition

What is still poorly explained in textbooks and papers:

  • LLM inference comprises two different workloads: prefill (big parallel matmuls) and decode (tiny matmuls + huge KV reads)—mixing them creates interference
  • Continuous batching is not "bigger batches": it's maintaining a rolling set of active sequences as requests arrive and complete (see the scheduler sketch after this list)
  • KV cache is not just memory usage—it's a scheduler constraint that determines max batch size
  • Paging solves fragmentation: requests grow/shrink dynamically, contiguous allocation wastes memory, blocks/pages enable efficient sharing
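
Below is a highly simplified sketch of a continuous-batching control loop with a KV-block admission check; every name and constant in it (`Seq`, `BLOCK`, `TOTAL_BLOCKS`, the scheduler structure) is invented for illustration and does not correspond to any real scheduler's API.

```python
from collections import deque
from dataclasses import dataclass

BLOCK = 16            # KV block size in tokens (assumed)
TOTAL_BLOCKS = 1000   # size of the KV block pool (assumed)

@dataclass
class Seq:
    prompt_len: int
    max_new: int
    generated: int = 0

    def blocks_needed(self) -> int:
        tokens = self.prompt_len + self.generated + 1   # room for the next token
        return -(-tokens // BLOCK)                      # ceiling division

def continuous_batching(waiting: deque, steps: int) -> None:
    active: list[Seq] = []
    free_blocks = TOTAL_BLOCKS
    for _ in range(steps):
        # Admit new requests while KV blocks are available (prefill happens here).
        while waiting and waiting[0].blocks_needed() <= free_blocks:
            seq = waiting.popleft()
            free_blocks -= seq.blocks_needed()
            active.append(seq)
        if not active:
            break
        # One decode step for every active sequence ("iteration-level" scheduling).
        finished = []
        for seq in active:
            before = seq.blocks_needed()
            seq.generated += 1
            free_blocks -= seq.blocks_needed() - before  # KV grows lazily, block by block
            if seq.generated >= seq.max_new:
                finished.append(seq)
        # Completed sequences release their blocks immediately, so newcomers can
        # join on the very next iteration instead of waiting for the whole batch.
        for seq in finished:
            active.remove(seq)
            free_blocks += seq.blocks_needed()
        print(f"active={len(active)} waiting={len(waiting)} free_blocks={free_blocks}")

continuous_batching(deque([Seq(512, 64), Seq(128, 8), Seq(2048, 32)]), steps=10)
```

The admission check is the point where "KV cache as a scheduler constraint" becomes visible: the loop is gated on free blocks, not on compute.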
