21 · Efficiency

LLM Serving at Scale: Prefill, Decode & Continuous Batching

Canonical Papers

Efficient Memory Management for Large Language Model Serving with PagedAttention

Kwon et al., 2023, SOSP
Read paper →

DistServe: Disaggregating Prefill and Decoding for Goodput-optimized LLM Serving

Zhong et al., 2024, OSDI
Read paper →

vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention

Prabhu et al., 2024, arXiv
Read paper →

Core Mathematics

Production LLM inference is not just "run attention"—it's multi-user scheduling under latency constraints with KV-cache memory as the bottleneck.

Latency decomposition (the serving mental model):

$$\text{Latency} \approx \underbrace{\text{TTFT}}_{\text{time to first token}} + (T_{\text{out}}-1) \cdot \underbrace{\text{TPOT}}_{\text{time per output token}}$$

TTFT is dominated by prefill (parallel processing of the prompt); TPOT is dominated by decode (sequential generation bound by KV-cache reads).
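
To make the decomposition concrete, here is a minimal Python sketch; the TTFT/TPOT values are hypothetical placeholders, not measurements from any system.

```python
def estimate_latency(ttft_s: float, tpot_s: float, n_out: int) -> float:
    """End-to-end latency under the TTFT + (T_out - 1) * TPOT model."""
    return ttft_s + (n_out - 1) * tpot_s

# Hypothetical numbers: 200 ms prefill, 30 ms per decoded token, 256 output tokens.
ttft, tpot, n_out = 0.200, 0.030, 256
total = estimate_latency(ttft, tpot, n_out)
print(f"total = {total:.2f} s, "
      f"TTFT share: {ttft / total:.1%}, decode share: {(total - ttft) / total:.1%}")
# For long generations the decode phase dominates, which is why TPOT
# (memory-bound KV reads) matters more than raw prefill FLOPs.
```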

Goodput (what systems optimize):

$$\text{Goodput} = \text{Throughput} \times \Pr\!\left(\text{TTFT} \le S_{\text{TTFT}} \,\wedge\, \text{TPOT} \le S_{\text{TPOT}}\right)$$

Production systems maximize goodput under service-level objectives (SLOs).
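
A minimal sketch of how goodput could be measured over a window of completed requests; the `RequestStats` fields and the SLO thresholds are illustrative assumptions, not taken from DistServe or any particular serving stack.

```python
from dataclasses import dataclass

@dataclass
class RequestStats:
    ttft_s: float   # time to first token
    tpot_s: float   # average time per output token
    n_out: int      # tokens generated

def goodput(requests: list[RequestStats], window_s: float,
            slo_ttft_s: float = 0.5, slo_tpot_s: float = 0.05) -> float:
    """Requests per second that met BOTH latency SLOs (hypothetical SLO values)."""
    ok = sum(1 for r in requests
             if r.ttft_s <= slo_ttft_s and r.tpot_s <= slo_tpot_s)
    return ok / window_s

# Toy data: three requests observed over a 10 s window.
reqs = [RequestStats(0.3, 0.04, 128),   # meets both SLOs
        RequestStats(0.9, 0.04, 128),   # TTFT violation
        RequestStats(0.2, 0.08, 128)]   # TPOT violation
print(f"goodput = {goodput(reqs, window_s=10.0):.2f} req/s")  # 0.10 req/s
```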

KV-cache paging/fragmentation cost, with P the page/block size:

$$\text{KV-mem}(T) \propto \left\lceil \frac{T}{P} \right\rceil \cdot P \quad \Rightarrow \quad \text{waste}(T) \propto \left\lceil \frac{T}{P} \right\rceil \cdot P - T$$

PagedAttention allocates KV cache in fixed-size blocks, which eliminates external fragmentation, bounds internal waste to less than one block per sequence, and enables dynamic memory management and sharing.
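
The bookkeeping behind block-wise allocation can be sketched in a few lines; the block size, request length, and reserved maximum below are arbitrary example values, and this mimics the accounting rather than the actual vLLM allocator.

```python
import math

def paged_kv_blocks(n_tokens: int, block_size: int) -> tuple[int, int]:
    """Blocks allocated and token slots wasted when KV is stored in fixed-size blocks."""
    blocks = math.ceil(n_tokens / block_size)
    waste = blocks * block_size - n_tokens   # always < block_size per sequence
    return blocks, waste

def contiguous_waste(n_tokens: int, max_len: int) -> int:
    """Waste if KV space is pre-reserved for the maximum sequence length."""
    return max_len - n_tokens

T, P, MAX_LEN = 1_000, 16, 4_096   # hypothetical request and limits
blocks, wasted = paged_kv_blocks(T, P)
print(f"paged: {blocks} blocks, {wasted} wasted token slots")            # 63 blocks, 8 wasted
print(f"contiguous reservation wastes {contiguous_waste(T, MAX_LEN)}")   # 3096 wasted
```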

Key Equation
$$\text{Latency} \approx \text{TTFT} + (T_{\text{out}}-1) \cdot \text{TPOT}$$

Why It Matters for Modern Models

  • vLLM/PagedAttention is the production standard for open-source LLM serving—near-zero memory waste, dynamic batching
  • Prefill and decode are fundamentally different workloads (compute-bound parallel vs memory-bound sequential)—DistServe shows 4.48× speedup by separating them
  • Continuous batching keeps GPUs busy under variable request arrivals—static batching wastes resources waiting for all requests to finish
  • KV cache memory grows with context length and limits batch size; paging makes this predictable and efficient (a back-of-the-envelope sizing sketch follows this list)
  • Disaggregation (separate prefill/decode clusters) is the 2024-2025 frontier for production serving architecture
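
The batch-size constraint above can be made concrete with a back-of-the-envelope calculation; the model shape (roughly a 7B-class dense model in fp16) and the 40 GiB KV budget are assumptions for illustration.

```python
def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 2) -> int:
    """K and V tensors for one token across all layers."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem

# Illustrative 7B-class shape: 32 layers, 32 KV heads, head_dim 128, fp16.
per_token = kv_bytes_per_token(32, 32, 128)   # 0.5 MiB per token
ctx_len = 2048                                # tokens per sequence (assumed)
kv_budget = 40 * 1024**3                      # 40 GiB left after weights (assumed)

per_seq = per_token * ctx_len
max_batch = kv_budget // per_seq
print(f"{per_token / 2**20:.2f} MiB/token, {per_seq / 2**30:.2f} GiB/seq, "
      f"max batch = {max_batch}")
# Longer contexts shrink max_batch linearly, which is why the scheduler must
# treat KV memory, not compute, as the binding constraint.
```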

Missing Intuition

What is still poorly explained in textbooks and papers:

  • LLM inference comprises two different workloads: prefill (big parallel matmuls) and decode (tiny matmuls + huge KV reads)—mixing them creates interference
  • Continuous batching is not "bigger batches": it's maintaining a rolling set of active sequences as requests arrive and complete (see the scheduler sketch after this list)
  • KV cache is not just memory usage—it's a scheduler constraint that determines max batch size
  • Paging solves fragmentation: requests grow/shrink dynamically, contiguous allocation wastes memory, blocks/pages enable efficient sharing
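
Below is a highly simplified sketch of a continuous-batching control loop with a KV-block admission check; every name and constant in it (`Seq`, `BLOCK`, `TOTAL_BLOCKS`, the scheduler structure) is invented for illustration and does not correspond to any real scheduler's API.

```python
from collections import deque
from dataclasses import dataclass

BLOCK = 16            # KV block size in tokens (assumed)
TOTAL_BLOCKS = 1000   # size of the KV block pool (assumed)

@dataclass
class Seq:
    prompt_len: int
    max_new: int
    generated: int = 0

    def blocks_needed(self) -> int:
        tokens = self.prompt_len + self.generated + 1   # room for the next token
        return -(-tokens // BLOCK)                      # ceiling division

def continuous_batching(waiting: deque, steps: int) -> None:
    active: list[Seq] = []
    free_blocks = TOTAL_BLOCKS
    for _ in range(steps):
        # Admit new requests while KV blocks are available (prefill happens here).
        while waiting and waiting[0].blocks_needed() <= free_blocks:
            seq = waiting.popleft()
            free_blocks -= seq.blocks_needed()
            active.append(seq)
        if not active:
            break
        # One decode step for every active sequence ("iteration-level" scheduling).
        finished = []
        for seq in active:
            before = seq.blocks_needed()
            seq.generated += 1
            free_blocks -= seq.blocks_needed() - before  # KV grows lazily, block by block
            if seq.generated >= seq.max_new:
                finished.append(seq)
        # Completed sequences release their blocks immediately, so newcomers can
        # join on the very next iteration instead of waiting for the whole batch.
        for seq in finished:
            active.remove(seq)
            free_blocks += seq.blocks_needed()
        print(f"active={len(active)} waiting={len(waiting)} free_blocks={free_blocks}")

continuous_batching(deque([Seq(512, 64), Seq(128, 8), Seq(2048, 32)]), steps=10)
```

The admission check is the point where "KV cache as a scheduler constraint" becomes visible: the loop is gated on free blocks, not on compute.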
