#20 · Efficiency

Speculative Decoding: Lossless Multi-Token Generation

Canonical Papers

Fast Inference from Transformers via Speculative Decoding

Leviathan, Kalman, Matias (ICML 2023)

Sequoia: Scalable and Robust Speculative Decoding

Chen et al. (NeurIPS 2024)

SpecInfer: Accelerating Generative Large Language Model Serving

Miao et al. (ASPLOS 2024)

Core Mathematics

Autoregressive LLMs generate one token per forward pass—the sequential bottleneck. Speculative decoding breaks this by using a fast draft model to propose multiple tokens, then verifying them in parallel with the target model.

Key insight: This is lossless; the output distribution matches the target model exactly, thanks to rejection sampling.

Acceptance probability for each draft token $x_i$:

$$\alpha_i = \min\!\left(1, \frac{p_i[x_i]}{q_i[x_i]}\right)$$

where $p_i$ is the target model distribution and $q_i$ is the draft distribution at position $i$.

Residual sampling when rejected:

$$x_i \sim \text{Normalize}\!\left(\max(0,\, p_i - q_i)\right)$$

This ensures the output distribution is exactly $p$, not $q$: the method is mathematically lossless.
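A minimal NumPy sketch of one verification pass may make the rule concrete. It assumes the draft probabilities $q$, the target probabilities $p$ (both of shape $k \times |V|$), and the $k$ drafted tokens are already computed; the function name `verify_draft` and its interface are illustrative, not taken from any library.

```python
import numpy as np

def verify_draft(p, q, draft_tokens, rng):
    """Accept/reject k drafted tokens against the target distribution.

    p, q: arrays of shape (k, vocab) with target / draft probabilities.
    draft_tokens: the k token ids sampled from q.
    Returns the accepted prefix, plus one corrective token if a draft is rejected.
    """
    k, vocab = q.shape
    out = []
    for i, x in enumerate(draft_tokens):
        # Accept x_i with probability min(1, p_i[x_i] / q_i[x_i]).
        if rng.random() < min(1.0, p[i, x] / q[i, x]):
            out.append(x)
        else:
            # Rejected: resample from the normalized residual max(0, p_i - q_i),
            # which is exactly what restores the target distribution p_i.
            residual = np.maximum(p[i] - q[i], 0.0)
            residual /= residual.sum()
            out.append(int(rng.choice(vocab, p=residual)))
            return out  # later drafts were conditioned on the rejected token
    # All k drafts accepted; the same target pass also produced p_{k+1},
    # so a real implementation samples one bonus token here (omitted).
    return out
```

The loop stops at the first rejection because every later draft position was conditioned on the token that was just thrown away; even so, each verification pass yields at least one token drawn from $p$.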

Speedup comes from accepting multiple draft tokens at once when the draft model is accurate. With acceptance rate $\alpha$ and draft length $k$:

$$\text{Speedup} \approx \frac{\alpha \cdot k}{1 + (1-\alpha) \cdot k}$$
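Plugging numbers into this approximation makes the trade-off visible; the values below simply evaluate the formula above and are not measured benchmarks.

```python
# Evaluate the speedup approximation above for a few (alpha, k) pairs.
def approx_speedup(alpha: float, k: int) -> float:
    return (alpha * k) / (1 + (1 - alpha) * k)

for alpha in (0.7, 0.8, 0.9):
    for k in (4, 8):
        print(f"alpha={alpha:.1f}, k={k}: ~{approx_speedup(alpha, k):.2f}x")
```

For reference, Leviathan et al. derive the expected number of tokens produced per target forward pass as $\frac{1 - \alpha^{k+1}}{1 - \alpha}$; the expression above is a coarser rule of thumb.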


Why It Matters for Modern Models

  • Production inference systems (Google DeepMind's Gemini, Google Vertex AI) use speculative decoding to reduce latency while maintaining exact output quality
  • Enables 2-3× speedups on common workloads without changing model quality—pure systems optimization
  • Tree-based speculation (Sequoia) extends this to multiple branches, achieving >3× speedups
  • Critical for interactive applications where latency matters—chat, code completion, real-time agents
  • Combines with #19 (efficient attention), since the verification step is attention-heavy and benefits from FlashAttention/GQA

Missing Intuition

What is still poorly explained in textbooks and papers:

  • Why is this lossless? The rejection sampling mechanism ensures the target distribution $p$ is preserved exactly, not approximately
  • The draft model doesn't need to be good everywhere, just similar enough to the target on the current input; specialization matters
  • Verification is parallel attention over k tokens, so #19 optimizations directly improve speculative decoding throughput
  • Tree speculation trades compute for coverage: explore multiple futures instead of one linear sequence (see the sketch after this list)
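As a rough illustration of that last point, the sketch below builds a tiny draft tree and the ancestor-only attention mask that lets the target model score every branch in a single forward pass. It mirrors the tree-attention idea used by SpecInfer and Sequoia, but the data layout and token choices here are assumptions for illustration, not their actual implementations.

```python
import numpy as np

# Illustrative only: a draft tree stored as (token, parent_index) pairs,
# with parent = -1 meaning "attach directly to the verified prefix".
# Two branches of depth 2 sharing the root draft token.
tree = [
    ("the", -1),   # node 0
    ("cat", 0),    # node 1: continuation A
    ("dog", 0),    # node 2: continuation B
    ("sat", 1),    # node 3: extends "the cat"
    ("ran", 2),    # node 4: extends "the dog"
]

n = len(tree)
mask = np.zeros((n, n), dtype=bool)
for i, (_, parent) in enumerate(tree):
    mask[i, i] = True          # each node attends to itself
    j = parent
    while j != -1:             # ...and to all of its ancestors
        mask[i, j] = True
        j = tree[j][1]

print(mask.astype(int))
# Row 3 attends to nodes {0, 1, 3} ("the cat sat") and not to the "dog" branch,
# so one target forward pass over 5 positions verifies both futures at once.
```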
