#20 · Efficiency

Speculative Decoding: Lossless Multi-Token Generation

Canonical Papers

Fast Inference from Transformers via Speculative Decoding

Leviathan, Kalman, Matias (ICML 2023)

Sequoia: Scalable and Robust Speculative Decoding

Chen et al. (NeurIPS 2024)

SpecInfer: Accelerating Generative Large Language Model Serving

Miao et al. (ASPLOS 2024)

Core Mathematics

Autoregressive LLMs generate one token per forward pass—the sequential bottleneck. Speculative decoding breaks this by using a fast draft model to propose multiple tokens, then verifying them in parallel with the target model.

Key insight: This is lossless; the output distribution matches the target model exactly, thanks to rejection sampling.

Acceptance probability for each draft token $x_i$:

$$\alpha_i = \min\!\left(1, \frac{p_i[x_i]}{q_i[x_i]}\right)$$

where $p_i$ is the target model distribution and $q_i$ is the draft distribution at position $i$.

Residual sampling when rejected:

$$x_i \sim \text{Normalize}\!\left(\max(0,\, p_i - q_i)\right)$$

This ensures the output distribution is exactly $p$, not $q$: the method is mathematically lossless.
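A minimal NumPy sketch of one verification pass may make the rule concrete. It assumes the draft probabilities $q$, the target probabilities $p$ (both of shape $k \times |V|$), and the $k$ drafted tokens are already computed; the function name `verify_draft` and its interface are illustrative, not taken from any library.

```python
import numpy as np

def verify_draft(p, q, draft_tokens, rng):
    """Accept/reject k drafted tokens against the target distribution.

    p, q: arrays of shape (k, vocab) with target / draft probabilities.
    draft_tokens: the k token ids sampled from q.
    Returns the accepted prefix, plus one corrective token if a draft is rejected.
    """
    k, vocab = q.shape
    out = []
    for i, x in enumerate(draft_tokens):
        # Accept x_i with probability min(1, p_i[x_i] / q_i[x_i]).
        if rng.random() < min(1.0, p[i, x] / q[i, x]):
            out.append(x)
        else:
            # Rejected: resample from the normalized residual max(0, p_i - q_i),
            # which is exactly what restores the target distribution p_i.
            residual = np.maximum(p[i] - q[i], 0.0)
            residual /= residual.sum()
            out.append(int(rng.choice(vocab, p=residual)))
            return out  # later drafts were conditioned on the rejected token
    # All k drafts accepted; the same target pass also produced p_{k+1},
    # so a real implementation samples one bonus token here (omitted).
    return out
```

The loop stops at the first rejection because every later draft position was conditioned on the token that was just thrown away; even so, each verification pass yields at least one token drawn from $p$.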

Speedup comes from accepting multiple draft tokens at once when the draft model is accurate. With acceptance rate $\alpha$ and draft length $k$:

$$\text{Speedup} \approx \frac{\alpha \cdot k}{1 + (1-\alpha) \cdot k}$$
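Plugging numbers into this approximation makes the trade-off visible; the values below simply evaluate the formula above and are not measured benchmarks.

```python
# Evaluate the speedup approximation above for a few (alpha, k) pairs.
def approx_speedup(alpha: float, k: int) -> float:
    return (alpha * k) / (1 + (1 - alpha) * k)

for alpha in (0.7, 0.8, 0.9):
    for k in (4, 8):
        print(f"alpha={alpha:.1f}, k={k}: ~{approx_speedup(alpha, k):.2f}x")
```

For reference, Leviathan et al. derive the expected number of tokens produced per target forward pass as $\frac{1 - \alpha^{k+1}}{1 - \alpha}$; the expression above is a coarser rule of thumb.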


Why It Matters for Modern Models

  • Production inference systems (Google DeepMind's Gemini, Google Vertex AI) use speculative decoding to reduce latency while maintaining exact output quality
  • Enables 2-3× speedups on common workloads without changing model quality—pure systems optimization
  • Tree-based speculation (Sequoia) extends this to multiple branches, achieving >3× speedups
  • Critical for interactive applications where latency matters—chat, code completion, real-time agents
  • Combines with #19 (efficient attention), since the verification step is attention-heavy and benefits from FlashAttention/GQA

Missing Intuition

What is still poorly explained in textbooks and papers:

  • Why is this lossless? The rejection sampling mechanism ensures the target distribution $p$ is preserved exactly, not approximately
  • The draft model doesn't need to be good everywhere, just similar enough to the target on the current input; specialization matters
  • Verification is parallel attention over k tokens, so #19 optimizations directly improve speculative decoding throughput
  • Tree speculation trades compute for coverage: explore multiple futures instead of one linear sequence (see the sketch after this list)
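As a rough illustration of that last point, the sketch below builds a tiny draft tree and the ancestor-only attention mask that lets the target model score every branch in a single forward pass. It mirrors the tree-attention idea used by SpecInfer and Sequoia, but the data layout and token choices here are assumptions for illustration, not their actual implementations.

```python
import numpy as np

# Illustrative only: a draft tree stored as (token, parent_index) pairs,
# with parent = -1 meaning "attach directly to the verified prefix".
# Two branches of depth 2 sharing the root draft token.
tree = [
    ("the", -1),   # node 0
    ("cat", 0),    # node 1: continuation A
    ("dog", 0),    # node 2: continuation B
    ("sat", 1),    # node 3: extends "the cat"
    ("ran", 2),    # node 4: extends "the dog"
]

n = len(tree)
mask = np.zeros((n, n), dtype=bool)
for i, (_, parent) in enumerate(tree):
    mask[i, i] = True          # each node attends to itself
    j = parent
    while j != -1:             # ...and to all of its ancestors
        mask[i, j] = True
        j = tree[j][1]

print(mask.astype(int))
# Row 3 attends to nodes {0, 1, 3} ("the cat sat") and not to the "dog" branch,
# so one target forward pass over 5 positions verifies both futures at once.
```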
