Speculative Decoding: Lossless Multi-Token Generation
Canonical Papers
- Fast Inference from Transformers via Speculative Decoding
- Sequoia: Scalable and Robust Speculative Decoding
- SpecInfer: Accelerating Generative Large Language Model Serving
Core Mathematics
Autoregressive LLMs generate one token per forward pass—the sequential bottleneck. Speculative decoding breaks this by using a fast draft model to propose multiple tokens, then verifying them in parallel with the target model.
Key insight: This is lossless—output distribution matches the target model exactly via rejection sampling.
Acceptance probability for each draft token $x_i$:

$$a_i = \min\!\left(1, \frac{p(x_i)}{q(x_i)}\right)$$

where $p$ is the target model distribution and $q$ is the draft model distribution at position $i$.

Residual sampling when rejected:

$$x_i \sim \mathrm{norm}\big(\max(0,\ p(x) - q(x))\big)$$

This ensures the output distribution is exactly $p$, not $q$—the method is mathematically lossless.
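The accept/reject rule can be checked numerically. Below is a minimal sketch (not a production implementation); the toy vocabulary and the distributions `p` and `q` are illustrative values, not from any real model:

```python
import numpy as np

def speculative_accept(p, q, draft_token, rng):
    """One accept/reject decision from speculative decoding.

    p, q: target and draft probability vectors over the vocabulary.
    draft_token: index proposed by the draft model (sampled from q).
    Returns the final token; the overall output distribution is exactly p.
    """
    # Accept with probability min(1, p[x] / q[x]).
    if rng.random() < min(1.0, p[draft_token] / q[draft_token]):
        return draft_token
    # Rejected: sample from the normalized residual max(0, p - q).
    residual = np.maximum(p - q, 0.0)
    residual /= residual.sum()
    return rng.choice(len(p), p=residual)

# Empirical check of losslessness on a toy 3-token vocabulary.
rng = np.random.default_rng(0)
p = np.array([0.6, 0.3, 0.1])   # target distribution
q = np.array([0.3, 0.5, 0.2])   # draft distribution
n = 50_000
counts = np.zeros(3)
for _ in range(n):
    draft = rng.choice(3, p=q)        # draft model proposes from q
    counts[speculative_accept(p, q, draft, rng)] += 1
print(counts / n)  # empirically close to p = [0.6, 0.3, 0.1]
```

Even though every proposal comes from `q`, the accept/reject step plus residual sampling recovers `p` exactly, which is the whole point of the losslessness claim.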
Speedup comes from accepting multiple draft tokens at once when the draft model is accurate. With per-token acceptance rate $\alpha$ and draft length $k$, the expected number of tokens generated per target-model pass is

$$E[\#\text{tokens}] = \frac{1 - \alpha^{k+1}}{1 - \alpha}$$
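Plugging numbers into the expected-tokens formula makes the speedup concrete (a quick sketch; the 0.8 acceptance rate and draft length 4 are illustrative values, not benchmarks):

```python
def expected_tokens(alpha: float, k: int) -> float:
    """Expected tokens generated per target-model forward pass,
    assuming each draft token is accepted independently with rate
    alpha and k tokens are drafted per step (a geometric series:
    1 + alpha + alpha^2 + ... + alpha^k)."""
    return (1 - alpha ** (k + 1)) / (1 - alpha)

# With an 80% acceptance rate and 4 drafted tokens per step,
# each target pass yields ~3.36 tokens instead of 1.
print(round(expected_tokens(0.8, 4), 2))  # 3.36
```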
Why It Matters for Modern Models
- Production inference systems (Google DeepMind's Gemini, Google Vertex AI) use speculative decoding to reduce latency while maintaining exact output quality
- Enables 2-3× speedups on common workloads without changing model quality—pure systems optimization
- Tree-based speculation (Sequoia) extends this to multiple branches, achieving >3× speedups
- Critical for interactive applications where latency matters—chat, code completion, real-time agents
- Combines with #19 (efficient attention) since verification step is attention-heavy and benefits from FlashAttention/GQA
Missing Intuition
What is still poorly explained in textbooks and papers:
- Why is this lossless? The rejection sampling mechanism ensures the target distribution p is preserved exactly—it is not an approximation
- The draft model doesn't need to be good everywhere, just similar enough to the target on the current input—specialization matters
- Verification is parallel attention over k tokens, so #19 optimizations directly improve speculative decoding throughput
- Tree speculation trades compute for coverage—explore multiple futures instead of one linear sequence
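The tree-speculation idea can be sketched in miniature. The helper below is hypothetical and simplified to greedy verification (real systems such as Sequoia and SpecInfer use stochastic verification to stay lossless, and score the whole token tree in one batched forward pass rather than a Python loop):

```python
def verify_branches(branches, target_next):
    """Pick the longest draft prefix the target model agrees with.

    branches: candidate token sequences (paths through the draft tree).
    target_next: callable mapping an accepted prefix (tuple) to the
        target model's greedy next token — a hypothetical stand-in
        for one batched forward pass over the tree.
    """
    best = []
    for branch in branches:
        accepted = []
        for tok in branch:
            if target_next(tuple(accepted)) != tok:
                break  # first disagreement ends this branch
            accepted.append(tok)
        if len(accepted) > len(best):
            best = accepted
    return best

# Toy target whose greedy continuation is always 1, 2, 3, ...
target_next = lambda prefix: len(prefix) + 1
print(verify_branches([[1, 2, 5], [1, 2, 3], [4]], target_next))  # [1, 2, 3]
```

Exploring three branches costs one (batched) verification pass but triples the chance that some branch survives deep, which is the compute-for-coverage trade the last bullet describes.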