Legacy Concept Lab
Beam Search & Structured Decoding
Beam search is standard for machine translation and structured generation tasks
#51 · Beam Search · Core Training
Key equation:
$$\text{score}(y_{1:t}) = \sum_{i=1}^{t} \log p(y_i \mid y_{<i})$$
Phase 9: Advanced architectures & generation
Concept 51 of 100
Why It Matters for Modern Models
- Beam search is standard for machine translation and structured generation tasks
- Explains the "greedy vs search" tradeoff: greedy decoding is fast but can miss higher-probability sequences, while beam search explores more hypotheses at a higher compute cost
- Understanding beam search clarifies why speculative decoding and constrained generation work
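To make the greedy-versus-search tradeoff concrete, here is a minimal sketch on a made-up two-step model. The table `TOY_MODEL` and all function names are hypothetical and chosen only for illustration; the probabilities are invented so that the locally best first token leads to a worse overall sequence.

```python
import math

# Hypothetical toy model: next-token distributions keyed by the prefix so far.
TOY_MODEL = {
    (): {"a": 0.6, "b": 0.4},
    ("a",): {"x": 0.5, "y": 0.5},
    ("b",): {"x": 0.9, "y": 0.1},
}

def sequence_log_prob(tokens):
    """Sum of log p(y_i | y_<i) under the toy model."""
    return sum(
        math.log(TOY_MODEL[tuple(tokens[:i])][tok]) for i, tok in enumerate(tokens)
    )

def greedy_decode(length=2):
    """Pick the single most probable token at every step."""
    prefix = []
    for _ in range(length):
        dist = TOY_MODEL[tuple(prefix)]
        prefix.append(max(dist, key=dist.get))
    return prefix

def exhaustive_best(length=2):
    """Enumerate every sequence of the given length (only feasible at toy scale)."""
    candidates = [[]]
    for _ in range(length):
        candidates = [c + [tok] for c in candidates for tok in TOY_MODEL[tuple(c)]]
    return max(candidates, key=sequence_log_prob)

greedy = greedy_decode()   # commits to "a" first because 0.6 > 0.4
best = exhaustive_best()   # finds the sequence starting with "b"
print(greedy, math.exp(sequence_log_prob(greedy)))
print(best, math.exp(sequence_log_prob(best)))
```

Greedy commits to "a" and ends at probability 0.6 × 0.5 = 0.30, while the sequence starting with "b" reaches 0.4 × 0.9 = 0.36. Keeping more than one hypothesis alive is what lets a search procedure recover the second path.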
What Tutorials Skip
What is still poorly explained in textbooks and papers:
- Beam search is NOT sampling: it approximates the argmax over sequences, which can produce boring or repetitive text
- A larger beam is not always better: under the "beam search curse", widening the beam can give worse translations, often because the highest-scoring hypotheses turn out to be too short
- For open-ended generation (chat), sampling usually beats beam search for quality
Interactive Visualization
Core Math (Optional Deep Dive)
If you want intuition first, start with the key equation and the visualization. Come back here for the full walkthrough.
Key Equation
Beam search maintains the top-$k$ partial sequences (the beam), ranked by score:
$$B_t = \{\, y^{(1)}_{1:t}, \ldots, y^{(k)}_{1:t} \,\}$$
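At step $t$, expand each beam by every vocabulary token and keep the top-$k$ candidates by score:
$$B_t = \underset{y_{1:t-1} \in B_{t-1},\ y_t \in V}{\operatorname{arg\,top-}k}\ \text{score}(y_{1:t})$$
where $B_{t-1}$ is the set of $k$ partial sequences kept at the previous step and $V$ is the vocabulary.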
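The expand-and-prune step can be sketched as follows. This is a minimal illustration, not a production decoder: `next_token_probs` is a hypothetical callback standing in for a real model, `toy_model` is an invented probability table, and the end-of-sequence marker `"</s>"` is an assumption.

```python
import math
from typing import Callable, Dict, List, Tuple

Beam = Tuple[Tuple[str, ...], float]  # (tokens, cumulative log-probability)

def beam_search(
    next_token_probs: Callable[[Tuple[str, ...]], Dict[str, float]],
    beam_width: int,
    max_steps: int,
    eos: str = "</s>",
) -> List[Beam]:
    """Keep the beam_width highest-scoring partial sequences at every step."""
    beams: List[Beam] = [((), 0.0)]
    for _ in range(max_steps):
        candidates: List[Beam] = []
        for tokens, score in beams:
            if tokens and tokens[-1] == eos:
                candidates.append((tokens, score))  # finished hypotheses carry over
                continue
            # Expand this hypothesis by every token with non-zero probability.
            for tok, p in next_token_probs(tokens).items():
                if p > 0.0:
                    candidates.append((tokens + (tok,), score + math.log(p)))
        # Prune: keep only the top beam_width candidates by summed log-probability.
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_width]
        if all(t and t[-1] == eos for t, _ in beams):
            break
    return beams

def toy_model(prefix: Tuple[str, ...]) -> Dict[str, float]:
    """Hypothetical next-token table; unknown prefixes simply end the sequence."""
    table = {
        (): {"a": 0.6, "b": 0.4},
        ("a",): {"x": 0.5, "y": 0.5},
        ("b",): {"x": 0.9, "y": 0.1},
    }
    return table.get(prefix, {"</s>": 1.0})

width_1 = beam_search(toy_model, beam_width=1, max_steps=2)  # same as greedy
width_2 = beam_search(toy_model, beam_width=2, max_steps=2)
print(width_1[0], width_2[0])
```

With `beam_width=1` the procedure reduces to greedy decoding and returns the sequence starting with "a". With `beam_width=2` the hypothesis starting with "b" survives the first step and ends up on top. The cost per step grows with the beam width times the vocabulary size, which is the price of the extra exploration.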
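Length normalization prevents bias toward short sequences, which arises because every added token contributes a non-positive log-probability. One common form divides the score by a power of the length, with an exponent $\alpha$ typically between 0 and 1:
$$\text{score}_{\text{norm}}(y_{1:t}) = \frac{1}{t^{\alpha}} \sum_{i=1}^{t} \log p(y_i \mid y_{<i})$$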
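A small numeric sketch shows the effect. It assumes the simple "divide by length to the power alpha" form of normalization; other length penalties exist, and the function name and example probabilities here are made up for illustration.

```python
import math

def length_normalized_score(log_probs, alpha=0.7):
    """Sum of token log-probs divided by length**alpha (alpha=0 means no normalization)."""
    if not log_probs:
        raise ValueError("log_probs must contain at least one token")
    return sum(log_probs) / (len(log_probs) ** alpha)

# Hypothetical hypotheses: one mediocre token versus four fairly confident tokens.
short_hyp = [math.log(0.5)]
long_hyp = [math.log(0.8)] * 4

raw_short = length_normalized_score(short_hyp, alpha=0.0)
raw_long = length_normalized_score(long_hyp, alpha=0.0)
norm_short = length_normalized_score(short_hyp, alpha=1.0)
norm_long = length_normalized_score(long_hyp, alpha=1.0)
print(raw_short, raw_long)    # unnormalized: the short hypothesis scores higher
print(norm_short, norm_long)  # normalized: the long hypothesis scores higher
```

Without normalization the one-token hypothesis wins even though each of its competitor's tokens is more probable, simply because four negative terms add up to more than one. Dividing by the length reverses the ranking. Intermediate values of `alpha` interpolate between the two behaviors.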
Diverse beam search adds a diversity penalty so that the beams do not collapse onto near-identical sequences.
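One way to picture the diversity penalty is a Hamming-style adjustment: when beams are split into groups, a token already picked by an earlier group at the current step has its score lowered for later groups. The sketch below shows only that adjustment, under assumed names (`apply_hamming_diversity`, `diversity_strength`) and invented probabilities; it is not a full diverse beam search implementation.

```python
import math
from collections import Counter

def apply_hamming_diversity(token_log_probs, tokens_chosen_by_earlier_groups,
                            diversity_strength=0.5):
    """Subtract diversity_strength for each earlier group that already chose the token."""
    counts = Counter(tokens_chosen_by_earlier_groups)
    return {
        tok: lp - diversity_strength * counts[tok]
        for tok, lp in token_log_probs.items()
    }

# Hypothetical next-token log-probabilities for the current group.
log_probs = {"the": math.log(0.5), "a": math.log(0.4), "an": math.log(0.1)}

# An earlier group already picked "the" at this step, so it gets penalized here.
adjusted = apply_hamming_diversity(log_probs, ["the"], diversity_strength=0.5)

best_without_penalty = max(log_probs, key=log_probs.get)
best_with_penalty = max(adjusted, key=adjusted.get)
print(best_without_penalty, best_with_penalty)
```

With the penalty applied, the second group prefers "a" over "the", so the groups start to diverge. A larger `diversity_strength` pushes groups further apart at the cost of selecting lower-probability tokens.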