Legacy Concept Lab

Beam Search & Structured Decoding

Beam search is standard for machine translation and structured generation tasks

Concept 51 of 100Core TrainingPhase 9
#51Beam SearchCore Training
key equation\text{score}(y_{1:t}) = \sum_{i=1}^{t} \log p(y_i | y_{<i})
Phase 9: Advanced architectures & generationConcept 51 of 100

Why It Matters for Modern Models

  • Beam search is standard for machine translation and structured generation tasks
  • Explains the "greedy vs search" tradeoff: greedy is fast but suboptimal, beam explores more
  • Understanding beam search clarifies why speculative decoding and constrained generation work

What Tutorials Skip

What is still poorly explained in textbooks and papers:

  • Beam search is NOT sampling—it approximates argmax, which can produce boring/repetitive text
  • Larger beam ≠ always better: "beam search curse" where larger beams give worse translations
  • For open-ended generation (chat), sampling usually beats beam search for quality

Interactive Visualization

Core Math (Optional Deep Dive)

If you want intuition first, start with the key equation and the visualization. Come back here for the full walkthrough.

Key Equation
score(y1:t)=i=1tlogp(yiy<i)\text{score}(y_{1:t}) = \sum_{i=1}^{t} \log p(y_i | y_{<i})

Beam search maintains top-BB partial sequences:

At step tt, expand each beam by all vocab tokens, keep top-BB by score:

score(y1:t)=i=1tlogp(yiy<i)\text{score}(y_{1:t}) = \sum_{i=1}^{t} \log p(y_i | y_{<i})

Length normalization prevents bias toward short sequences:

scorenorm=1tαi=1tlogp(yiy<i)\text{score}_{\text{norm}} = \frac{1}{t^\alpha} \sum_{i=1}^{t} \log p(y_i | y_{<i})

Diverse beam search adds diversity penalty to avoid similar beams.

Canonical Papers

Sequence to Sequence Learning with Neural Networks

Sutskever, Vinyals, Le2014NeurIPS
Read paper →

Connections

Next Moves

Explore this concept from different angles — like a mathematician would.