Legacy Concept Lab

Beam Search & Structured Decoding

Beam search is standard for machine translation and structured generation tasks

Concept 51 of 100Core TrainingPhase 9

#51Beam SearchCore Training

key equation\text{score}(y_{1:t}) = \sum_{i=1}^{t} \log p(y_i | y_{<i})

Phase 9: Advanced architectures & generationConcept 51 of 100

Why It Matters for Modern Models

Beam search is standard for machine translation and structured generation tasks
Explains the "greedy vs search" tradeoff: greedy is fast but suboptimal, beam explores more
Understanding beam search clarifies why speculative decoding and constrained generation work

What is still poorly explained in textbooks and papers:

Beam search is NOT sampling—it approximates argmax, which can produce boring/repetitive text
Larger beam ≠ always better: "beam search curse" where larger beams give worse translations
For open-ended generation (chat), sampling usually beats beam search for quality

If you want intuition first, start with the key equation and the visualization. Come back here for the full walkthrough.

Key Equation

\text{score}(y_{1:t}) = \sum_{i=1}^{t} \log p(y_i | y_{<i})

Beam search maintains top- $B$ partial sequences:

At step $t$ , expand each beam by all vocab tokens, keep top- $B$ by score:

\text{score}(y_{1:t}) = \sum_{i=1}^{t} \log p(y_i | y_{<i})

Length normalization prevents bias toward short sequences:

\text{score}_{\text{norm}} = \frac{1}{t^\alpha} \sum_{i=1}^{t} \log p(y_i | y_{<i})

Diverse beam search adds diversity penalty to avoid similar beams.

Sutskever, Vinyals, Le2014NeurIPS

Explore this concept from different angles — like a mathematician would.