Core Training

🎲 Decoding & Sampling: Temperature, Top-p & Inference-Time Control

Canonical Papers

  • The Curious Case of Neural Text Degeneration (Holtzman et al., 2019, ICLR)
  • Locally Typical Sampling (Meister et al., 2022, arXiv)
  • Classifier-Free Diffusion Guidance (Ho & Salimans, 2022, NeurIPS)

Core Mathematics

Training gives you a next-token distribution $p_\theta(\cdot \mid x_{<t})$. Decoding is the (often overlooked) step that turns probabilities into actual behavior.

---

## 1) Temperature reshapes the softmax

Given logits $z_i$ for tokens $i \in \mathcal{V}$, temperature $\tau$ produces:

$$p_\tau(i \mid x_{<t}) = \frac{e^{z_i/\tau}}{\sum_j e^{z_j/\tau}}$$

Lower $\tau$ sharpens (more deterministic). Higher $\tau$ flattens (more exploratory).
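
A minimal NumPy sketch of this (the logits here are made up, not from a real model): temperature is just one division applied to the logits before the softmax.

```python
import numpy as np

def temperature_softmax(logits: np.ndarray, tau: float) -> np.ndarray:
    """Turn a logit vector into probabilities at temperature tau."""
    z = logits / tau
    z = z - z.max()          # subtract the max for numerical stability
    p = np.exp(z)
    return p / p.sum()

logits = np.array([2.0, 1.0, 0.5, -1.0])      # made-up logits for a 4-token vocabulary
print(temperature_softmax(logits, tau=0.5))   # sharper: mass piles onto the top token
print(temperature_softmax(logits, tau=1.0))   # the model's raw distribution
print(temperature_softmax(logits, tau=2.0))   # flatter: tail tokens gain mass
```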

---

## 2) Nucleus (top-p) truncation deletes the tail, then renormalizes

Let $S_p$ be the smallest set of tokens whose probability mass is at least $p$:

$$S_p = \operatorname*{arg\,min}_{S \subseteq \mathcal{V}} \; |S| \quad \text{s.t.} \quad \sum_{i \in S} p(i) \ge p$$

Then sample from the truncated distribution:

$$p'(i) = \frac{p(i)\,\mathbf{1}[i \in S_p]}{\sum_{j \in S_p} p(j)}$$

This is why top-p is a behavior knob, not “formatting”: it literally changes the distribution you sample from.
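
A minimal NumPy sketch of the same two steps, truncate then renormalize, using a hypothetical five-token distribution:

```python
import numpy as np

def top_p_filter(p: np.ndarray, top_p: float) -> np.ndarray:
    """Keep the smallest high-probability set S_p with mass >= top_p, then renormalize."""
    order = np.argsort(p)[::-1]                                   # tokens sorted by descending probability
    cumulative = np.cumsum(p[order])
    nucleus_size = int(np.searchsorted(cumulative, top_p)) + 1    # |S_p|
    keep = order[:nucleus_size]
    p_prime = np.zeros_like(p)
    p_prime[keep] = p[keep]                                       # delete the tail ...
    return p_prime / p_prime.sum()                                # ... then renormalize

p = np.array([0.50, 0.25, 0.15, 0.07, 0.03])                      # toy next-token distribution
p_prime = top_p_filter(p, top_p=0.90)                             # keeps the first three tokens here
next_token = np.random.default_rng(0).choice(len(p), p=p_prime)
```

Because the nucleus is recomputed from the cumulative sum at every step, its size shrinks when the model is confident and grows when it is uncertain.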

---

## 3) A unifying idea: inference-time “guidance” is distribution shaping

Diffusion guidance has the same shape: it pushes samples toward a conditioning signal with a knob that trades fidelity vs diversity:

$$\epsilon_{\text{guided}} = \epsilon_{\text{uncond}} + w\big(\epsilon_{\text{cond}} - \epsilon_{\text{uncond}}\big)$$

Decoding in LLMs and guidance in diffusion both do inference-time preference shaping — *without retraining*.
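
As a sketch of the same knob in code (NumPy arrays stand in for the two denoiser outputs; `w` is the guidance scale from the equation above):

```python
import numpy as np

def classifier_free_guidance(eps_uncond: np.ndarray, eps_cond: np.ndarray, w: float) -> np.ndarray:
    """Blend the unconditional and conditional noise predictions with guidance scale w."""
    return eps_uncond + w * (eps_cond - eps_uncond)

# w = 0 ignores the condition, w = 1 recovers the purely conditional prediction,
# w > 1 extrapolates past it: more fidelity to the condition, less diversity.
rng = np.random.default_rng(0)
eps_uncond = rng.normal(size=(4, 4))   # stand-in for the unconditional denoiser output
eps_cond = rng.normal(size=(4, 4))     # stand-in for the conditional denoiser output
eps_guided = classifier_free_guidance(eps_uncond, eps_cond, w=3.0)
```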

Key Equation
$$p'(i) = \frac{p(i)\,\mathbf{1}[i \in S_p]}{\sum_{j \in S_p} p(j)}$$


Why It Matters for Modern Models

  • The same model can behave radically differently in products because decoding settings differ (temperature/top_p/top_k).
  • Decoding is practical control over determinism vs diversity, and over repetition/degeneration failure modes.
  • Inference-time control is “free” compared to retraining: you can shape behavior without changing weights.
  • Sampling choices affect reliability: low temperature can reduce variance but can also lock in the wrong answer.
  • This unifies LLM sampling with diffusion guidance: both expose a knob that trades diversity for conditioning/fidelity.

Missing Intuition

What is still poorly explained in textbooks and papers:

  • Decoding is not “just formatting”: it changes the effective distribution you sample from at inference time.
  • “Temperature = creativity” is sloppy—temperature is distribution shaping and can increase nonsense when the prompt is off-manifold.
  • Top-p is dynamic: it adapts per step to the entropy of the distribution; it’s not the same as a fixed top-k.
  • Degeneration (repetition loops) is an inference pathology; decoding settings can create or fix it without any training change.
  • For diffusion, “more guidance is better” is false—high guidance can push off-manifold and degrade quality.

Connections