Core Training

🎲 Decoding & Sampling: Temperature, Top-p & Inference-Time Control

Canonical Papers

  • The Curious Case of Neural Text Degeneration (Holtzman et al., 2019, ICLR)
  • Locally Typical Sampling (Meister et al., 2022, arXiv)
  • Classifier-Free Diffusion Guidance (Ho & Salimans, 2022, NeurIPS)

Core Mathematics

Training gives you a next-token distribution $p_\theta(\cdot \mid x_{<t})$. Decoding is the (often overlooked) step that turns probabilities into actual behavior.

---

## 1) Temperature reshapes the softmax

Given logits $z_i$ for tokens $i \in \mathcal{V}$, temperature $\tau$ produces:

$$p_\tau(i \mid x_{<t}) = \frac{e^{z_i/\tau}}{\sum_j e^{z_j/\tau}}$$

Lower $\tau$ sharpens (more deterministic). Higher $\tau$ flattens (more exploratory).
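
A minimal NumPy sketch of this (the logits here are made up, not from a real model): temperature is just one division applied to the logits before the softmax.

```python
import numpy as np

def temperature_softmax(logits: np.ndarray, tau: float) -> np.ndarray:
    """Turn a logit vector into probabilities at temperature tau."""
    z = logits / tau
    z = z - z.max()          # subtract the max for numerical stability
    p = np.exp(z)
    return p / p.sum()

logits = np.array([2.0, 1.0, 0.5, -1.0])      # made-up logits for a 4-token vocabulary
print(temperature_softmax(logits, tau=0.5))   # sharper: mass piles onto the top token
print(temperature_softmax(logits, tau=1.0))   # the model's raw distribution
print(temperature_softmax(logits, tau=2.0))   # flatter: tail tokens gain mass
```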

---

## 2) Nucleus (top-p) truncation deletes the tail, then renormalizes

Let $S_p$ be the smallest set of tokens whose probability mass is at least $p$:

$$S_p = \operatorname*{arg\,min}_{S \subseteq \mathcal{V}} \; |S| \quad \text{s.t.} \quad \sum_{i \in S} p(i) \ge p$$

Then sample from the truncated distribution:

$$p'(i) = \frac{p(i)\,\mathbf{1}[i \in S_p]}{\sum_{j \in S_p} p(j)}$$

This is why top-p is a behavior knob, not “formatting”: it literally changes the distribution you sample from.
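
A minimal NumPy sketch of the same two steps, truncate then renormalize, using a hypothetical five-token distribution:

```python
import numpy as np

def top_p_filter(p: np.ndarray, top_p: float) -> np.ndarray:
    """Keep the smallest high-probability set S_p with mass >= top_p, then renormalize."""
    order = np.argsort(p)[::-1]                                   # tokens sorted by descending probability
    cumulative = np.cumsum(p[order])
    nucleus_size = int(np.searchsorted(cumulative, top_p)) + 1    # |S_p|
    keep = order[:nucleus_size]
    p_prime = np.zeros_like(p)
    p_prime[keep] = p[keep]                                       # delete the tail ...
    return p_prime / p_prime.sum()                                # ... then renormalize

p = np.array([0.50, 0.25, 0.15, 0.07, 0.03])                      # toy next-token distribution
p_prime = top_p_filter(p, top_p=0.90)                             # keeps the first three tokens here
next_token = np.random.default_rng(0).choice(len(p), p=p_prime)
```

Because the nucleus is recomputed from the cumulative sum at every step, its size shrinks when the model is confident and grows when it is uncertain.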

---

## 3) A unifying idea: inference-time “guidance” is distribution shaping

Diffusion guidance has the same shape: it pushes samples toward a conditioning signal with a knob that trades fidelity vs diversity:

$$\epsilon_{\text{guided}} = \epsilon_{\text{uncond}} + w\big(\epsilon_{\text{cond}} - \epsilon_{\text{uncond}}\big)$$

Decoding in LLMs and guidance in diffusion both do inference-time preference shaping — *without retraining*.
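
As a sketch of the same knob in code (NumPy arrays stand in for the two denoiser outputs; `w` is the guidance scale from the equation above):

```python
import numpy as np

def classifier_free_guidance(eps_uncond: np.ndarray, eps_cond: np.ndarray, w: float) -> np.ndarray:
    """Blend the unconditional and conditional noise predictions with guidance scale w."""
    return eps_uncond + w * (eps_cond - eps_uncond)

# w = 0 ignores the condition, w = 1 recovers the purely conditional prediction,
# w > 1 extrapolates past it: more fidelity to the condition, less diversity.
rng = np.random.default_rng(0)
eps_uncond = rng.normal(size=(4, 4))   # stand-in for the unconditional denoiser output
eps_cond = rng.normal(size=(4, 4))     # stand-in for the conditional denoiser output
eps_guided = classifier_free_guidance(eps_uncond, eps_cond, w=3.0)
```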

Key Equation
$$p'(i) = \frac{p(i)\,\mathbf{1}[i \in S_p]}{\sum_{j \in S_p} p(j)}$$


Why It Matters for Modern Models

  • The same model can behave radically differently in products because decoding settings differ (temperature/top_p/top_k).
  • Decoding is practical control over determinism vs diversity, and over repetition/degeneration failure modes.
  • Inference-time control is “free” compared to retraining: you can shape behavior without changing weights.
  • Sampling choices affect reliability: low temperature can reduce variance but can also lock in the wrong answer.
  • This unifies LLM sampling with diffusion guidance: both expose a knob that trades diversity for conditioning/fidelity.

Missing Intuition

What is still poorly explained in textbooks and papers:

  • Decoding is not “just formatting”: it changes the effective distribution you sample from at inference time.
  • “Temperature = creativity” is sloppy—temperature is distribution shaping and can increase nonsense when the prompt is off-manifold.
  • Top-p is dynamic: it adapts per step to the entropy of the distribution; it’s not the same as a fixed top-k.
  • Degeneration (repetition loops) is an inference pathology; decoding settings can create or fix it without any training change.
  • For diffusion, “more guidance is better” is false—high guidance can push off-manifold and degrade quality.

Connections