Decoding & Sampling: Temperature, Top-p & Inference-Time Control
Canonical Papers
- The Curious Case of Neural Text Degeneration
- Locally Typical Sampling
- Classifier-Free Diffusion Guidance

Core Mathematics
Training gives you a next-token distribution $p_\theta(x_t \mid x_{<t})$. Decoding is the (often overlooked) step that turns those probabilities into actual behavior.
---
## 1) Temperature reshapes the softmax
Given logits $z_i$ for tokens $i \in V$, temperature $T$ produces:

$$p_T(i) = \frac{\exp(z_i / T)}{\sum_{j \in V} \exp(z_j / T)}$$

Lower $T$ sharpens the distribution (more deterministic). Higher $T$ flattens it (more exploratory); $T = 1$ recovers the raw softmax.
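A minimal NumPy sketch of this reshaping; `apply_temperature` and the toy logits are illustrative names and values, not taken from the papers above:

```python
import numpy as np

def apply_temperature(logits: np.ndarray, T: float) -> np.ndarray:
    """Return the softmax of logits scaled by temperature T."""
    z = logits / T
    z = z - z.max()            # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()

logits = np.array([2.0, 1.0, 0.2, -1.0])
for T in (0.5, 1.0, 2.0):
    print(f"T={T}: {np.round(apply_temperature(logits, T), 3)}")
# Lower T concentrates probability on the argmax token;
# higher T spreads it across the vocabulary.
```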
---
## 2) Nucleus (top-p) truncation deletes the tail, then renormalizes
Let $V_p \subseteq V$ be the smallest set of tokens whose cumulative probability mass is at least $p$:

$$\sum_{i \in V_p} p_T(i) \ge p$$

Then sample from the truncated, renormalized distribution:

$$p_{\text{top-}p}(i) = \begin{cases} \dfrac{p_T(i)}{\sum_{j \in V_p} p_T(j)} & \text{if } i \in V_p \\[4pt] 0 & \text{otherwise} \end{cases}$$
This is why top-p is a behavior knob, not “formatting”: it literally changes the distribution you sample from.
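A minimal sketch of one nucleus-sampling step, assuming the per-token probabilities are already computed; `top_p_sample` and the toy probability vector are illustrative:

```python
import numpy as np

def top_p_sample(probs: np.ndarray, p: float, rng=None) -> int:
    """Sample a token index from the renormalized nucleus V_p."""
    rng = rng or np.random.default_rng()
    order = np.argsort(probs)[::-1]              # tokens sorted by probability, descending
    cdf = np.cumsum(probs[order])
    cutoff = np.searchsorted(cdf, p) + 1         # smallest prefix whose mass is >= p
    nucleus = order[:cutoff]
    q = probs[nucleus] / probs[nucleus].sum()    # renormalize inside the nucleus
    return int(rng.choice(nucleus, p=q))

probs = np.array([0.45, 0.25, 0.15, 0.10, 0.05])
print(top_p_sample(probs, p=0.9))                # draws from the 4 tokens covering >= 0.9 mass
```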
---
## 3) A unifying idea: inference-time “guidance” is distribution shaping
Diffusion guidance has the same shape: it pushes samples toward a conditioning signal $c$, with a knob that trades fidelity against diversity. In classifier-free guidance the combined noise prediction is

$$\hat{\epsilon}_\theta(x_t, c) = \epsilon_\theta(x_t, \varnothing) + w\,\big(\epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \varnothing)\big),$$

where $w$ is the guidance scale: $w = 1$ recovers the conditional model, and larger $w$ trades diversity (and eventually sample quality) for conditioning fidelity.
Decoding in LLMs and guidance in diffusion both do inference-time preference shaping — *without retraining*.
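A toy sketch of the guidance combination under the parameterization above; `cfg_blend` and the small arrays stand in for noise predictions from an actual diffusion model:

```python
import numpy as np

def cfg_blend(eps_uncond: np.ndarray, eps_cond: np.ndarray, w: float) -> np.ndarray:
    """Move the unconditional noise prediction toward the conditional one
    with guidance scale w (w = 1 recovers the conditional model)."""
    return eps_uncond + w * (eps_cond - eps_uncond)

# Toy stand-ins for noise predictions (not a real diffusion model).
eps_u = np.array([0.1, -0.2, 0.3])
eps_c = np.array([0.4,  0.1, 0.0])
for w in (1.0, 3.0, 7.5):
    print(f"w={w}: {cfg_blend(eps_u, eps_c, w)}")
# Larger w pushes predictions further toward the conditioning signal,
# trading diversity (and eventually sample quality) for fidelity.
```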
Why It Matters for Modern Models
- The same model can behave radically differently in products because decoding settings differ (temperature/top_p/top_k).
- Decoding is practical control over determinism vs diversity, and over repetition/degeneration failure modes.
- Inference-time control is “free” compared to retraining: you can shape behavior without changing weights.
- Sampling choices affect reliability: low temperature can reduce variance but can also lock in the wrong answer.
- This unifies LLM sampling with diffusion guidance: both expose a knob that trades diversity for conditioning/fidelity.
Missing Intuition
What is still poorly explained in textbooks and papers:
- Decoding is not “just formatting”: it changes the effective distribution you sample from at inference time.
- “Temperature = creativity” is sloppy—temperature is distribution shaping and can increase nonsense when the prompt is off-manifold.
- Top-p is dynamic: it adapts per step to the entropy of the distribution; it’s not the same as a fixed top-k (see the sketch after this list).
- Degeneration (repetition loops) is an inference pathology; decoding settings can create or fix it without any training change.
- For diffusion, “more guidance is better” is false—high guidance can push off-manifold and degrade quality.
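To make the top-p vs. top-k contrast above concrete, here is a toy comparison; the `nucleus_size` helper and probability vectors are illustrative:

```python
import numpy as np

def nucleus_size(probs: np.ndarray, p: float) -> int:
    """Number of tokens top-p keeps at this decoding step."""
    cdf = np.cumsum(np.sort(probs)[::-1])
    return int(np.searchsorted(cdf, p) + 1)

peaked = np.array([0.90, 0.05, 0.03, 0.01, 0.01])   # confident step (low entropy)
flat   = np.array([0.22, 0.21, 0.20, 0.19, 0.18])   # uncertain step (high entropy)
for name, probs in (("peaked", peaked), ("flat", flat)):
    print(f"{name}: top-p=0.9 keeps {nucleus_size(probs, 0.9)} tokens; top-k=3 always keeps 3")
# top-p keeps 1 token on the confident step and all 5 on the uncertain one;
# a fixed top-k ignores the model's per-step uncertainty.
```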