Legacy Concept Lab

Classifier-Free Guidance in Diffusion



Why It Matters for Modern Models

Classifier-free guidance (CFG) is a main reason Stable Diffusion, DALL-E, and Midjourney produce high-quality, on-prompt images
The guidance scale $w$ is the main user-facing knob for trading image quality and prompt adherence against diversity
A single model handles both conditional and unconditional generation, because the conditioning is randomly dropped during training
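The conditioning dropout mentioned in the last point can be sketched as below. This is a minimal illustration, not the training code of any particular system: the function name `drop_conditioning`, the all-zeros `null_cond` embedding, and the drop probability of 0.1 are assumptions chosen for the example.

```python
import numpy as np

def drop_conditioning(cond, null_cond, p_uncond, rng):
    """Replace each row of `cond` with `null_cond` with probability `p_uncond`.

    Rows that are replaced train the model as an unconditional denoiser;
    the remaining rows train it as a conditional one.
    """
    batch = cond.shape[0]
    drop = rng.random(batch) < p_uncond  # True -> this example loses its prompt
    mixed = np.where(drop[:, None], null_cond[None, :], cond)
    return mixed, drop

rng = np.random.default_rng(0)
cond = rng.normal(size=(8, 4))  # stand-in for 8 prompt embeddings (hypothetical)
null_cond = np.zeros(4)         # hypothetical "empty prompt" embedding
mixed, drop = drop_conditioning(cond, null_cond, p_uncond=0.1, rng=rng)
```

The denoiser would then be trained on `mixed` exactly as it would on `cond`, so the same weights learn both $\epsilon_\theta(x_t, c)$ and $\epsilon_\theta(x_t, \emptyset)$.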

What is still poorly explained in textbooks and papers:

CFG extrapolates beyond the data distribution: high guidance can produce unrealistic but more "prompt-adherent" images
The useful guidance scale sits in a middle range: too low and the prompt is ignored, too high and artifacts and oversaturation appear
CFG is analogous to temperature in LLMs: both reshape the model's distribution at inference time, without retraining
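The extrapolation and the loss of diversity can be seen in a one-dimensional toy case. The sketch below assumes an unconditional density $N(0, \sigma_u^2)$ and a conditional density $N(\mu_c, \sigma_c^2)$ (all values hypothetical) and computes the Gaussian whose score equals the guided score at a single noise level. It does not model the full sampling process.

```python
import math

def guided_gaussian(w, mu_c=1.0, sigma_c=1.0, sigma_u=2.0):
    """Mean and std of the density whose score is s_u + w * (s_c - s_u),
    for p(x) = N(0, sigma_u^2) and p(x|c) = N(mu_c, sigma_c^2).

    The guided score is linear in x, so it is the score of a Gaussian with
    precision (1 - w) / sigma_u^2 + w / sigma_c^2, provided that is positive.
    """
    precision = (1.0 - w) / sigma_u**2 + w / sigma_c**2
    if precision <= 0:
        raise ValueError("guided score is not the score of a normalizable density")
    mean = (w * mu_c / sigma_c**2) / precision
    return mean, 1.0 / math.sqrt(precision)

for w in (0.0, 1.0, 3.0, 7.5):
    mean, std = guided_gaussian(w)
    print(f"w={w:>4}: mean={mean:.3f}, std={std:.3f}")
```

With these toy values, raising $w$ above 1 pushes the mean past the conditional mean (away from the unconditional one) and shrinks the standard deviation, which mirrors the sharpening effect of lowering temperature in an LLM.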


Key Equation

\tilde{\epsilon}_\theta = \epsilon_\theta(\emptyset) + w \cdot (\epsilon_\theta(c) - \epsilon_\theta(\emptyset))

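With the noisy sample $x_t$ written explicitly, CFG combines the conditional and unconditional noise predictions; for $w > 1$ it extrapolates past the conditional prediction rather than interpolating between the two: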

\tilde{\epsilon}_\theta(x_t, c) = \epsilon_\theta(x_t, \emptyset) + w \cdot (\epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \emptyset))

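where $c$ is the conditioning (for example a text prompt), $\emptyset$ is the null (dropped) conditioning, and $w$ is the guidance scale (typically 3-15 for text-to-image). Setting $w = 0$ recovers the unconditional prediction and $w = 1$ the purely conditional one.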
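As a minimal sketch of the combination step, the function below applies the equation to two arrays. The name `cfg_combine` and the toy vectors are assumptions for illustration; in a real sampler the two inputs would be the network's outputs for the same $x_t$ with and without conditioning.

```python
import numpy as np

def cfg_combine(eps_uncond, eps_cond, w):
    """Guided noise prediction: eps_uncond + w * (eps_cond - eps_uncond)."""
    return eps_uncond + w * (eps_cond - eps_uncond)

# Toy stand-ins for eps_theta(x_t, empty) and eps_theta(x_t, c).
eps_uncond = np.array([0.0, 1.0, -1.0])
eps_cond = np.array([1.0, 1.0, 0.0])

for w in (0.0, 1.0, 2.0):
    print(w, cfg_combine(eps_uncond, eps_cond, w))
```

At $w = 2$ the result lies beyond `eps_cond` along the direction `eps_cond - eps_uncond`, which is the extrapolation that larger guidance scales rely on.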

Equivalently in score space, with $s(x_t) = \nabla_{x_t} \log p(x_t)$ denoting the unconditional score:

\tilde{s}(x_t, c) = s(x_t) + w \cdot \nabla_{x_t} \log p(c|x_t)

Higher $w$ amplifies the conditioning signal, trading diversity for fidelity.
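The score-space form follows from Bayes' rule in a few lines. In the sketch below, $s(x_t, c) = \nabla_{x_t} \log p(x_t \mid c)$ denotes the conditional score, a symbol introduced here for the derivation. It also relies on the standard assumption that the noise prediction is a rescaled score, $\epsilon_\theta \approx -\sigma_t \, s$ with $\sigma_t$ the noise level at step $t$ (also notation assumed here), so the same linear combination holds for $\epsilon$ and for $s$.

```latex
\begin{aligned}
\nabla_{x_t} \log p(x_t \mid c)
  &= \nabla_{x_t} \log p(x_t) + \nabla_{x_t} \log p(c \mid x_t)
  && \text{(Bayes' rule; } \log p(c) \text{ does not depend on } x_t\text{)} \\
s(x_t, c) - s(x_t)
  &= \nabla_{x_t} \log p(c \mid x_t) \\
\tilde{s}(x_t, c)
  &= s(x_t) + w \cdot \bigl( s(x_t, c) - s(x_t) \bigr)
   = s(x_t) + w \cdot \nabla_{x_t} \log p(c \mid x_t)
\end{aligned}
```

The last line shows that the difference of the two model outputs plays the role of an implicit classifier gradient, which is why no separate classifier is needed.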

Ho & Salimans, 2022, NeurIPS Workshop
