Legacy Concept Lab

Energy-Based Models & Score Functions

Unifies discriminative and generative modeling: classifier logits can be read as negative energies

Phase 10: Mathematical foundations & information geometry (Concept 53 of 100)

Why It Matters for Modern Models

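Unifies discriminative and generative modeling: a classifier's logits f(x)[y] can be read as negative joint energies -E(x, y), so differences between logits are energy differences
GAN discriminators can be viewed as learning energy functions
Score-based diffusion models learn the score -\nabla_x E(x) of noise-perturbed data via denoising score matching, so they can be viewed as EBMs that never evaluate the energy or Z directly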
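The logit-as-energy reading in the first point can be made concrete with a small sketch. It assumes a classifier whose logits f(x)[y] are taken as -E(x, y), and a marginal energy E(x) = -log sum_y exp(f(x)[y]); the function names and the example logits are hypothetical and chosen only for illustration.

```python
import numpy as np

def logsumexp(a, axis=-1):
    # Numerically stable log(sum(exp(a))) along one axis.
    m = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(a - m), axis=axis))

def joint_energy(logits):
    # Assumed convention: E(x, y) = -f(x)[y].
    return -logits

def marginal_energy(logits):
    # E(x) = -log sum_y exp(f(x)[y]), obtained by summing out the label y.
    return -logsumexp(logits, axis=-1)

def class_posterior(logits):
    # p(y | x) = exp(-E(x, y)) / exp(-E(x)), which is exactly the softmax.
    return np.exp(-joint_energy(logits) + marginal_energy(logits)[..., None])

# Hypothetical logits for one input and three classes.
logits = np.array([[2.0, 0.5, -1.0]])
probs = class_posterior(logits)

# The log-odds between two classes equals their energy difference.
log_odds_01 = np.log(probs[0, 0] / probs[0, 1])
energy_gap_01 = joint_energy(logits)[0, 1] - joint_energy(logits)[0, 0]
```

Under this convention the softmax never needs the partition function over x, which is why a discriminative classifier can be trained without ever touching Z.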

What is still poorly explained in textbooks and papers:

The partition function Z is intractable in high dimensions, so practical EBM training methods are designed to avoid computing it
Energy measures "how wrong this input looks": low energy means high probability
MCMC sampling from EBMs is slow; diffusion models sidestep this by learning the denoising path directly
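To illustrate what MCMC sampling from an EBM involves, here is a minimal sketch of unadjusted Langevin dynamics, one common gradient-based MCMC scheme. It assumes a toy energy E(x) = 0.5 x^2 (a standard Gaussian) so the result can be checked; `energy_grad`, `langevin_sample`, the step size, and the number of steps are illustrative choices, not values from the source.

```python
import numpy as np

def energy_grad(x):
    # Gradient of the assumed toy energy E(x) = 0.5 * x**2.
    return x

def langevin_sample(grad_fn, x0, step_size=0.1, n_steps=200, rng=None):
    # Unadjusted Langevin dynamics:
    #   x <- x - (step_size / 2) * grad E(x) + sqrt(step_size) * noise
    # Only the energy gradient is needed; Z never appears.
    rng = np.random.default_rng() if rng is None else rng
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        noise = rng.standard_normal(x.shape)
        x = x - 0.5 * step_size * grad_fn(x) + np.sqrt(step_size) * noise
    return x

# Run 5000 independent chains, all started far from the mode at x = 3.
rng = np.random.default_rng(0)
samples = langevin_sample(energy_grad, x0=np.full(5000, 3.0), rng=rng)
sample_mean, sample_var = samples.mean(), samples.var()
```

Even in this one-dimensional case every sample costs hundreds of gradient evaluations, and the finite step size leaves a small bias in the variance. Both costs grow for realistic energies, which is the slowness the point above refers to.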

If you want intuition first, start with the key equation and the visualization. Come back here for the full walkthrough.

Key Equation

p_\theta(x) = \frac{\exp(-E_\theta(x))}{Z_\theta}

Energy-based models define a probability density through an unnormalized energy function E_\theta, normalized by the partition function Z_\theta:

p_\theta(x) = \frac{\exp(-E_\theta(x))}{Z_\theta}, \quad Z_\theta = \int \exp(-E_\theta(x)) dx
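The definition of Z_\theta can be checked numerically in one dimension. The sketch below assumes the toy energy E(x) = 0.5 x^2, for which Z = sqrt(2 pi) is known in closed form, and approximates the integral with a Riemann sum on a grid; the grid range and resolution are arbitrary illustrative choices.

```python
import numpy as np

def energy(x):
    # Assumed toy energy; exp(-E) is an unnormalized standard Gaussian.
    return 0.5 * x**2

# Riemann-sum approximation of Z = integral of exp(-E(x)) dx on [-10, 10].
xs = np.linspace(-10.0, 10.0, 20001)
dx = xs[1] - xs[0]
unnormalized = np.exp(-energy(xs))
Z = np.sum(unnormalized) * dx

# Dividing by Z turns the unnormalized values into a proper density.
density = unnormalized / Z

# The same grid approach needs n**d energy evaluations in d dimensions.
grid_points_needed = {d: 100**d for d in (1, 2, 10)}
```

With 100 grid points per axis, ten dimensions already require 100^10 = 10^20 evaluations, which is why grid integration of Z is not an option for images or other high-dimensional inputs.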

The score function is the gradient of log-probability:

s_\theta(x) = \nabla_x \log p_\theta(x) = -\nabla_x E_\theta(x)
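The step from log-probability to energy gradient follows from taking the logarithm of the key equation. Since Z_\theta depends on \theta but not on x, its gradient with respect to x vanishes:

```latex
\nabla_x \log p_\theta(x)
  = \nabla_x \bigl[ -E_\theta(x) - \log Z_\theta \bigr]
  = -\nabla_x E_\theta(x) - \underbrace{\nabla_x \log Z_\theta}_{=\,0}
  = -\nabla_x E_\theta(x)
```

This is why score-based methods can work with the model without ever evaluating the partition function.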

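Maximum-likelihood training, with \mathcal{L} the expected negative log-likelihood. The second term comes from \nabla_\theta \log Z_\theta; contrastive divergence approximates it with short MCMC chains started at data points: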

\nabla_\theta \mathcal{L} = \mathbb{E}_{p_{data}}[\nabla_\theta E_\theta(x)] - \mathbb{E}_{p_\theta}[\nabla_\theta E_\theta(x)]
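The two expectations in this gradient can be estimated by Monte Carlo. The sketch below assumes a one-parameter toy energy E_\theta(x) = 0.5 (x - \theta)^2, so that p_\theta is a unit-variance Gaussian with mean \theta. It draws exact model samples in place of the short MCMC chains that contrastive divergence would use. The data distribution, learning rate, and iteration count are illustrative assumptions.

```python
import numpy as np

def grad_energy_theta(x, theta):
    # For the assumed energy E_theta(x) = 0.5 * (x - theta)**2,
    # the parameter gradient is dE/dtheta = theta - x.
    return theta - x

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.0, size=2000)  # synthetic "data" samples

theta = -1.0
learning_rate = 0.1
for _ in range(200):
    # Exact samples from p_theta = N(theta, 1); a real EBM would need MCMC here.
    model_samples = rng.normal(loc=theta, scale=1.0, size=2000)
    positive_phase = grad_energy_theta(data, theta).mean()           # E_data[grad E]
    negative_phase = grad_energy_theta(model_samples, theta).mean()  # E_model[grad E]
    grad = positive_phase - negative_phase
    theta -= learning_rate * grad  # gradient descent on the negative log-likelihood
```

The update lowers the energy of data points and raises the energy of model samples. It stops moving when the two expectations match, which in this toy case happens when \theta reaches the data mean.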

LeCun et al., 2006, MIT Press