Maximum Likelihood, Cross-Entropy & KL Divergence
Canonical Papers
A Neural Probabilistic Language Model (Bengio et al., 2003)
Core Mathematics
Almost every frontier model is trained by (approximate) maximum likelihood:

$$\theta^\star = \arg\max_\theta \; \mathbb{E}_{x \sim p_{\text{data}}}\!\left[\log p_\theta(x)\right]$$

Equivalently, minimize the empirical cross-entropy between the data distribution $p_{\text{data}}$ and the model $p_\theta$:

$$H(p_{\text{data}}, p_\theta) = -\,\mathbb{E}_{x \sim p_{\text{data}}}\!\left[\log p_\theta(x)\right]$$

This is the same as minimizing the KL divergence, since the entropy of the data is constant in $\theta$:

$$D_{\mathrm{KL}}(p_{\text{data}} \,\|\, p_\theta) = H(p_{\text{data}}, p_\theta) - H(p_{\text{data}})$$

For autoregressive LMs, the factorization comes from the chain rule:

$$\log p_\theta(x_{1:T}) = \sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t})$$
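The identities above can be checked numerically. A minimal sketch with two arbitrary toy distributions over a four-token vocabulary (the numbers are illustrative only):

```python
import numpy as np

# Toy "data distribution" p and model q over a 4-token vocabulary (arbitrary values).
p = np.array([0.50, 0.25, 0.15, 0.10])   # stands in for p_data
q = np.array([0.40, 0.30, 0.20, 0.10])   # stands in for p_theta

# Cross-entropy H(p, q) = -sum_x p(x) log q(x)
cross_entropy = -np.sum(p * np.log(q))

# Entropy H(p) and KL(p || q)
entropy = -np.sum(p * np.log(p))
kl = np.sum(p * np.log(p / q))

# Identity: H(p, q) = H(p) + KL(p || q).  H(p) does not depend on the model,
# so minimizing cross-entropy and minimizing KL(p || q) pick the same q.
assert np.isclose(cross_entropy, entropy + kl)

# The *empirical* cross-entropy is the average negative log-likelihood of
# samples drawn from p_data, i.e. the ordinary training loss.
rng = np.random.default_rng(0)
samples = rng.choice(len(p), size=200_000, p=p)
nll = -np.mean(np.log(q[samples]))
print(cross_entropy, nll)   # agree up to Monte Carlo error
```

The last two printed numbers agree up to Monte Carlo error, which is exactly the sense in which the average training loss estimates the cross-entropy.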
Key Equation

$$\mathcal{L}(\theta) = -\,\mathbb{E}_{x_{1:T} \sim p_{\text{data}}}\left[\sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t})\right]$$
Why It Matters for Modern Models
- Pretraining for GPT-4, Claude, Gemini, and Llama: next-token cross-entropy over web text and code (see the sketch after this list)
- Stable Diffusion and Sora optimize likelihood-style surrogates (the noise-prediction MSE is a reparameterized ELBO)
- Reward models in RLHF are trained via cross-entropy on human preference data
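As a concrete instance of the first bullet, here is a minimal sketch of the next-token cross-entropy objective in PyTorch. The embedding-plus-linear `model` is a hypothetical stand-in for a real transformer; only the tensor shapes and the loss matter here:

```python
import torch
import torch.nn.functional as F

def next_token_loss(model, tokens: torch.Tensor) -> torch.Tensor:
    """Mean negative log-likelihood per token for an autoregressive LM.

    `model` maps token ids of shape (batch, seq) to logits of shape
    (batch, seq, vocab); it is a placeholder, not a specific library API.
    """
    logits = model(tokens[:, :-1])            # predict token t from tokens < t
    targets = tokens[:, 1:]                   # shift targets by one position
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # (batch * (seq - 1), vocab)
        targets.reshape(-1),                  # (batch * (seq - 1),)
    )

# Usage with a tiny hypothetical model (a real run would use a transformer).
vocab, dim = 100, 32
model = torch.nn.Sequential(
    torch.nn.Embedding(vocab, dim),
    torch.nn.Linear(dim, vocab),
)
tokens = torch.randint(0, vocab, (4, 16))
loss = next_token_loss(model, tokens)
loss.backward()                               # gradients for one MLE / cross-entropy step
print(loss.item())                            # ~log(vocab) for an untrained model
```

`F.cross_entropy` applies log-softmax internally, so the returned value is exactly the average $-\log p_\theta(x_t \mid x_{<t})$ that the pretraining objective above prescribes.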
Missing Intuition
What is still poorly explained in textbooks and papers:
- Why the direction of KL matters: minimizing KL(p_data || p_θ), which is what maximum likelihood does, biases the model toward covering every mode of the data, whereas the reverse KL(p_θ || p_data) is mode-seeking and conservative (see the sketch after this list)
- How cross-entropy shapes behavior under distribution shift: the model still emits the "most likely token" under its learned p_θ even when the input is off-manifold, which is one route to hallucination
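The KL-direction point lends itself to a small 1-D experiment: fit a single Gaussian to a bimodal target by minimizing each direction of the KL divergence on a grid. This is an illustrative sketch; the target mixture, grid, and SciPy optimizer are assumptions for the demo, not anything from the sources above:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# Bimodal "data" density p and a single-Gaussian model q_theta, both on a grid.
x = np.linspace(-10, 10, 4001)
dx = x[1] - x[0]
p = 0.5 * norm.pdf(x, -3, 0.7) + 0.5 * norm.pdf(x, 3, 0.7)

def q(theta):
    mu, log_sigma = theta
    return norm.pdf(x, mu, np.exp(log_sigma))

def forward_kl(theta):                    # KL(p_data || q): the direction MLE minimizes
    return np.sum(p * (np.log(p + 1e-12) - np.log(q(theta) + 1e-12))) * dx

def reverse_kl(theta):                    # KL(q || p_data): mode-seeking / "conservative"
    qt = q(theta)
    return np.sum(qt * (np.log(qt + 1e-12) - np.log(p + 1e-12))) * dx

fwd = minimize(forward_kl, x0=[0.5, 0.0]).x
rev = minimize(reverse_kl, x0=[0.5, 0.0]).x
print("forward KL fit:  mu=%.2f  sigma=%.2f  (covers both modes)" % (fwd[0], np.exp(fwd[1])))
print("reverse KL fit:  mu=%.2f  sigma=%.2f  (locks onto one mode)" % (rev[0], np.exp(rev[1])))
```

Forward KL (the maximum-likelihood direction) returns a wide Gaussian that spreads mass over both modes; reverse KL collapses onto a single mode. That is the mass-covering vs. mode-seeking contrast the first bullet refers to, and it is why MLE-trained models prefer putting some probability everywhere the data goes over ignoring a mode.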