Core Training

Maximum Likelihood, Cross-Entropy & KL Divergence

Canonical Papers

A Neural Probabilistic Language Model

Bengio et al., 2003, JMLR

Core Mathematics

Almost every frontier model is trained by (approximate) maximum likelihood:

$$\max_\theta \sum_{i=1}^n \log p_\theta(x^{(i)})$$
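A toy illustration (not from the original page; the Bernoulli model and data are made up): maximizing the summed log-likelihood of coin flips over a parameter grid recovers the empirical frequency.

```python
# Minimal sketch (illustrative data): maximum likelihood for a Bernoulli model
# p_theta(x) = theta^x * (1 - theta)^(1 - x), fit by grid search.
import numpy as np

data = np.array([1, 0, 1, 1, 0, 1, 1, 1])        # observed samples x^(i)
thetas = np.linspace(1e-3, 1 - 1e-3, 999)        # candidate parameters

# Summed log-likelihood  sum_i log p_theta(x^(i))  for every candidate theta
log_lik = (data.sum() * np.log(thetas)
           + (len(data) - data.sum()) * np.log(1 - thetas))

theta_mle = thetas[np.argmax(log_lik)]
print(theta_mle, data.mean())                    # both ~0.75: MLE = empirical frequency
```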

Equivalently, minimize the empirical cross-entropy between the data distribution $\hat p$ and the model $p_\theta$:

$$\min_\theta H(\hat p, p_\theta) = \min_\theta \left[ -\mathbb{E}_{x \sim \hat p} \log p_\theta(x) \right]$$

Since $H(\hat p, p_\theta) = H(\hat p) + \mathrm{KL}(\hat p \,\|\, p_\theta)$ and the data entropy $H(\hat p)$ does not depend on $\theta$, this is the same as minimizing the KL divergence:

$$\mathrm{KL}(\hat p \,\|\, p_\theta) = \mathbb{E}_{x \sim \hat p} \log \frac{\hat p(x)}{p_\theta(x)}$$
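A quick numerical check of the identity behind this equivalence, $H(\hat p, p_\theta) = H(\hat p) + \mathrm{KL}(\hat p \,\|\, p_\theta)$, on a small categorical example (the distributions are invented for illustration):

```python
# Minimal sketch (made-up distributions): cross-entropy decomposes into
# data entropy plus KL, so minimizing it in theta minimizes the KL term.
import numpy as np

p_hat   = np.array([0.5, 0.3, 0.2])      # empirical data distribution
p_theta = np.array([0.4, 0.4, 0.2])      # model distribution

cross_entropy = -(p_hat * np.log(p_theta)).sum()
entropy       = -(p_hat * np.log(p_hat)).sum()
kl            =  (p_hat * np.log(p_hat / p_theta)).sum()

print(np.isclose(cross_entropy, entropy + kl))   # True
```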

For autoregressive LMs, the factorization over tokens comes from the chain rule:

$$p_\theta(x_1, \dots, x_T) = \prod_{t=1}^T p_\theta(x_t \mid x_{<t})$$
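Under this factorization the training loss is just the average per-token negative log-probability. A minimal PyTorch-style sketch (function and variable names are illustrative, not from the original):

```python
# Minimal sketch of the next-token cross-entropy loss (names are illustrative).
# logits: (batch, T, vocab) from an autoregressive model; tokens: (batch, T).
import torch
import torch.nn.functional as F

def next_token_loss(logits: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
    # Position t predicts token t+1, so shift the targets left by one.
    pred = logits[:, :-1, :]
    target = tokens[:, 1:]
    # Mean negative log-likelihood per token = empirical cross-entropy.
    return F.cross_entropy(pred.reshape(-1, pred.size(-1)), target.reshape(-1))

# Toy usage with random logits.
B, T, V = 2, 8, 100
loss = next_token_loss(torch.randn(B, T, V), torch.randint(0, V, (B, T)))
print(loss.item())   # roughly log(V) ≈ 4.6 for an untrained/random model
```
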
Key Equation
$$\min_\theta H(\hat p, p_\theta) = -\mathbb{E}_{x \sim \hat p} \log p_\theta(x)$$


Why It Matters for Modern Models

  • Pretraining of GPT-4, Claude, Gemini, and Llama: next-token cross-entropy over web text and code
  • Stable Diffusion and Sora optimize likelihood-style surrogates (the noise-prediction MSE is a reweighted form of the reparameterized ELBO)
  • Reward models in RLHF are trained with cross-entropy on human preference data (sketched after this list)
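For the reward-model bullet, the usual formulation is a Bradley-Terry-style pairwise objective, which is binary cross-entropy on the label "chosen beats rejected". A minimal sketch with placeholder reward scores:

```python
# Minimal sketch of the pairwise preference loss commonly used for reward models
# (Bradley-Terry style); the reward values below are placeholders.
import torch
import torch.nn.functional as F

def preference_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Binary cross-entropy with label "chosen wins": -log sigmoid(r_chosen - r_rejected)
    return -F.logsigmoid(r_chosen - r_rejected).mean()

r_chosen = torch.tensor([1.2, 0.3, 2.0])     # reward-model scores for preferred responses
r_rejected = torch.tensor([0.1, 0.5, 1.0])   # scores for rejected responses
print(preference_loss(r_chosen, r_rejected).item())
```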

Missing Intuition

What is still poorly explained in textbooks and papers:

  • Why the KL direction matters: minimizing the forward KL, $\mathrm{KL}(p_{\text{data}} \,\|\, p_\theta)$, forces the model to put mass on every data mode (mode-covering), while the reverse direction tolerates dropped modes and is more conservative (see the sketch after this list)
  • How cross-entropy shapes behavior under distribution shift: hallucination is the model emitting the "most likely token" under its learned $p_\theta$ even when the input lies off the training manifold
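A small numerical illustration of the KL-direction point (the distributions are invented for the example): forward KL blows up when the model starves a data mode, while reverse KL blows up when the model puts mass where the data has essentially none.

```python
# Minimal sketch (made-up distributions) contrasting forward and reverse KL.
import numpy as np

def kl(p, q):
    """KL(p || q) for discrete distributions with full support."""
    return float((p * np.log(p / q)).sum())

p_data  = np.array([0.495, 0.495, 0.010])   # data: two big modes, one rare bin
q_drop  = np.array([0.980, 0.010, 0.010])   # model that drops a data mode
q_smear = np.array([0.300, 0.300, 0.400])   # model that smears mass onto the rare bin

# Forward KL(p_data || q), the direction MLE effectively minimizes, heavily
# punishes starving a data mode, so it pushes models to cover every mode.
print(kl(p_data, q_drop), kl(p_data, q_smear))   # ~1.59 vs ~0.46

# Reverse KL(q || p_data) instead punishes putting mass where data has none,
# which is why reverse-KL objectives behave more conservatively (mode-seeking).
print(kl(q_drop, p_data), kl(q_smear, p_data))   # ~0.63 vs ~1.18
```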
