Core Training

Maximum Likelihood, Cross-Entropy & KL Divergence

Canonical Papers

A Neural Probabilistic Language Model

Bengio et al., 2003, JMLR

Core Mathematics

Almost every frontier model is trained by (approximate) maximum likelihood:

$$\max_\theta \sum_{i=1}^n \log p_\theta(x^{(i)})$$
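A toy illustration (not from the original page; the Bernoulli model and data are made up): maximizing the summed log-likelihood of coin flips over a parameter grid recovers the empirical frequency.

```python
# Minimal sketch (illustrative data): maximum likelihood for a Bernoulli model
# p_theta(x) = theta^x * (1 - theta)^(1 - x), fit by grid search.
import numpy as np

data = np.array([1, 0, 1, 1, 0, 1, 1, 1])        # observed samples x^(i)
thetas = np.linspace(1e-3, 1 - 1e-3, 999)        # candidate parameters

# Summed log-likelihood  sum_i log p_theta(x^(i))  for every candidate theta
log_lik = (data.sum() * np.log(thetas)
           + (len(data) - data.sum()) * np.log(1 - thetas))

theta_mle = thetas[np.argmax(log_lik)]
print(theta_mle, data.mean())                    # both ~0.75: MLE = empirical frequency
```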

Equivalently, minimize the empirical cross-entropy between the data distribution $\hat p$ and the model $p_\theta$:

$$\min_\theta H(\hat p, p_\theta) = \min_\theta \left[ -\mathbb{E}_{x \sim \hat p} \log p_\theta(x) \right]$$

Since $H(\hat p, p_\theta) = H(\hat p) + \mathrm{KL}(\hat p \,\|\, p_\theta)$ and the data entropy $H(\hat p)$ does not depend on $\theta$, this is the same as minimizing the KL divergence:

$$\mathrm{KL}(\hat p \,\|\, p_\theta) = \mathbb{E}_{x \sim \hat p} \log \frac{\hat p(x)}{p_\theta(x)}$$
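A quick numerical check of the identity behind this equivalence, $H(\hat p, p_\theta) = H(\hat p) + \mathrm{KL}(\hat p \,\|\, p_\theta)$, on a small categorical example (the distributions are invented for illustration):

```python
# Minimal sketch (made-up distributions): cross-entropy decomposes into
# data entropy plus KL, so minimizing it in theta minimizes the KL term.
import numpy as np

p_hat   = np.array([0.5, 0.3, 0.2])      # empirical data distribution
p_theta = np.array([0.4, 0.4, 0.2])      # model distribution

cross_entropy = -(p_hat * np.log(p_theta)).sum()
entropy       = -(p_hat * np.log(p_hat)).sum()
kl            =  (p_hat * np.log(p_hat / p_theta)).sum()

print(np.isclose(cross_entropy, entropy + kl))   # True
```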

For autoregressive LMs, the factorization over tokens comes from the chain rule:

$$p_\theta(x_1, \dots, x_T) = \prod_{t=1}^T p_\theta(x_t \mid x_{<t})$$
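Under this factorization the training loss is just the average per-token negative log-probability. A minimal PyTorch-style sketch (function and variable names are illustrative, not from the original):

```python
# Minimal sketch of the next-token cross-entropy loss (names are illustrative).
# logits: (batch, T, vocab) from an autoregressive model; tokens: (batch, T).
import torch
import torch.nn.functional as F

def next_token_loss(logits: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
    # Position t predicts token t+1, so shift the targets left by one.
    pred = logits[:, :-1, :]
    target = tokens[:, 1:]
    # Mean negative log-likelihood per token = empirical cross-entropy.
    return F.cross_entropy(pred.reshape(-1, pred.size(-1)), target.reshape(-1))

# Toy usage with random logits.
B, T, V = 2, 8, 100
loss = next_token_loss(torch.randn(B, T, V), torch.randint(0, V, (B, T)))
print(loss.item())   # roughly log(V) ≈ 4.6 for an untrained/random model
```
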
Key Equation
$$\min_\theta H(\hat p, p_\theta) = -\mathbb{E}_{x \sim \hat p} \log p_\theta(x)$$


Why It Matters for Modern Models

  • Pretraining of GPT-4, Claude, Gemini, and Llama: next-token cross-entropy over web text and code
  • Stable Diffusion and Sora optimize likelihood-style surrogates (the noise-prediction MSE is a reweighted form of the reparameterized ELBO)
  • Reward models in RLHF are trained with cross-entropy on human preference data (sketched after this list)
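For the reward-model bullet, the usual formulation is a Bradley-Terry-style pairwise objective, which is binary cross-entropy on the label "chosen beats rejected". A minimal sketch with placeholder reward scores:

```python
# Minimal sketch of the pairwise preference loss commonly used for reward models
# (Bradley-Terry style); the reward values below are placeholders.
import torch
import torch.nn.functional as F

def preference_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Binary cross-entropy with label "chosen wins": -log sigmoid(r_chosen - r_rejected)
    return -F.logsigmoid(r_chosen - r_rejected).mean()

r_chosen = torch.tensor([1.2, 0.3, 2.0])     # reward-model scores for preferred responses
r_rejected = torch.tensor([0.1, 0.5, 1.0])   # scores for rejected responses
print(preference_loss(r_chosen, r_rejected).item())
```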

Missing Intuition

What is still poorly explained in textbooks and papers:

  • Why the KL direction matters: minimizing the forward KL, $\mathrm{KL}(p_{\text{data}} \,\|\, p_\theta)$, forces the model to put mass on every data mode (mode-covering), while the reverse direction tolerates dropped modes and is more conservative (see the sketch after this list)
  • How cross-entropy shapes behavior under distribution shift: hallucination is the model emitting the "most likely token" under its learned $p_\theta$ even when the input lies off the training manifold
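A small numerical illustration of the KL-direction point (the distributions are invented for the example): forward KL blows up when the model starves a data mode, while reverse KL blows up when the model puts mass where the data has essentially none.

```python
# Minimal sketch (made-up distributions) contrasting forward and reverse KL.
import numpy as np

def kl(p, q):
    """KL(p || q) for discrete distributions with full support."""
    return float((p * np.log(p / q)).sum())

p_data  = np.array([0.495, 0.495, 0.010])   # data: two big modes, one rare bin
q_drop  = np.array([0.980, 0.010, 0.010])   # model that drops a data mode
q_smear = np.array([0.300, 0.300, 0.400])   # model that smears mass onto the rare bin

# Forward KL(p_data || q), the direction MLE effectively minimizes, heavily
# punishes starving a data mode, so it pushes models to cover every mode.
print(kl(p_data, q_drop), kl(p_data, q_smear))   # ~1.59 vs ~0.46

# Reverse KL(q || p_data) instead punishes putting mass where data has none,
# which is why reverse-KL objectives behave more conservatively (mode-seeking).
print(kl(q_drop, p_data), kl(q_smear, p_data))   # ~0.63 vs ~1.18
```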
