14. Scaling & Alignment

Scaling Laws & Emergent Abilities

Canonical Papers

Scaling Laws for Neural Language Models

Kaplan et al., 2020, arXiv

Training Compute-Optimal Large Language Models

Hoffmann et al., 2022, NeurIPS (Chinchilla)

Emergent Abilities of Large Language Models

Wei et al., 2022, TMLR

Core Mathematics

Test loss L obeys approximate power laws:

L(N, D) \approx L_\infty + a N^{-\alpha} + b D^{-\beta}

where N is the parameter count, D the number of training tokens, and α, β are empirically fitted exponents; the compute budget C is tied to N and D (for dense transformers, roughly C ≈ 6ND FLOPs).
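As a quick numerical sanity check, the Python sketch below evaluates this form with constants close to the Hoffmann et al. (2022) "Approach 3" fits; the constants and the helper name scaling_loss are illustrative assumptions, not exact reported values.

```python
# A rough numerical illustration of the loss power law above, using constants
# close to the Hoffmann et al. (2022) "Approach 3" fits; treat the numbers
# (and the helper name) as illustrative assumptions, not exact values.

def scaling_loss(n_params: float, n_tokens: float,
                 l_inf: float = 1.69,            # irreducible loss L_inf
                 a: float = 406.4, alpha: float = 0.34,
                 b: float = 410.7, beta: float = 0.28) -> float:
    """Predicted training loss for a model with n_params parameters
    trained on n_tokens tokens."""
    return l_inf + a * n_params ** (-alpha) + b * n_tokens ** (-beta)

# Example: a Chinchilla-scale run (70B parameters, 1.4T tokens).
print(f"{scaling_loss(70e9, 1.4e12):.3f}")
```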

Chinchilla rule: at a fixed compute budget, the loss-optimal frontier scales roughly as D ∝ N (on the order of 20 training tokens per parameter), so parameters should not be scaled up without a matching increase in data.
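The D ∝ N rule can be derived by minimizing the loss form above subject to the standard dense-transformer FLOP estimate C ≈ 6ND. A hedged sketch of that calculation follows; the constants are the same illustrative fits as above and the function name is hypothetical.

```python
# Sketch of the compute-optimal split: minimise a*N^-alpha + b*D^-beta
# subject to C ≈ 6*N*D (a standard FLOP estimate for dense transformers).
# Setting the derivative with respect to N to zero gives the closed form
# below. Constants and the function name are illustrative assumptions.

def compute_optimal_split(compute_flops: float,
                          a: float = 406.4, alpha: float = 0.34,
                          b: float = 410.7, beta: float = 0.28):
    """Return (N_opt, D_opt) for a given FLOP budget under C = 6*N*D."""
    g = (alpha * a / (beta * b)) ** (1.0 / (alpha + beta))
    n_opt = g * (compute_flops / 6.0) ** (beta / (alpha + beta))
    d_opt = compute_flops / (6.0 * n_opt)
    return n_opt, d_opt

# Example: a ~5.8e23 FLOP budget (roughly Chinchilla's training compute).
n_opt, d_opt = compute_optimal_split(5.8e23)
print(f"N ≈ {n_opt:.2e} params, D ≈ {d_opt:.2e} tokens")
```

Because the fitted exponents α and β are close, N_opt and D_opt both grow roughly like C^{1/2}, which is exactly the D ∝ N statement above.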

Some capabilities (chain-of-thought, few-shot reasoning) appear suddenly once scale crosses a threshold — "emergent abilities."
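One way to see how a sudden jump can arise from smooth underlying scaling (an illustrative sketch with assumed numbers, not a claim from Wei et al.): if per-step accuracy improves gradually with model size, an exact-match metric on a multi-step task, which needs every step correct, can sit near zero and then climb steeply.

```python
# Illustrative sketch only (assumed numbers, not a result from Wei et al.):
# a smooth power-law improvement in per-step accuracy can look like a sharp
# "emergent" jump when the metric requires every step of a task to be right.

def per_step_accuracy(n_params: float) -> float:
    # Hypothetical smooth improvement of per-step accuracy with scale.
    return 1.0 - 0.5 * (n_params / 1e8) ** -0.25

K_STEPS = 20  # exact-match accuracy on a 20-step task = p ** K_STEPS

for n_params in (1e8, 1e9, 1e10, 1e11, 1e12):
    p = per_step_accuracy(n_params)
    print(f"N={n_params:.0e}  per-step={p:.3f}  {K_STEPS}-step task={p ** K_STEPS:.4f}")
```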

Why It Matters for Modern Models

  • GPT-3.5/4, Claude, Gemini, Llama were all designed with these scaling behaviors in mind
  • Sora & SDXL apply similar scaling-law reasoning for image/video diffusion backbones

Missing Intuition

What is still poorly explained in textbooks and papers:

  • Why power-law scaling happens (statistical physics analogies, information-theoretic arguments)
  • Visual, interactive plots showing how task-specific performance evolves with scale (a minimal static sketch follows this list)
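As a stand-in for the interactive plots described in the last bullet, here is a minimal static matplotlib sketch plotting reducible loss against parameter count for several fixed data budgets; it reuses the same illustrative constants as the earlier sketches.

```python
# A minimal static stand-in for the interactive plot described above: reducible
# loss vs. parameter count N for several fixed token budgets D, using the same
# illustrative power-law constants as the earlier sketches.
import numpy as np
import matplotlib.pyplot as plt

L_INF, A, ALPHA, B, BETA = 1.69, 406.4, 0.34, 410.7, 0.28

n_params = np.logspace(8, 12, 200)             # 100M to 1T parameters
for n_tokens in (1e10, 1e11, 1e12, 1e13):      # fixed data budgets
    loss = L_INF + A * n_params ** -ALPHA + B * n_tokens ** -BETA
    plt.loglog(n_params, loss - L_INF, label=f"D = {n_tokens:.0e} tokens")

plt.xlabel("Parameters N")
plt.ylabel("Reducible loss L - L_inf")
plt.title("Loss vs. scale: the data budget sets the floor")
plt.legend()
plt.show()
```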

