14. Scaling & Alignment

Scaling Laws & Emergent Abilities

Canonical Papers

Scaling Laws for Neural Language Models

Kaplan et al., 2020, arXiv

Training Compute-Optimal Large Language Models

Hoffmann et al., 2022, NeurIPS (Chinchilla)

Emergent Abilities of Large Language Models

Wei et al., 2022, TMLR

Core Mathematics

Test loss L obeys approximate power laws:

L(N, D) \approx L_\infty + a N^{-\alpha} + b D^{-\beta}

where N is the parameter count, D the number of training tokens, and α, β are empirically fitted exponents; the compute budget C is tied to N and D (for dense transformers, roughly C ≈ 6ND FLOPs).
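As a quick numerical sanity check, the Python sketch below evaluates this form with constants close to the Hoffmann et al. (2022) "Approach 3" fits; the constants and the helper name scaling_loss are illustrative assumptions, not exact reported values.

```python
# A rough numerical illustration of the loss power law above, using constants
# close to the Hoffmann et al. (2022) "Approach 3" fits; treat the numbers
# (and the helper name) as illustrative assumptions, not exact values.

def scaling_loss(n_params: float, n_tokens: float,
                 l_inf: float = 1.69,            # irreducible loss L_inf
                 a: float = 406.4, alpha: float = 0.34,
                 b: float = 410.7, beta: float = 0.28) -> float:
    """Predicted training loss for a model with n_params parameters
    trained on n_tokens tokens."""
    return l_inf + a * n_params ** (-alpha) + b * n_tokens ** (-beta)

# Example: a Chinchilla-scale run (70B parameters, 1.4T tokens).
print(f"{scaling_loss(70e9, 1.4e12):.3f}")
```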

Chinchilla rule: at a fixed compute budget, the loss-optimal frontier scales roughly as D ∝ N (on the order of 20 training tokens per parameter), so parameters should not be scaled up without a matching increase in data.
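The D ∝ N rule can be derived by minimizing the loss form above subject to the standard dense-transformer FLOP estimate C ≈ 6ND. A hedged sketch of that calculation follows; the constants are the same illustrative fits as above and the function name is hypothetical.

```python
# Sketch of the compute-optimal split: minimise a*N^-alpha + b*D^-beta
# subject to C ≈ 6*N*D (a standard FLOP estimate for dense transformers).
# Setting the derivative with respect to N to zero gives the closed form
# below. Constants and the function name are illustrative assumptions.

def compute_optimal_split(compute_flops: float,
                          a: float = 406.4, alpha: float = 0.34,
                          b: float = 410.7, beta: float = 0.28):
    """Return (N_opt, D_opt) for a given FLOP budget under C = 6*N*D."""
    g = (alpha * a / (beta * b)) ** (1.0 / (alpha + beta))
    n_opt = g * (compute_flops / 6.0) ** (beta / (alpha + beta))
    d_opt = compute_flops / (6.0 * n_opt)
    return n_opt, d_opt

# Example: a ~5.8e23 FLOP budget (roughly Chinchilla's training compute).
n_opt, d_opt = compute_optimal_split(5.8e23)
print(f"N ≈ {n_opt:.2e} params, D ≈ {d_opt:.2e} tokens")
```

Because the fitted exponents α and β are close, N_opt and D_opt both grow roughly like C^{1/2}, which is exactly the D ∝ N statement above.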

Some capabilities (chain-of-thought, few-shot reasoning) appear suddenly once scale crosses a threshold — "emergent abilities."
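One way to see how a sudden jump can arise from smooth underlying scaling (an illustrative sketch with assumed numbers, not a claim from Wei et al.): if per-step accuracy improves gradually with model size, an exact-match metric on a multi-step task, which needs every step correct, can sit near zero and then climb steeply.

```python
# Illustrative sketch only (assumed numbers, not a result from Wei et al.):
# a smooth power-law improvement in per-step accuracy can look like a sharp
# "emergent" jump when the metric requires every step of a task to be right.

def per_step_accuracy(n_params: float) -> float:
    # Hypothetical smooth improvement of per-step accuracy with scale.
    return 1.0 - 0.5 * (n_params / 1e8) ** -0.25

K_STEPS = 20  # exact-match accuracy on a 20-step task = p ** K_STEPS

for n_params in (1e8, 1e9, 1e10, 1e11, 1e12):
    p = per_step_accuracy(n_params)
    print(f"N={n_params:.0e}  per-step={p:.3f}  {K_STEPS}-step task={p ** K_STEPS:.4f}")
```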

Why It Matters for Modern Models

  • GPT-3.5/4, Claude, Gemini, Llama were all designed with these scaling behaviors in mind
  • Sora & SDXL apply similar scaling-law reasoning for image/video diffusion backbones

Missing Intuition

What is still poorly explained in textbooks and papers:

  • Why power-law scaling happens (statistical physics analogies, information-theoretic arguments)
  • Visual, interactive plots showing how task-specific performance evolves with scale (a minimal static sketch follows this list)
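As a stand-in for the interactive plots described in the last bullet, here is a minimal static matplotlib sketch plotting reducible loss against parameter count for several fixed data budgets; it reuses the same illustrative constants as the earlier sketches.

```python
# A minimal static stand-in for the interactive plot described above: reducible
# loss vs. parameter count N for several fixed token budgets D, using the same
# illustrative power-law constants as the earlier sketches.
import numpy as np
import matplotlib.pyplot as plt

L_INF, A, ALPHA, B, BETA = 1.69, 406.4, 0.34, 410.7, 0.28

n_params = np.logspace(8, 12, 200)             # 100M to 1T parameters
for n_tokens in (1e10, 1e11, 1e12, 1e13):      # fixed data budgets
    loss = L_INF + A * n_params ** -ALPHA + B * n_tokens ** -BETA
    plt.loglog(n_params, loss - L_INF, label=f"D = {n_tokens:.0e} tokens")

plt.xlabel("Parameters N")
plt.ylabel("Reducible loss L - L_inf")
plt.title("Loss vs. scale: the data budget sets the floor")
plt.legend()
plt.show()
```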

