Scaling Laws & Emergent Abilities
Canonical Papers
- Scaling Laws for Neural Language Models
- Training Compute-Optimal Large Language Models
- Emergent Abilities of Large Language Models

Core Mathematics
Test loss obeys approximate power laws in each resource:

L(N) ≈ (N_c / N)^α_N,  L(D) ≈ (D_c / D)^α_D,  L(C) ≈ (C_c / C)^α_C

where N = parameters, D = dataset size (tokens), C = training compute; α_N, α_D, α_C are empirically fitted exponents.
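As a toy illustration of the parameter power law, the curve can be evaluated directly. The constants below are in the spirit of the published fits but should be treated as illustrative placeholders, not authoritative values:

```python
# Toy power-law loss curve L(N) = (N_c / N)**alpha_N.
# N_c and alpha_N are illustrative placeholders, not fitted values.

def loss_vs_params(n_params, n_c=8.8e13, alpha_n=0.076):
    """Predicted test loss for a model with n_params parameters."""
    return (n_c / n_params) ** alpha_n

for n in [1e8, 1e9, 1e10, 1e11]:
    print(f"N = {n:.0e}  ->  L = {loss_vs_params(n):.3f}")
```

The key qualitative behavior: loss falls slowly but steadily as N grows, with no plateau over many orders of magnitude.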
Chinchilla rule: for a fixed compute budget C ≈ 6ND, loss is minimized by scaling parameters and data together, roughly N ∝ C^0.5 and D ∝ C^0.5 (about 20 training tokens per parameter); don't over-scale parameters without matching data.
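The allocation above follows from a little algebra: with D = r·N and C = 6ND, we get C = 6rN², so N = sqrt(C / 6r). A minimal sketch, assuming the ~20 tokens-per-parameter rule of thumb:

```python
import math

def chinchilla_allocation(compute_flops, tokens_per_param=20.0):
    """Split a FLOP budget C using C ~ 6*N*D and D ~ r*N (Chinchilla rule of thumb).

    C = 6*N*D = 6*N*(r*N) = 6*r*N**2  ->  N = sqrt(C / (6*r)),  D = r*N.
    """
    n = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    d = tokens_per_param * n
    return n, d

# Example: a 1e23 FLOP budget.
n, d = chinchilla_allocation(1e23)
print(f"params ~ {n:.2e}, tokens ~ {d:.2e}")
```

Note that both N and D come out proportional to C^0.5, which is the compute-optimal frontier stated above.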
Some capabilities (chain-of-thought, few-shot reasoning) appear suddenly once scale crosses a threshold — "emergent abilities."
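One common intuition for why emergence can look abrupt (a toy model, not a claim from the papers): if a task requires k sub-steps that must all succeed, and per-step skill p improves smoothly with scale, then task accuracy p^k stays near zero until p is high, then jumps. The logistic per-step curve and k = 10 below are illustrative assumptions:

```python
# Toy model of emergence: smooth per-step skill -> abrupt task-level jump.
import math

def per_step_accuracy(log10_scale, midpoint=9.0, width=1.0):
    """Smoothly improving probability of getting one sub-step right."""
    return 1.0 / (1.0 + math.exp(-(log10_scale - midpoint) / width))

def task_accuracy(log10_scale, k=10):
    """Task needs all k sub-steps correct: accuracy = p**k."""
    return per_step_accuracy(log10_scale) ** k

for s in range(7, 13):
    print(f"10^{s} params: step acc {per_step_accuracy(s):.2f}, "
          f"task acc {task_accuracy(s):.4f}")
```

Per-step accuracy rises gradually across the whole range, while task accuracy is nearly flat at zero and then climbs sharply: smooth underlying gains can register as a sudden capability under an all-or-nothing metric.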
Key Equation

The Chinchilla parametric loss fit:

L(N, D) = E + A / N^α + B / D^β

where E is the irreducible loss of the data distribution and A, B, α, β are fitted constants; minimizing this under C ≈ 6ND yields the compute-optimal frontier above.
Why It Matters for Modern Models
- GPT-3.5/4, Claude, Gemini, Llama were all designed with these scaling behaviors in mind
- Sora & SDXL apply similar scaling-law reasoning for image/video diffusion backbones
Missing Intuition
What is still poorly explained in textbooks and papers:
- Why power-law scaling happens (statistical physics analogies, information-theoretic arguments)
- Visual, interactive plots showing evolving task-specific performance vs scale