5. Optimization

Overparameterization & Generalization, Double Descent

Canonical Papers

Understanding Deep Learning Requires Rethinking Generalization

Zhang et al., 2017, ICLR

Reconciling Modern Machine-Learning Practice and the Bias–Variance Trade-off

Belkin et al., 2019, PNAS

Core Mathematics

Classical learning theory predicts that test error is a U-shaped function of model capacity. Empirically, modern networks exhibit double descent: once the parameter count crosses the interpolation threshold (zero training error), test error falls again as capacity keeps growing.

In simple linear models:

\mathbb{E}[(y - \hat{y})^2] = \text{bias}^2 + \text{variance} + \sigma^2

The variance term explodes near the interpolation threshold, then falls again as overparameterization, together with the implicit regularization of the training procedure, takes over.
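
A minimal numpy sketch of that curve, under assumptions of my own choosing (a 1-D sine regression task and a random-ReLU feature map; none of this comes from the papers above): fit minimum-norm least squares while sweeping the feature count p past the number of training points. Test error should spike near p ≈ n and then descend again as p grows.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D regression task with noisy labels (illustrative choice).
def make_data(n, noise=0.3):
    x = rng.uniform(-1, 1, size=n)
    y = np.sin(2 * np.pi * x) + noise * rng.normal(size=n)
    return x, y

# Random ReLU features: phi_j(x) = max(0, w_j * x + b_j).
def features(x, W, b):
    return np.maximum(0.0, np.outer(x, W) + b)

n_train, n_test = 40, 500
x_tr, y_tr = make_data(n_train)
x_te, y_te = make_data(n_test, noise=0.0)   # noiseless test targets

for p in [5, 10, 20, 35, 40, 45, 60, 100, 300, 1000]:
    W = rng.normal(size=p)
    b = rng.uniform(-1, 1, size=p)
    Phi_tr = features(x_tr, W, b)
    Phi_te = features(x_te, W, b)
    # lstsq returns the minimum-norm solution once p > n_train
    # (the interpolating regime), mimicking gradient descent from zero init.
    coef, *_ = np.linalg.lstsq(Phi_tr, y_tr, rcond=None)
    train_mse = np.mean((Phi_tr @ coef - y_tr) ** 2)
    test_mse = np.mean((Phi_te @ coef - y_te) ** 2)
    print(f"p={p:5d}  train MSE={train_mse:8.4f}  test MSE={test_mse:8.4f}")
```

Averaging over several random seeds smooths the curve; a single run already shows the spike at the interpolation threshold followed by the second descent.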

Key Equation
\mathbb{E}[(y - \hat{y})^2] = \text{bias}^2 + \text{variance} + \sigma^2
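
To make the decomposition concrete, the sketch below estimates each term by Monte Carlo: many training sets are drawn for a fixed polynomial model class, and bias² and variance are measured at fixed test inputs. The true function, noise level, and polynomial degrees are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 0.3                             # noise std; sigma**2 is the irreducible term
f = lambda x: np.sin(2 * np.pi * x)     # "true" regression function

def fit_poly(degree, n=30):
    """Fit a degree-`degree` polynomial to one freshly drawn noisy training set."""
    x = rng.uniform(-1, 1, size=n)
    y = f(x) + sigma * rng.normal(size=n)
    return np.polynomial.polynomial.polyfit(x, y, degree)

x_test = np.linspace(-0.9, 0.9, 50)
n_trials = 2000

for degree in [1, 3, 9]:
    preds = np.array([
        np.polynomial.polynomial.polyval(x_test, fit_poly(degree))
        for _ in range(n_trials)
    ])                                              # shape (n_trials, n_test)
    bias2 = np.mean((preds.mean(axis=0) - f(x_test)) ** 2)
    variance = np.mean(preds.var(axis=0))
    expected_err = bias2 + variance + sigma ** 2    # the decomposition above
    print(f"degree={degree}: bias^2={bias2:.4f}  var={variance:.4f}  "
          f"bias^2+var+sigma^2={expected_err:.4f}")
```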


Why It Matters for Modern Models

  • GPT-4-class models sit deep in the overparameterized regime: parameters ≫ training examples, yet they generalize well
  • Chinchilla-style scaling laws model how test loss scales jointly with model capacity and training data (a rough sketch follows this list)
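
As a rough sketch of the Chinchilla-style picture, the snippet below uses the parametric loss L(N, D) = E + A/N^α + B/D^β from Hoffmann et al. (2022) and grid-searches the compute-optimal split of a FLOP budget C ≈ 6ND. Treat the constants (quoted here from the paper's reported fit) and the 6ND compute approximation as assumptions for illustration, not as a reproduction of the paper's method.

```python
import numpy as np

# Parametric loss fit reported in Hoffmann et al. (2022):
# L(N, D) = E + A / N**alpha + B / D**beta, with N = parameters, D = tokens.
E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

def loss(N, D):
    return E + A / N**alpha + B / D**beta

def compute_optimal(C, flops_per_token_param=6.0):
    """Grid-search the N that minimizes loss under the budget C ≈ 6 * N * D."""
    N_grid = np.logspace(7, 13, 2000)            # 10M to 10T parameters
    D_grid = C / (flops_per_token_param * N_grid)
    losses = loss(N_grid, D_grid)
    i = np.argmin(losses)
    return N_grid[i], D_grid[i], losses[i]

for C in [1e21, 1e23, 1e25]:                     # FLOP budgets
    N, D, L = compute_optimal(C)
    print(f"C={C:.0e} FLOPs -> N≈{N:.2e} params, D≈{D:.2e} tokens, "
          f"tokens/params≈{D/N:.1f}, predicted loss≈{L:.3f}")
```

The point of the exercise: under a fixed compute budget, loss is minimized by growing parameters and data together, not by capacity alone.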

Missing Intuition

What is still poorly explained in textbooks and papers:

  • Visual intuition for why larger nets can generalize better (not just "they memorize more")
  • How the implicit biases of different optimizers (SGD vs. Adam) select among the infinitely many interpolating solutions (see the gradient-descent sketch after this list)
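
One piece of that story can be shown directly: for an underdetermined least-squares problem, gradient descent initialized at zero converges to the minimum-ℓ2-norm interpolating solution, because every iterate stays in the row space of the data matrix. The numpy sketch below (a toy problem of my own construction, not any paper's setup) checks this by comparing the gradient-descent iterate with the pseudoinverse solution; it says nothing about Adam, whose implicit bias is less cleanly characterized.

```python
import numpy as np

rng = np.random.default_rng(2)

# Underdetermined least squares: with p > n there are infinitely many interpolators.
n, p = 20, 100
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

# Gradient descent on 0.5 * ||X w - y||^2, starting from w = 0.
w = np.zeros(p)
lr = 1.0 / np.linalg.norm(X, ord=2) ** 2     # step size below 1/L for stability
for _ in range(50_000):
    w -= lr * X.T @ (X @ w - y)

w_min_norm = np.linalg.pinv(X) @ y           # the minimum-l2-norm interpolator

print("training residual:        ", np.linalg.norm(X @ w - y))
print("distance to min-norm sol.:", np.linalg.norm(w - w_min_norm))
print("||w_gd||, ||w_min_norm||: ", np.linalg.norm(w), np.linalg.norm(w_min_norm))
```

Both residual and distance should be numerically zero: among all interpolating solutions, plain gradient descent picks the smallest-norm one.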
