Overparameterization & Generalization, Double Descent
Canonical Papers
Understanding Deep Learning Requires Rethinking Generalization (Zhang et al., 2017)
Reconciling Modern Machine-Learning Practice and the Bias–Variance Trade-off (Belkin et al., 2019)
Core Mathematics
Classical learning theory predicts that test error is a U-shaped function of model capacity. Empirically, modern networks exhibit double descent: once the parameter count crosses the interpolation threshold (zero training error), test error falls again as capacity keeps growing.
In simple linear models:
The variance of the least-squares fit blows up near the interpolation threshold, then falls again in the overparameterized regime, where the minimum-norm solution acts as an implicit regularizer.
Key Equation
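One standard equation for the linear setting above (a representative choice under assumed isotropic Gaussian features, not necessarily the formula originally shown here; notation: noise variance σ², true-signal norm ‖β‖, overparameterization ratio γ = p/n) is the asymptotic risk of the minimum-norm least-squares interpolator:

```latex
R(\gamma) \;=\;
\begin{cases}
\sigma^{2}\,\dfrac{\gamma}{1-\gamma} & \gamma < 1 \ \text{(underparameterized)}\\[1.5ex]
\lVert\beta\rVert^{2}\left(1-\dfrac{1}{\gamma}\right) + \sigma^{2}\,\dfrac{1}{\gamma-1} & \gamma > 1 \ \text{(overparameterized)}
\end{cases}
\qquad \gamma = \frac{p}{n}
```

Both branches blow up as γ → 1, the interpolation threshold, while the overparameterized variance term σ²/(γ − 1) keeps shrinking as γ grows, which is the second descent.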
Interactive Visualization
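In place of the interactive demo, here is a minimal numerical sketch (hypothetical settings: Gaussian features, a planted 10-dimensional signal, minimum-norm least squares via the pseudoinverse) that sweeps the feature count p past the interpolation threshold p = n and prints the test error:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p_true, sigma = 40, 10, 0.5                 # train size, signal dims, noise std (hypothetical)
beta = rng.normal(size=p_true) / np.sqrt(p_true)

def make_data(n_samples, p):
    """Gaussian features; the response depends only on the first p_true dimensions."""
    X_full = rng.normal(size=(n_samples, max(p, p_true)))
    y = X_full[:, :p_true] @ beta + sigma * rng.normal(size=n_samples)
    return X_full[:, :p], y                    # the model only sees the first p columns

def test_mse(p, n_test=5000):
    X, y = make_data(n, p)
    w = np.linalg.pinv(X) @ y                  # minimum-norm least squares; interpolates once p >= n
    X_te, y_te = make_data(n_test, p)
    return np.mean((X_te @ w - y_te) ** 2)

for p in [5, 10, 20, 30, 38, 40, 42, 60, 100, 300, 1000]:
    err = np.mean([test_mse(p) for _ in range(30)])
    print(f"p = {p:4d}   test MSE ~ {err:10.3f}")   # peaks near p = n = 40, then descends
```

The printed errors rise sharply as p approaches n = 40 and then fall again for p ≫ n, tracing the double-descent curve described above.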
Why It Matters for Modern Models
- GPT-4-class models are deep in the overparameterized regime: parameters ≫ training examples, yet generalize well
- Chinchilla-style scaling laws model how test loss scales with both capacity and data (a representative functional form is sketched below)
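For reference, the parametric form fit in the Chinchilla paper (Hoffmann et al., 2022) models test loss as a function of parameter count N and training tokens D; the exponents quoted below are the reported fits, stated approximately:

```latex
L(N, D) \;\approx\; E \;+\; \frac{A}{N^{\alpha}} \;+\; \frac{B}{D^{\beta}},
\qquad \alpha \approx 0.34,\quad \beta \approx 0.28
```

E is the irreducible loss, and the two power-law terms capture the capacity-limited and data-limited contributions, so loss keeps improving only when N and D are scaled together.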
Missing Intuition
What is still poorly explained in textbooks and papers:
- Visual intuition for why larger nets can generalize better (not just "they memorize more")
- How the implicit biases of different optimizers (SGD vs. Adam) select among the infinitely many interpolating solutions; a linear toy case is sketched below
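A minimal toy illustration of that second point, under assumed settings (underdetermined linear regression, full-batch gradient descent vs. a hand-rolled Adam with a decaying step size; not a claim about deep networks): gradient descent started at zero converges to the minimum-ℓ2-norm interpolator, whereas Adam's coordinate-wise rescaling generally settles on a different interpolating solution.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 20, 100                               # n < p: infinitely many interpolating solutions
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

def grad(w):
    """Gradient of the mean squared error 0.5/n * ||Xw - y||^2."""
    return X.T @ (X @ w - y) / n

def run_gd(steps=20_000, lr=1e-2):
    w = np.zeros(p)
    for _ in range(steps):
        w -= lr * grad(w)                    # updates stay in the row space of X
    return w

def run_adam(steps=20_000, lr=1e-2, b1=0.9, b2=0.999, eps=1e-8):
    w, m, v = np.zeros(p), np.zeros(p), np.zeros(p)
    for t in range(1, steps + 1):
        g = grad(w)
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g ** 2
        m_hat, v_hat = m / (1 - b1 ** t), v / (1 - b2 ** t)
        w -= (lr / np.sqrt(t)) * m_hat / (np.sqrt(v_hat) + eps)   # decaying step so Adam settles
    return w

w_min = np.linalg.pinv(X) @ y                # the unique minimum-l2-norm interpolator
for name, w in [("GD", run_gd()), ("Adam", run_adam())]:
    print(f"{name:5s} train MSE = {np.mean((X @ w - y) ** 2):.1e}   "
          f"||w||_2 = {np.linalg.norm(w):.3f}   ||w - w_min||_2 = {np.linalg.norm(w - w_min):.3f}")
```

Both optimizers drive the training error to (near) zero, but only gradient descent lands on the minimum-norm solution; which of the infinitely many interpolators an optimizer selects is precisely the implicit-bias question.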