5. Optimization

Overparameterization & Generalization, Double Descent

Canonical Papers

Understanding Deep Learning Requires Rethinking Generalization

Zhang et al., 2017, ICLR

Reconciling Modern Machine-Learning Practice and the Bias–Variance Trade-off

Belkin et al., 2019, PNAS

Core Mathematics

Classical learning theory predicts that test error is a U-shaped function of model capacity. Empirically, modern networks exhibit double descent: once the parameter count crosses the interpolation threshold (zero training error), test error falls again as capacity keeps growing.

In simple linear models:

\mathbb{E}[(y - \hat{y})^2] = \text{bias}^2 + \text{variance} + \sigma^2

The variance term explodes near the interpolation threshold, then falls again as overparameterization, together with the implicit regularization of the training procedure, takes over.
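
A minimal numpy sketch of that curve, under assumptions of my own choosing (a 1-D sine regression task and a random-ReLU feature map; none of this comes from the papers above): fit minimum-norm least squares while sweeping the feature count p past the number of training points. Test error should spike near p ≈ n and then descend again as p grows.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D regression task with noisy labels (illustrative choice).
def make_data(n, noise=0.3):
    x = rng.uniform(-1, 1, size=n)
    y = np.sin(2 * np.pi * x) + noise * rng.normal(size=n)
    return x, y

# Random ReLU features: phi_j(x) = max(0, w_j * x + b_j).
def features(x, W, b):
    return np.maximum(0.0, np.outer(x, W) + b)

n_train, n_test = 40, 500
x_tr, y_tr = make_data(n_train)
x_te, y_te = make_data(n_test, noise=0.0)   # noiseless test targets

for p in [5, 10, 20, 35, 40, 45, 60, 100, 300, 1000]:
    W = rng.normal(size=p)
    b = rng.uniform(-1, 1, size=p)
    Phi_tr = features(x_tr, W, b)
    Phi_te = features(x_te, W, b)
    # lstsq returns the minimum-norm solution once p > n_train
    # (the interpolating regime), mimicking gradient descent from zero init.
    coef, *_ = np.linalg.lstsq(Phi_tr, y_tr, rcond=None)
    train_mse = np.mean((Phi_tr @ coef - y_tr) ** 2)
    test_mse = np.mean((Phi_te @ coef - y_te) ** 2)
    print(f"p={p:5d}  train MSE={train_mse:8.4f}  test MSE={test_mse:8.4f}")
```

Averaging over several random seeds smooths the curve; a single run already shows the spike at the interpolation threshold followed by the second descent.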

Key Equation
\mathbb{E}[(y - \hat{y})^2] = \text{bias}^2 + \text{variance} + \sigma^2
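
To make the decomposition concrete, the sketch below estimates each term by Monte Carlo: many training sets are drawn for a fixed polynomial model class, and bias² and variance are measured at fixed test inputs. The true function, noise level, and polynomial degrees are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 0.3                             # noise std; sigma**2 is the irreducible term
f = lambda x: np.sin(2 * np.pi * x)     # "true" regression function

def fit_poly(degree, n=30):
    """Fit a degree-`degree` polynomial to one freshly drawn noisy training set."""
    x = rng.uniform(-1, 1, size=n)
    y = f(x) + sigma * rng.normal(size=n)
    return np.polynomial.polynomial.polyfit(x, y, degree)

x_test = np.linspace(-0.9, 0.9, 50)
n_trials = 2000

for degree in [1, 3, 9]:
    preds = np.array([
        np.polynomial.polynomial.polyval(x_test, fit_poly(degree))
        for _ in range(n_trials)
    ])                                              # shape (n_trials, n_test)
    bias2 = np.mean((preds.mean(axis=0) - f(x_test)) ** 2)
    variance = np.mean(preds.var(axis=0))
    expected_err = bias2 + variance + sigma ** 2    # the decomposition above
    print(f"degree={degree}: bias^2={bias2:.4f}  var={variance:.4f}  "
          f"bias^2+var+sigma^2={expected_err:.4f}")
```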


Why It Matters for Modern Models

  • GPT-4-class models sit deep in the overparameterized regime: parameters ≫ training examples, yet they generalize well
  • Chinchilla-style scaling laws model how test loss scales jointly with model capacity and training data (a rough sketch follows this list)
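
As a rough sketch of the Chinchilla-style picture, the snippet below uses the parametric loss L(N, D) = E + A/N^α + B/D^β from Hoffmann et al. (2022) and grid-searches the compute-optimal split of a FLOP budget C ≈ 6ND. Treat the constants (quoted here from the paper's reported fit) and the 6ND compute approximation as assumptions for illustration, not as a reproduction of the paper's method.

```python
import numpy as np

# Parametric loss fit reported in Hoffmann et al. (2022):
# L(N, D) = E + A / N**alpha + B / D**beta, with N = parameters, D = tokens.
E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

def loss(N, D):
    return E + A / N**alpha + B / D**beta

def compute_optimal(C, flops_per_token_param=6.0):
    """Grid-search the N that minimizes loss under the budget C ≈ 6 * N * D."""
    N_grid = np.logspace(7, 13, 2000)            # 10M to 10T parameters
    D_grid = C / (flops_per_token_param * N_grid)
    losses = loss(N_grid, D_grid)
    i = np.argmin(losses)
    return N_grid[i], D_grid[i], losses[i]

for C in [1e21, 1e23, 1e25]:                     # FLOP budgets
    N, D, L = compute_optimal(C)
    print(f"C={C:.0e} FLOPs -> N≈{N:.2e} params, D≈{D:.2e} tokens, "
          f"tokens/params≈{D/N:.1f}, predicted loss≈{L:.3f}")
```

The point of the exercise: under a fixed compute budget, loss is minimized by growing parameters and data together, not by capacity alone.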

Missing Intuition

What is still poorly explained in textbooks and papers:

  • Visual intuition for why larger nets can generalize better (not just "they memorize more")
  • How the implicit biases of different optimizers (SGD vs. Adam) select among the infinitely many interpolating solutions (see the gradient-descent sketch after this list)
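
One piece of that story can be shown directly: for an underdetermined least-squares problem, gradient descent initialized at zero converges to the minimum-ℓ2-norm interpolating solution, because every iterate stays in the row space of the data matrix. The numpy sketch below (a toy problem of my own construction, not any paper's setup) checks this by comparing the gradient-descent iterate with the pseudoinverse solution; it says nothing about Adam, whose implicit bias is less cleanly characterized.

```python
import numpy as np

rng = np.random.default_rng(2)

# Underdetermined least squares: with p > n there are infinitely many interpolators.
n, p = 20, 100
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

# Gradient descent on 0.5 * ||X w - y||^2, starting from w = 0.
w = np.zeros(p)
lr = 1.0 / np.linalg.norm(X, ord=2) ** 2     # step size below 1/L for stability
for _ in range(50_000):
    w -= lr * X.T @ (X @ w - y)

w_min_norm = np.linalg.pinv(X) @ y           # the minimum-l2-norm interpolator

print("training residual:        ", np.linalg.norm(X @ w - y))
print("distance to min-norm sol.:", np.linalg.norm(w - w_min_norm))
print("||w_gd||, ||w_min_norm||: ", np.linalg.norm(w), np.linalg.norm(w_min_norm))
```

Both residual and distance should be numerically zero: among all interpolating solutions, plain gradient descent picks the smallest-norm one.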
