Optimization

Loss Landscapes, Sharpness & Flat Minima

Canonical Papers

Deep Learning without Poor Local Minima

Kawaguchi, NeurIPS 2016

Sharpness-Aware Minimization for Efficiently Improving Generalization

Foret et al., ICLR 2021 (arXiv 2020)

Core Mathematics

SAM objective:

\min_w \max_{\|\epsilon\|_p \le \rho} L(w + \epsilon)

In practice, SAM approximates the inner maximization with a single ascent step (the formula below is the common p = 2 case):

1. Compute the approximate "worst-case" perturbation: \epsilon(w) \approx \rho \, \frac{\nabla L(w)}{\|\nabla L(w)\|_2}
2. Update the weights using the gradient evaluated at the perturbed point: \nabla L(w + \epsilon(w))
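
A minimal numerical sketch of this two-step update, using NumPy on a toy quadratic loss (the matrix A, the learning rate, and rho below are illustrative choices, not values from the paper):

```python
import numpy as np

# Toy quadratic loss L(w) = 0.5 * w^T A w with one sharp and one flat direction.
# All names and hyperparameters are illustrative, not taken from the SAM paper.
A = np.diag([10.0, 0.1])

def loss(w):
    return 0.5 * w @ A @ w

def grad(w):
    return A @ w

def sam_step(w, lr=0.05, rho=0.05):
    g = grad(w)
    # Step 1: single ascent step to the approximate worst-case perturbation
    # on the rho-ball (the p = 2 case of the inner maximization).
    eps = rho * g / (np.linalg.norm(g) + 1e-12)
    # Step 2: descend using the gradient evaluated at the perturbed weights.
    return w - lr * grad(w + eps)

w = np.array([1.0, 1.0])
for _ in range(200):
    w = sam_step(w)
print("final weights:", w, "final loss:", loss(w))
```

Only the gradient at the perturbed point w + \epsilon(w) drives the descent step; the gradient at w itself is used solely to construct the perturbation.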

Separately, theoretical results (Kawaguchi, 2016) show that certain deep networks' loss surfaces have no "bad" local minima: for deep linear networks every local minimum is a global minimum and every other critical point is a saddle point, with related statements for nonlinear networks holding only under much stronger assumptions.
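
Concretely, the cleanest instance of that claim is the deep linear network trained with squared error; an informal paraphrase of the setting (omitting the rank conditions on the data matrices X and Y):

L(W_1, \dots, W_H) = \tfrac{1}{2} \left\| W_H W_{H-1} \cdots W_1 X - Y \right\|_F^2

Despite being non-convex in the weights (W_1, \dots, W_H), every local minimum of this objective is a global minimum.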

Key Equation
\min_w \max_{\|\epsilon\|_p \le \rho} L(w + \epsilon)


Why It Matters for Modern Models

  • Frontier models rely on implicit flat-minima bias (mini-batch SGD, data augmentation, weight decay) for generalization
  • Fine-tuning and RLHF pipelines sometimes adopt SAM-like ideas to stabilize training

Missing Intuition

What is still poorly explained in textbooks and papers:

  • Most expositions show "sharp vs flat" in 2D, but we lack intuitive stories for high-dimensional, anisotropic sharpness (one numerical way to probe it is sketched after this list)
  • How mode connectivity (many minima connected by low-loss paths) interacts with flatness and weight averaging
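
To make "anisotropic sharpness" slightly more concrete, curvature can be probed direction by direction using only Hessian-vector products. The sketch below does this on a synthetic quadratic; the dimension, the eigenvalue profile, and all function names are illustrative assumptions, not drawn from any particular paper:

```python
import numpy as np

# Synthetic loss with strongly anisotropic curvature: a 50-dimensional quadratic
# whose Hessian has two sharp directions and 48 nearly flat ones (illustrative values).
rng = np.random.default_rng(0)
dim = 50
eigenvalues = np.concatenate([[100.0, 10.0], np.full(dim - 2, 0.01)])
Q, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
H = Q @ np.diag(eigenvalues) @ Q.T     # ground-truth Hessian, hidden from the probe below

def grad(w):
    return H @ w                       # gradient of L(w) = 0.5 * w^T H w

def hvp(w, v, h=1e-3):
    # Hessian-vector product from finite differences of the gradient,
    # so the full Hessian never has to be formed.
    return (grad(w + h * v) - grad(w - h * v)) / (2 * h)

def sharpest_curvature(w, iters=200):
    # Power iteration on Hessian-vector products: estimates the largest
    # Hessian eigenvalue, i.e. the curvature of the single sharpest direction.
    v = rng.normal(size=w.shape)
    v /= np.linalg.norm(v)
    for _ in range(iters):
        Hv = hvp(w, v)
        v = Hv / np.linalg.norm(Hv)
    return v @ hvp(w, v)               # Rayleigh quotient at the converged direction

w = rng.normal(size=dim)
random_dirs = rng.normal(size=(20, dim))
random_dirs /= np.linalg.norm(random_dirs, axis=1, keepdims=True)
typical = np.median([u @ hvp(w, u) for u in random_dirs])

print(f"sharpest-direction curvature ~ {sharpest_curvature(w):.1f}")   # close to 100
print(f"typical random-direction curvature ~ {typical:.2f}")           # far smaller
```

The same recipe carries over to real networks by computing Hessian-vector products with automatic differentiation; the full Hessian is never materialized, which is what makes directional sharpness measurable at scale.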
