Loss Landscapes, Sharpness & Flat Minima
Canonical Papers
Deep Learning without Poor Local Minima
Sharpness-Aware Minimization for Efficiently Improving Generalization
Core Mathematics
SAM objective:

$$\min_{w}\;\max_{\|\epsilon\|_2 \le \rho} L(w + \epsilon)$$

In practice:
1. Take a single gradient step to find the "worst-case" perturbation: $\hat{\epsilon}(w) = \rho \,\dfrac{\nabla_w L(w)}{\|\nabla_w L(w)\|_2}$
2. Update using the gradient at the perturbed weights: $w \leftarrow w - \eta \,\nabla_w L\big(w + \hat{\epsilon}(w)\big)$
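The two-step recipe can be written down directly. Below is a minimal sketch assuming PyTorch; `sam_step`, `rho`, and `lr` are illustrative names, not taken from the SAM paper's reference implementation.

```python
# Minimal SAM-style update sketch (assumes PyTorch; `model`, `loss_fn`, `rho`,
# and `lr` are placeholders for this illustration).
import torch

def sam_step(model, loss_fn, x, y, rho=0.05, lr=0.1):
    """One SAM update: ascend to the approximate worst-case perturbation, then descend."""
    params = [p for p in model.parameters() if p.requires_grad]

    # Step 1: gradient at the current weights.
    loss = loss_fn(model(x), y)
    grads = torch.autograd.grad(loss, params)

    # Scale to the rho-ball: eps = rho * g / ||g||_2 (global norm over all parameters).
    grad_norm = torch.sqrt(sum((g ** 2).sum() for g in grads)) + 1e-12
    eps = [rho * g / grad_norm for g in grads]

    # Move the weights to the (approximate) worst case w + eps.
    with torch.no_grad():
        for p, e in zip(params, eps):
            p.add_(e)

    # Step 2: gradient at the perturbed weights.
    perturbed_loss = loss_fn(model(x), y)
    sam_grads = torch.autograd.grad(perturbed_loss, params)

    # Undo the perturbation, then apply the SAM gradient at the original weights.
    with torch.no_grad():
        for p, e, g in zip(params, eps, sam_grads):
            p.sub_(e)
            p.sub_(lr * g)

    return loss.item()

if __name__ == "__main__":
    model = torch.nn.Linear(10, 1)
    x, y = torch.randn(32, 10), torch.randn(32, 1)
    print(sam_step(model, torch.nn.functional.mse_loss, x, y))
```

The perturbation is scaled by a single global ℓ2 norm across all parameters, matching the norm ball in the objective above.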
Theoretical results show certain deep networks' loss surfaces have no "bad" local minima (all local minima are global or near-global).
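A concrete, checkable instance is the deep linear network setting studied in Deep Learning without Poor Local Minima: the model class collapses to a single linear map, so the global minimum coincides with ordinary least squares and a plain gradient-descent run from random initialization can be compared against it. A small numerical illustration follows; the shapes, step size, and iteration count are arbitrary choices for the demo.

```python
# Numerical illustration: gradient descent on a three-layer *deep linear* network
# ends up near the global optimum, which here is just ordinary least squares.
import numpy as np

rng = np.random.default_rng(0)
n, d, h, k = 200, 10, 16, 3                      # samples, input dim, width, outputs
X = rng.normal(size=(n, d))
Y = X @ rng.normal(size=(d, k)) + 0.1 * rng.normal(size=(n, k))

# Global minimum of the equivalent linear model Y ~ X W (closed form).
W_star, *_ = np.linalg.lstsq(X, Y, rcond=None)
best_loss = 0.5 * np.mean(np.sum((X @ W_star - Y) ** 2, axis=1))

# Deep linear network f(X) = X W1 W2 W3; with width h >= min(d, k) the product
# can represent any linear map, so it shares the same global minimum as above.
W1 = rng.normal(size=(d, h)) / np.sqrt(d)
W2 = rng.normal(size=(h, h)) / np.sqrt(h)
W3 = rng.normal(size=(h, k)) / np.sqrt(h)

lr = 0.01
for _ in range(20_000):
    H1 = X @ W1
    H2 = H1 @ W2
    err = (H2 @ W3 - Y) / n                      # d(loss)/d(pred) for 0.5*mean ||.||^2
    g3 = H2.T @ err                              # backprop through the three layers
    g2 = H1.T @ (err @ W3.T)
    g1 = X.T @ (err @ W3.T @ W2.T)
    W1 -= lr * g1
    W2 -= lr * g2
    W3 -= lr * g3

final_loss = 0.5 * np.mean(np.sum((X @ W1 @ W2 @ W3 - Y) ** 2, axis=1))
print(f"gradient descent: {final_loss:.4f}   least-squares optimum: {best_loss:.4f}")
```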
Key Equation
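One natural candidate for the key equation is the bound that motivates SAM (Foret et al.): up to an increasing function $h$ of $\|w\|_2^2/\rho^2$, the population loss is controlled by the training loss plus a sharpness term,

$$ L_{\mathcal{D}}(w) \;\le\; \underbrace{\Big[\max_{\|\epsilon\|_2 \le \rho} L_{\mathcal{S}}(w+\epsilon) - L_{\mathcal{S}}(w)\Big]}_{\text{sharpness at } w} \;+\; L_{\mathcal{S}}(w) \;+\; h\!\left(\|w\|_2^2/\rho^2\right), $$

where $L_{\mathcal{D}}$ is the population loss and $L_{\mathcal{S}}$ the training loss.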
Interactive Visualization
Why It Matters for Modern Models
- Frontier models rely on the implicit flat-minima bias of their training recipes (mini-batch SGD noise, data augmentation, weight decay) for generalization
- Fine-tuning and RLHF pipelines sometimes adopt SAM-like ideas to stabilize training
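An explicit counterpart to this implicit bias, sometimes folded into fine-tuning recipes, is stochastic weight averaging (SWA), which keeps a running average of the weights in the hope of landing in a flatter region. The toy sketch below uses PyTorch's `torch.optim.swa_utils`; the model, data, and schedule are placeholders, not a recipe from any particular pipeline.

```python
# Sketch: stochastic weight averaging (SWA) as an explicit flat-minima technique.
import torch
from torch.optim.swa_utils import AveragedModel, SWALR

model = torch.nn.Linear(10, 1)
x, y = torch.randn(256, 10), torch.randn(256, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
swa_model = AveragedModel(model)            # running average of the weights
swa_sched = SWALR(opt, swa_lr=0.05)         # anneal toward a constant averaging LR

for epoch in range(50):
    opt.zero_grad()
    torch.nn.functional.mse_loss(model(x), y).backward()
    opt.step()
    if epoch >= 25:                         # start averaging partway through training
        swa_model.update_parameters(model)
        swa_sched.step()

# Evaluate the averaged weights (no BatchNorm here, so no update_bn pass is needed).
print(torch.nn.functional.mse_loss(swa_model(x), y).item())
```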
Missing Intuition
What is still poorly explained in textbooks and papers:
- Most expositions show "sharp vs flat" in 2D, but we lack intuitive stories for high-dimensional, anisotropic sharpness (curvature that varies greatly across directions)
- How mode connectivity (many minima connected by low-loss paths) interacts with flatness and weight averaging; the interpolation probe sketched below is one way to poke at this
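One way to make the mode-connectivity question concrete is to evaluate the loss along the straight line between two solutions (curved low-loss paths generalize this). A minimal probe, assuming two PyTorch models with identical architectures; `loss_along_path` and the toy training loop are illustrative names, not from any cited paper.

```python
# Linear-interpolation probe between two solutions w_a and w_b
# (e.g., two training runs, or a model and its weight-averaged version).
import copy
import torch

def loss_along_path(model_a, model_b, loss_fn, x, y, steps=11):
    """Evaluate the loss at w(t) = (1 - t) * w_a + t * w_b for t in [0, 1]."""
    probe = copy.deepcopy(model_a)
    losses = []
    with torch.no_grad():
        for t in torch.linspace(0.0, 1.0, steps):
            for p, pa, pb in zip(probe.parameters(),
                                 model_a.parameters(),
                                 model_b.parameters()):
                p.copy_((1.0 - t) * pa + t * pb)
            losses.append(loss_fn(probe(x), y).item())
    return losses

if __name__ == "__main__":
    # Toy usage: two independently initialized linear models trained on the same data.
    torch.manual_seed(0)
    x, y = torch.randn(256, 10), torch.randn(256, 1)
    models = []
    for seed in (1, 2):
        torch.manual_seed(seed)
        m = torch.nn.Linear(10, 1)
        opt = torch.optim.SGD(m.parameters(), lr=0.1)
        for _ in range(200):
            opt.zero_grad()
            torch.nn.functional.mse_loss(m(x), y).backward()
            opt.step()
        models.append(m)
    print(loss_along_path(models[0], models[1],
                          torch.nn.functional.mse_loss, x, y))
```

On a convex toy problem like this the straight path is trivially low-loss; the interesting behavior shows up for deep nonlinear networks, where independently trained solutions often exhibit a loss barrier unless curved paths or permutation alignment are used.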