Legacy Concept Lab
SGD & Momentum: The Workhorses of Optimization
Momentum is still used to train most vision models and is the default for many frameworks
#57 · SGD+Momentum · Optimization
key equation
v_{t+1} = \mu v_t + \nabla L, \quad \theta_{t+1} = \theta_t - \eta v_{t+1}
Phase 3: Optimization & generalization
Concept 57 of 100
Why It Matters for Modern Models
- SGD with momentum is still widely used to train vision models, especially convolutional networks, and ships as a built-in option in the major frameworks
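- Understanding momentum helps explain why adaptive methods such as Adam can sometimes generalize worse than well-tuned SGD with momentum
- Nesterov's accelerated gradient attains the optimal O(1/t^2) convergence rate among first-order methods on smooth convex problems with exact gradients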
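To make the convergence point concrete, here is a minimal sketch comparing plain gradient descent, Polyak (heavy-ball) momentum, and Nesterov momentum on a small convex quadratic. The loss, the starting point, the step size 1 / (largest curvature), and the momentum value (\sqrt{\kappa} - 1) / (\sqrt{\kappa} + 1) are all illustrative assumptions, not settings taken from this page, and the gradients are exact rather than stochastic.

```python
import math

# Illustrative convex quadratic: L(x, y) = 0.5 * (x**2 + 100 * y**2).
CURVATURES = (1.0, 100.0)  # Hessian eigenvalues, so the condition number is 100

def loss(theta):
    return 0.5 * sum(c * t * t for c, t in zip(CURVATURES, theta))

def grad(theta):
    return [c * t for c, t in zip(CURVATURES, theta)]

def run(steps, eta, mu=0.0, nesterov=False):
    """Minimize the quadratic from (1, 1); mu=0 gives plain gradient descent."""
    theta = [1.0, 1.0]
    v = [0.0, 0.0]
    for _ in range(steps):
        if nesterov:
            # Evaluate the gradient at the look-ahead point theta - eta * mu * v.
            point = [t - eta * mu * vi for t, vi in zip(theta, v)]
        else:
            point = theta
        g = grad(point)
        v = [mu * vi + gi for vi, gi in zip(v, g)]
        theta = [t - eta * vi for t, vi in zip(theta, v)]
    return loss(theta)

eta = 1.0 / max(CURVATURES)                # assumed step size: 1 / largest curvature
kappa = max(CURVATURES) / min(CURVATURES)  # condition number
mu = (math.sqrt(kappa) - 1) / (math.sqrt(kappa) + 1)

loss_gd = run(200, eta)
loss_heavy_ball = run(200, eta, mu)
loss_nesterov = run(200, eta, mu, nesterov=True)
print(loss_gd, loss_heavy_ball, loss_nesterov)
```

On this toy problem both momentum variants reach a far lower loss than plain gradient descent in the same number of steps. That is an illustration on one quadratic, not a proof of the general rate, and it says nothing about the noisy mini-batch setting.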
What Tutorials Skip
What is still poorly explained in textbooks and papers:
- Momentum is an exponentially weighted sum of past gradients (an exponential moving average up to a factor of 1 / (1 - \mu)), so it smooths over mini-batch noise
- Heavy ball analogy: momentum lets you roll through small bumps and ravines in the loss landscape
- Nesterov looks ahead: "if I keep going this way, what would the gradient be?"
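The moving-average reading of momentum can be checked numerically. The sketch below runs the recursion v <- \mu v + g over a stream of noisy gradients, confirms that the final buffer equals the weighted sum \sum_k \mu^k g_{t-k}, and compares the variance of the rescaled buffer (1 - \mu) v with that of the raw gradients. The noise model (a true gradient of 1.0 plus unit Gaussian noise) and the helper names are assumptions made up for illustration.

```python
import random

def momentum_buffer(grads, mu):
    """Run v <- mu * v + g over a gradient sequence and return every v."""
    v, history = 0.0, []
    for g in grads:
        v = mu * v + g
        history.append(v)
    return history

def weighted_sum(grads, mu):
    """Closed form for the last buffer value: sum over k of mu**k * g_{t-k}."""
    return sum(mu ** k * g for k, g in enumerate(reversed(grads)))

def variance(xs):
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)

random.seed(0)
mu = 0.9
true_grad = 1.0
# Hypothetical noisy mini-batch gradients: the true gradient plus Gaussian noise.
grads = [true_grad + random.gauss(0.0, 1.0) for _ in range(2000)]

velocity = momentum_buffer(grads, mu)
# Rescaling by (1 - mu) turns the weighted sum into an exponential moving average.
ema = [(1 - mu) * v for v in velocity]

# Skip the first 200 steps so the buffer has warmed up before comparing.
raw_var = variance(grads[200:])
ema_var = variance(ema[200:])
ema_mean = sum(ema[200:]) / len(ema[200:])
print(raw_var, ema_var, ema_mean)
```

The averaged signal stays centered near the true gradient while fluctuating much less than the individual mini-batch gradients, which is the smoothing effect described in the first bullet.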
Interactive Visualization
Core Math (Optional Deep Dive)
If you want intuition first, start with the key equation and the visualization. Come back here for the full walkthrough.
Key Equation
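Vanilla SGD:
\theta_{t+1} = \theta_t - \eta \nabla L(\theta_t)
Momentum (Polyak):
v_{t+1} = \mu v_t + \nabla L(\theta_t), \quad \theta_{t+1} = \theta_t - \eta v_{t+1}
Nesterov Momentum (look-ahead gradient):
v_{t+1} = \mu v_t + \nabla L(\theta_t - \eta \mu v_t), \quad \theta_{t+1} = \theta_t - \eta v_{t+1}
With momentum \mu, the effective learning rate is \eta / (1 - \mu): under a constant gradient g, the velocity converges to g / (1 - \mu), so each step moves \eta g / (1 - \mu).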
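The three update rules can be sketched as small step functions. This assumes the same convention as the key equation (velocity accumulates the raw gradient, parameters move by -\eta v) and, for Nesterov, a gradient evaluated at the look-ahead point \theta - \eta \mu v. The function names and the constant-gradient test loss are hypothetical, chosen only to show the steady-state step size \eta / (1 - \mu).

```python
def sgd_step(theta, grad_fn, eta):
    """Vanilla SGD: step against the gradient at the current point."""
    return theta - eta * grad_fn(theta)

def momentum_step(theta, v, grad_fn, eta, mu):
    """Polyak momentum: accumulate the gradient into v, then step along v."""
    v = mu * v + grad_fn(theta)
    return theta - eta * v, v

def nesterov_step(theta, v, grad_fn, eta, mu):
    """Nesterov momentum: take the gradient at the look-ahead point."""
    lookahead = theta - eta * mu * v
    v = mu * v + grad_fn(lookahead)
    return theta - eta * v, v

def constant_grad(theta):
    """Gradient of the hypothetical linear loss L(theta) = 1.0 * theta."""
    return 1.0

eta, mu = 0.1, 0.9
theta, v = 0.0, 0.0
for _ in range(500):
    previous = theta
    theta, v = momentum_step(theta, v, constant_grad, eta, mu)

last_step = previous - theta   # distance covered by the final update
effective_lr = eta / (1 - mu)  # predicted steady-state step per unit gradient
print(last_step, effective_lr)
```

With a unit gradient the final step matches \eta / (1 - \mu) to within floating-point error, ten times the nominal learning rate at \mu = 0.9. Some libraries use a different but related convention (for example, folding the learning rate into the velocity), so the exact form of the update may differ from this sketch.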