Legacy Concept Lab

SGD & Momentum: The Workhorses of Optimization

Momentum is still used to train most vision models and is the default for many frameworks


Why It Matters for Modern Models

  • Momentum is still used to train most vision models and is the default for many frameworks
  • Understanding momentum explains why adaptive methods (Adam) can be worse for generalization
  • Nesterov momentum achieves the optimal O(1/k²) convergence rate for first-order methods on smooth convex problems

What Tutorials Skip

What is still poorly explained in textbooks and papers:

  • Momentum is an exponentially weighted sum of past gradients (an exponential moving average up to a factor of 1 − μ), so it smooths over mini-batch noise
  • Heavy ball analogy: momentum lets you roll through small bumps and ravines in the loss landscape
  • Nesterov looks ahead: "if I keep going this way, what would the gradient be?"
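
The first point above can be checked numerically. The following is a minimal sketch, assuming scalar gradients and a zero initial velocity; the function names and the synthetic noisy gradients are illustrative, not part of the original page. It compares the recursive velocity update with the explicit sum of past gradients weighted by powers of μ.

```python
import random


def momentum_velocity(grads, mu):
    """Run the recursion v <- mu * v + g over a gradient sequence, starting at v = 0."""
    v = 0.0
    for g in grads:
        v = mu * v + g
    return v


def weighted_sum(grads, mu):
    """Explicit form: the newest gradient has weight 1, the one before mu, then mu^2, ..."""
    t = len(grads) - 1
    return sum(mu ** k * grads[t - k] for k in range(len(grads)))


# Hypothetical noisy gradients: true gradient 1.0 plus unit Gaussian noise.
rng = random.Random(0)
noisy_grads = [1.0 + rng.gauss(0.0, 1.0) for _ in range(200)]

mu = 0.9
v = momentum_velocity(noisy_grads, mu)
s = weighted_sum(noisy_grads, mu)

# Multiplying by (1 - mu) turns the weighted sum into a moving average.
moving_average = (1 - mu) * v
```

Here `v` and `s` agree up to floating-point error, and `moving_average` is an estimate of the underlying gradient that averages over roughly the last 1 / (1 − μ) = 10 noisy samples.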

Interactive Visualization

Core Math (Optional Deep Dive)

If you want intuition first, start with the key equation and the visualization. Come back here for the full walkthrough.

Key Equation
$$v_{t+1} = \mu v_t + \nabla L, \quad \theta_{t+1} = \theta_t - \eta v_{t+1}$$

Vanilla SGD:

$$\theta_{t+1} = \theta_t - \eta \nabla L(\theta_t)$$

Momentum (Polyak):

$$v_{t+1} = \mu v_t + \nabla L(\theta_t)$$
$$\theta_{t+1} = \theta_t - \eta v_{t+1}$$
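
To see the two update rules side by side, here is a minimal sketch in plain Python. The test problem is an assumption chosen for illustration: a two-dimensional quadratic L(θ) = ½(θ₁² + 100 θ₂²), which forms a narrow ravine. The helper names (`grad`, `loss`, `sgd_step`, `momentum_step`, `run`) and the hyperparameters are hypothetical.

```python
def grad(theta, curvatures):
    """Gradient of L(theta) = 0.5 * sum(c_i * theta_i^2)."""
    return [c * x for c, x in zip(curvatures, theta)]


def loss(theta, curvatures):
    return 0.5 * sum(c * x * x for c, x in zip(curvatures, theta))


def sgd_step(theta, g, lr):
    """Vanilla SGD: theta <- theta - lr * g."""
    return [x - lr * gi for x, gi in zip(theta, g)]


def momentum_step(theta, v, g, lr, mu):
    """Polyak momentum: v <- mu * v + g, then theta <- theta - lr * v."""
    v_next = [mu * vi + gi for vi, gi in zip(v, g)]
    theta_next = [x - lr * vi for x, vi in zip(theta, v_next)]
    return theta_next, v_next


def run(steps=100, lr=0.01, mu=0.9, curvatures=(1.0, 100.0)):
    theta_sgd = [1.0, 1.0]
    theta_mom = [1.0, 1.0]
    v = [0.0, 0.0]
    for _ in range(steps):
        theta_sgd = sgd_step(theta_sgd, grad(theta_sgd, curvatures), lr)
        theta_mom, v = momentum_step(
            theta_mom, v, grad(theta_mom, curvatures), lr, mu
        )
    return loss(theta_sgd, curvatures), loss(theta_mom, curvatures)


loss_sgd, loss_mom = run()
```

With these particular settings the learning rate is capped by the steep direction, so vanilla SGD crawls along the shallow one, while the momentum run reaches a lower loss in the same number of steps. Other curvatures or step sizes can change the picture.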

Nesterov Momentum (look-ahead gradient):

$$v_{t+1} = \mu v_t + \nabla L(\theta_t - \eta \mu v_t)$$
$$\theta_{t+1} = \theta_t - \eta v_{t+1}$$
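
The look-ahead form can be written as a single step function. This is a sketch under the same conventions as the equations above; `nesterov_step` and `grad_fn` are illustrative names, and the one-dimensional quadratic in the usage lines is an assumed test problem.

```python
def nesterov_step(theta, v, grad_fn, lr, mu):
    """One Nesterov momentum step.

    The gradient is evaluated at the look-ahead point theta - lr * mu * v,
    not at theta itself; the rest matches the Polyak update.
    """
    lookahead = [x - lr * mu * vi for x, vi in zip(theta, v)]
    g = grad_fn(lookahead)
    v_next = [mu * vi + gi for vi, gi in zip(v, g)]
    theta_next = [x - lr * vi for x, vi in zip(theta, v_next)]
    return theta_next, v_next


def quadratic_grad(theta):
    """Gradient of the assumed test loss L(theta) = 0.5 * theta^2."""
    return list(theta)


theta, v = nesterov_step([1.0], [2.0], quadratic_grad, lr=0.1, mu=0.5)
```

In this single step the look-ahead point is 0.9 rather than 1.0, so the gradient used is slightly smaller than the one plain momentum would see. That is the corrective effect of "looking ahead" when the velocity is already carrying the parameters toward the minimum.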

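With momentum $\mu \approx 0.9$, the effective learning rate is $\eta / (1 - \mu) = 10\eta$: under a constant gradient $g$, the velocity converges to $g / (1 - \mu)$, so each step moves the parameters by $\eta g / (1 - \mu)$.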
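
The factor of 10 can be verified with a short numerical sketch. It assumes a constant scalar gradient, which is an idealization; `steady_state_velocity` and the values of `lr` and `g` are illustrative.

```python
def steady_state_velocity(g, mu, steps=200):
    """Iterate v <- mu * v + g with a constant gradient g, starting at v = 0."""
    v = 0.0
    for _ in range(steps):
        v = mu * v + g
    return v


mu, lr, g = 0.9, 0.01, 1.0
v = steady_state_velocity(g, mu)

# Per-step parameter change divided by the gradient: the effective learning rate.
effective_lr = lr * v / g
```

After enough steps `v` is close to g / (1 − μ) = 10 and `effective_lr` is close to 10 × `lr`. With real, changing gradients the amplification is smaller, so this is best read as an upper bound on the scale of the steps.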

Canonical Papers

On the importance of initialization and momentum in deep learning

Sutskever et al., 2013, ICML

A method for unconstrained convex minimization problem with the rate of convergence O(1/k²)

Nesterov, 1983, Soviet Mathematics Doklady
