Legacy Concept Lab

SGD & Momentum: The Workhorses of Optimization


Concept 57 of 100 | Phase 3: Optimization & generalization

Why It Matters for Modern Models

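SGD with momentum remains a standard choice for training vision models and is supported out of the box by the major frameworks
Understanding momentum helps explain why adaptive methods such as Adam can generalize worse than SGD with momentum on some tasks
Nesterov momentum attains the optimal $O(1/t^2)$ convergence rate among first-order methods for smooth convex optimization with exact gradients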

What is still poorly explained in textbooks and papers:

Momentum = exponentially weighted sum of past gradients (an exponential moving average scaled by $1 / (1 - \mu)$), so it smooths over mini-batch noise
Heavy ball analogy: momentum lets you roll through small bumps and ravines in the loss landscape
Nesterov looks ahead: "if I keep going this way, what would the gradient be?"
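
The moving-average reading of momentum can be checked numerically. The sketch below is illustrative only: the noisy scalar gradients are made up, and the helper names (`momentum_buffer`, `weighted_sum`, `ema`) are hypothetical. It assumes the velocity starts at zero and compares three ways of computing the same quantity.

```python
import random

def momentum_buffer(grads, mu):
    """Run v_{t+1} = mu * v_t + g_t from v_0 = 0 and return the final v."""
    v = 0.0
    for g in grads:
        v = mu * v + g
    return v

def weighted_sum(grads, mu):
    """Unrolled form: the newest gradient has weight 1, the one before mu, then mu^2, ..."""
    T = len(grads)
    return sum(mu ** k * grads[T - 1 - k] for k in range(T))

def ema(grads, mu):
    """Normalized moving average: m_{t+1} = mu * m_t + (1 - mu) * g_t, from m_0 = 0."""
    m = 0.0
    for g in grads:
        m = mu * m + (1.0 - mu) * g
    return m

random.seed(0)
mu = 0.9
# Hypothetical data: a true gradient of 1.0 plus unit-variance mini-batch noise.
grads = [1.0 + random.gauss(0.0, 1.0) for _ in range(200)]

v = momentum_buffer(grads, mu)
unrolled = weighted_sum(grads, mu)
rescaled_ema = ema(grads, mu) / (1.0 - mu)
print(v, unrolled, rescaled_ema)  # the three values agree up to rounding
```

The velocity is the normalized average multiplied by $1 / (1 - \mu)$, which is why the two views differ only by a rescaling of the learning rate. For independent noise, the normalized average keeps roughly $(1 - \mu) / (1 + \mu)$ of the per-step variance in steady state, about 5% at $\mu = 0.9$.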

If you want intuition first, start with the key equation and the visualization. Come back here for the full walkthrough.

Key Equation

v_{t+1} = \mu v_t + \nabla L(\theta_t), \quad \theta_{t+1} = \theta_t - \eta v_{t+1}

Vanilla SGD:

\theta_{t+1} = \theta_t - \eta \nabla L(\theta_t)

Momentum (Polyak):

v_{t+1} = \mu v_t + \nabla L(\theta_t)

\theta_{t+1} = \theta_t - \eta v_{t+1}
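
As a minimal sketch of the two update rules above, the code below runs both on a toy ravine, the quadratic $L(\theta) = \frac{1}{2} \sum_i c_i \theta_i^2$ with curvatures 1 and 100. The loss, starting point, and hyperparameters are assumptions chosen for illustration, the function names are hypothetical, and the gradients are exact rather than mini-batch estimates.

```python
def grad(theta, curvatures):
    """Gradient of L(theta) = 0.5 * sum_i c_i * theta_i^2."""
    return [c * t for c, t in zip(curvatures, theta)]

def loss(theta, curvatures):
    return 0.5 * sum(c * t * t for c, t in zip(curvatures, theta))

def plain_sgd(theta, curvatures, lr, steps):
    """theta_{t+1} = theta_t - lr * grad(theta_t)."""
    for _ in range(steps):
        g = grad(theta, curvatures)
        theta = [t - lr * gi for t, gi in zip(theta, g)]
    return theta

def momentum_sgd(theta, curvatures, lr, mu, steps):
    """v_{t+1} = mu * v_t + grad(theta_t); theta_{t+1} = theta_t - lr * v_{t+1}."""
    v = [0.0] * len(theta)
    for _ in range(steps):
        g = grad(theta, curvatures)
        v = [mu * vi + gi for vi, gi in zip(v, g)]
        theta = [t - lr * vi for t, vi in zip(theta, v)]
    return theta

curvatures = [1.0, 100.0]   # shallow direction and steep direction
start = [10.0, 1.0]
lr, mu, steps = 0.01, 0.9, 200

loss_sgd = loss(plain_sgd(start, curvatures, lr, steps), curvatures)
loss_mom = loss(momentum_sgd(start, curvatures, lr, mu, steps), curvatures)
print(loss_sgd, loss_mom)
```

The learning rate is capped by the steep direction, so plain SGD crawls along the shallow one. With the same learning rate, momentum accumulates velocity along the shallow direction and ends at a far lower loss in this setup. The comparison is not a tuned benchmark: a different learning rate or curvature ratio changes the gap.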

Nesterov Momentum (look-ahead gradient):

v_{t+1} = \mu v_t + \nabla L(\theta_t - \eta \mu v_t)

\theta_{t+1} = \theta_t - \eta v_{t+1}
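
The only difference from Polyak momentum is where the gradient is evaluated. The sketch below makes that concrete with a single step on a 1-D quadratic $L(\theta) = \frac{1}{2} c \theta^2$. The quadratic, the nonzero starting velocity, and the function names are assumptions for illustration.

```python
def grad(theta, curvature):
    """Gradient of the 1-D quadratic L(theta) = 0.5 * curvature * theta^2."""
    return curvature * theta

def polyak_step(theta, v, lr, mu, curvature):
    """Gradient taken at the current point theta_t."""
    v_next = mu * v + grad(theta, curvature)
    return theta - lr * v_next, v_next

def nesterov_step(theta, v, lr, mu, curvature):
    """Gradient taken at the look-ahead point theta_t - lr * mu * v_t."""
    lookahead = theta - lr * mu * v
    v_next = mu * v + grad(lookahead, curvature)
    return theta - lr * v_next, v_next

lr, mu, curvature = 0.1, 0.5, 2.0
theta, v = 1.0, 1.0  # already moving toward the minimum at 0

polyak_theta, polyak_v = polyak_step(theta, v, lr, mu, curvature)
nesterov_theta, nesterov_v = nesterov_step(theta, v, lr, mu, curvature)
print(polyak_theta, nesterov_theta)

# Running the Nesterov update from rest also converges on this quadratic.
theta_n, v_n = 1.0, 0.0
for _ in range(50):
    theta_n, v_n = nesterov_step(theta_n, v_n, lr, mu, curvature)
print(theta_n)
```

In this example the look-ahead point is closer to the minimum, so the gradient seen by Nesterov is smaller and the step is slightly more cautious: about 0.76 after one step versus about 0.75 for Polyak. This is the sense in which the look-ahead acts as a correction before the velocity overshoots.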

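With momentum $\mu \approx 0.9$, the effective learning rate is $\eta / (1 - \mu) = 10\eta$: under a constant gradient $g$, the velocity converges to $g / (1 - \mu)$, so each step approaches $\eta g / (1 - \mu)$.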
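
The factor of 10 comes from a geometric series, which a short loop can confirm. This sketch assumes a constant scalar gradient of 1 and a velocity that starts at zero; `velocity_after` is a hypothetical helper name.

```python
def velocity_after(steps, mu, g=1.0):
    """Iterate v_{t+1} = mu * v_t + g from v_0 = 0 under a constant gradient g."""
    v = 0.0
    for _ in range(steps):
        v = mu * v + g
    return v

mu, lr = 0.9, 0.01
for steps in (1, 10, 50, 200):
    v = velocity_after(steps, mu)
    # lr * v is the size of the parameter step at that point in training.
    print(steps, v, lr * v)
```

After $T$ steps the velocity equals $g (1 - \mu^T) / (1 - \mu)$, so it starts at $g$ and saturates near $10g$ once $\mu^T$ is negligible. The 10x figure is therefore a steady-state value: early steps, and steps where the gradient changes direction, are smaller.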

Sutskever et al., 2013, ICML

Nesterov, 1983, Soviet Mathematics Doklady
