Legacy Concept Lab

Weight Decay & AdamW: Decoupled Regularization


Concept 58 of 100, Phase 3: Optimization & generalization

Why It Matters for Modern Models

AdamW is the standard optimizer for LLM training; GPT- and LLaMA-style models are typically trained with it
Confusing L2 regularization with weight decay is a common source of training bugs, because the two coincide for SGD but not for Adam
Weight decay strength is one of the most important hyperparameters for generalization

What is still poorly explained in textbooks and papers:

Adam rescales gradients per parameter, so an L2 penalty added to the loss gets rescaled too, which breaks the intended effect
AdamW applies weight decay directly to the weights, outside the adaptive gradient rescaling, so the regularization strength is preserved
Weight decay acts as a preference for simpler models: it keeps weights small unless the data strongly supports them
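The first two points can be checked numerically. The sketch below is a minimal illustration under stated assumptions, not code from the cited paper: it takes a single scalar weight, sets the data gradient to zero so that only the regularizer acts, and uses common Adam defaults. The function names are hypothetical.

```python
import numpy as np

def adam_l2_first_step(theta, lam, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """First Adam step when the only gradient is the L2 term lam * theta."""
    g = lam * theta                       # gradient of (lam / 2) * theta**2
    m = (1 - beta1) * g                   # moments start from zero
    v = (1 - beta2) * g**2
    m_hat = m / (1 - beta1)               # bias correction at t = 1
    v_hat = v / (1 - beta2)
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps)

def decoupled_first_step(theta, lam, lr=1e-3):
    """Decoupled weight decay with a zero data gradient."""
    return (1 - lr * lam) * theta

theta = 2.0
# How much the weight shrinks in one step, for a small and a 10x larger lambda.
shrink_l2_small = theta - adam_l2_first_step(theta, lam=0.01)
shrink_l2_large = theta - adam_l2_first_step(theta, lam=0.1)
shrink_wd_small = theta - decoupled_first_step(theta, lam=0.01)
shrink_wd_large = theta - decoupled_first_step(theta, lam=0.1)

print("Adam + L2:", shrink_l2_small, shrink_l2_large)
print("Decoupled:", shrink_wd_small, shrink_wd_large)
```

On the first step Adam's normalized update is roughly g / |g|, so the L2 shrinkage is close to the learning rate whatever the value of lambda. The decoupled shrinkage is lr * lam * theta, which scales linearly with lambda.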

If you want intuition first, start with the key equation and the visualization. Come back here for the full walkthrough.

Key Equation

\theta_{t+1} = (1 - \eta\lambda)\theta_t - \eta \cdot \text{Adam\_step}

L2 regularization adds a penalty to the loss:

L_{reg} = L + \frac{\lambda}{2} \|\theta\|^2

\nabla L_{reg} = \nabla L + \lambda \theta

Weight decay directly shrinks weights:

\theta_{t+1} = (1 - \eta \lambda) \theta_t - \eta \nabla L

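For SGD the two are equivalent: substituting \nabla L_{reg} into the update \theta_{t+1} = \theta_t - \eta \nabla L_{reg} gives exactly the weight decay rule above. For Adam they differ, because the \lambda \theta term is rescaled by Adam's per-parameter normalization along with the rest of the gradient.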
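The SGD case can be verified on a toy problem. The sketch below assumes a hypothetical quadratic loss and arbitrary values for the learning rate and decay strength; it only illustrates that the two update rules trace the same trajectory under plain SGD.

```python
import numpy as np

TARGET = np.array([1.0, -2.0, 0.5])

def grad_loss(theta):
    # Gradient of the toy loss L = 0.5 * ||theta - TARGET||^2 (assumed for illustration).
    return theta - TARGET

def sgd_l2_step(theta, lr, lam):
    # L2 regularization: the penalty gradient lam * theta is added to the loss gradient.
    return theta - lr * (grad_loss(theta) + lam * theta)

def sgd_weight_decay_step(theta, lr, lam):
    # Weight decay: shrink the weights first, then take a plain gradient step.
    return (1 - lr * lam) * theta - lr * grad_loss(theta)

theta_l2 = np.array([3.0, 1.0, -1.0])
theta_wd = theta_l2.copy()
for _ in range(50):
    theta_l2 = sgd_l2_step(theta_l2, lr=0.1, lam=0.01)
    theta_wd = sgd_weight_decay_step(theta_wd, lr=0.1, lam=0.01)

# The two trajectories agree up to floating-point rounding.
max_gap = np.max(np.abs(theta_l2 - theta_wd))
print(max_gap)
```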

AdamW decouples weight decay from gradient updates:

m_t = \beta_1 m_{t-1} + (1-\beta_1) \nabla L

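v_t = \beta_2 v_{t-1} + (1-\beta_2) (\nabla L)^2

\hat{m}_t = \frac{m_t}{1-\beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1-\beta_2^t}

\theta_{t+1} = \theta_t - \eta \left( \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda \theta_t \right)

This is the key equation with \text{Adam\_step} = \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon): the decay term \lambda \theta_t sits outside the adaptive rescaling.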
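A minimal NumPy sketch of this update, assuming the standard Adam bias corrections for \hat{m}_t and \hat{v}_t and illustrative hyperparameter values. `adamw_step` is a hypothetical helper written for this example, not a library API.

```python
import numpy as np

def adamw_step(theta, grad, m, v, t, lr=1e-3, lam=0.01,
               beta1=0.9, beta2=0.999, eps=1e-8):
    """One AdamW update; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad        # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad**2     # second-moment estimate
    m_hat = m / (1 - beta1**t)                # bias corrections
    v_hat = v / (1 - beta2**t)
    adam_step = m_hat / (np.sqrt(v_hat) + eps)
    # The decay term lam * theta is added after normalization,
    # so it is never divided by sqrt(v_hat).
    theta = theta - lr * (adam_step + lam * theta)
    return theta, m, v

# Sanity check: with the loss gradient switched off only the decay acts,
# so every step multiplies the weights by (1 - lr * lam).
theta0 = np.array([1.0, -2.0, 0.5])
theta, m, v = theta0.copy(), np.zeros(3), np.zeros(3)
lr, lam, n_steps = 0.1, 0.5, 10
for t in range(1, n_steps + 1):
    theta, m, v = adamw_step(theta, np.zeros(3), m, v, t, lr=lr, lam=lam)

expected = theta0 * (1 - lr * lam) ** n_steps
print(np.allclose(theta, expected))
```

In PyTorch, `torch.optim.AdamW` implements the decoupled form, whereas the `weight_decay` argument of `torch.optim.Adam` adds the decay term to the gradient, which is the L2 form.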

Loshchilov & Hutter, 2019, ICLR

Loshchilov & Hutter, 2018, arXiv
