Legacy Concept Lab

Weight Decay & AdamW: Decoupled Regularization

AdamW is the de facto standard optimizer for LLM training; GPT- and LLaMA-style models are typically trained with it.


Why It Matters for Modern Models

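  • Most large language models, including the GPT and LLaMA families, are trained with AdamW
  • Confusing L2 regularization with weight decay is a common source of training bugs: in PyTorch, for example, `weight_decay` in `torch.optim.Adam` adds an L2 term to the gradient, while `torch.optim.AdamW` applies decoupled decay
  • Weight decay strength is one of the most important hyperparameters for generalization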

What Tutorials Skip

What is still poorly explained in textbooks and papers:

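  • Adam divides each gradient coordinate by a running estimate of its magnitude, so an L2 term added to the gradient is divided too: weights with large gradients end up regularized less than weights with small gradients
  • AdamW applies weight decay directly to the weights, outside the adaptive rescaling, so every weight shrinks by the same factor (1 - ηλ) per step
  • Weight decay = "prefer simpler models": it keeps weights small unless the data strongly supports them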
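
The rescaling effect in the list above can be checked numerically. The following is a minimal sketch, not a reference implementation: `adam_step` and `run` are hypothetical helper names, the hyperparameters are picked only for illustration, and the "loss gradient" is an artificial zero-mean signal (alternating +g, -g) so that any net shrinkage comes from regularization alone.

```python
import math

def adam_step(theta, grad, state, lr=0.01, wd=0.1,
              beta1=0.9, beta2=0.999, eps=1e-8, decoupled=True):
    """One scalar Adam step with either decoupled decay (AdamW) or an L2 term."""
    if not decoupled:
        # L2 variant: the penalty gradient goes through Adam's rescaling.
        grad = grad + wd * theta
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad * grad
    m_hat = state["m"] / (1 - beta1 ** state["t"])
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    update = m_hat / (math.sqrt(v_hat) + eps)
    if decoupled:
        # AdamW variant: the decay term bypasses the rescaling.
        update += wd * theta
    return theta - lr * update

def run(grad_scale, decoupled, steps=1000):
    """Train one weight from 1.0 on a zero-mean gradient of the given scale."""
    theta = 1.0
    state = {"t": 0, "m": 0.0, "v": 0.0}
    for i in range(steps):
        g = grad_scale if i % 2 == 0 else -grad_scale
        theta = adam_step(theta, g, state, decoupled=decoupled)
    return theta

for scale in (10.0, 0.1):
    print(f"grad scale {scale:>4}: "
          f"Adam+L2 -> {run(scale, False):.4f}, AdamW -> {run(scale, True):.4f}")
```

Under these assumptions, the two AdamW runs should end at essentially the same value regardless of gradient scale, while the Adam+L2 runs diverge: the weight with small gradients is pulled toward zero much faster than the weight with large gradients.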

Interactive Visualization

Core Math (Optional Deep Dive)

If you want intuition first, start with the key equation and the visualization. Come back here for the full walkthrough.

Key Equation
$$\theta_{t+1} = (1 - \eta\lambda)\theta_t - \eta \cdot \text{Adam\_step}$$

L2 regularization adds a penalty to the loss:

$$L_{reg} = L + \frac{\lambda}{2} \|\theta\|^2$$
$$\nabla L_{reg} = \nabla L + \lambda \theta$$

Weight decay directly shrinks weights:

$$\theta_{t+1} = (1 - \eta \lambda) \theta_t - \eta \nabla L$$

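For plain SGD (no momentum), L2 regularization and weight decay produce the same update. For Adam they differ, because the penalty gradient $\lambda\theta$ is rescaled by the adaptive denominator.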
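
A short derivation makes the difference concrete. It is a sketch that drops Adam's momentum (setting $\beta_1 = 0$) to keep the algebra readable; $g_t$ is shorthand introduced here for the regularized gradient $\nabla L + \lambda\theta_t$, $\hat{v}_t$ is Adam's running estimate of $g_t^2$, and all divisions are elementwise.

$$
\begin{aligned}
\text{SGD + L2:}\quad \theta_{t+1} &= \theta_t - \eta\,(\nabla L + \lambda\theta_t) = (1 - \eta\lambda)\,\theta_t - \eta\,\nabla L \\
\text{Adam + L2:}\quad \theta_{t+1} &= \theta_t - \eta\,\frac{g_t}{\sqrt{\hat{v}_t} + \epsilon} = \left(1 - \frac{\eta\lambda}{\sqrt{\hat{v}_t} + \epsilon}\right)\theta_t - \eta\,\frac{\nabla L}{\sqrt{\hat{v}_t} + \epsilon}
\end{aligned}
$$

In the SGD line every weight shrinks by the same factor $(1 - \eta\lambda)$, which is exactly the weight decay rule. In the Adam line the shrink factor depends on $\sqrt{\hat{v}_t}$, so coordinates with a history of large gradients decay less than coordinates with small gradients.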

AdamW decouples weight decay from gradient updates:

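$$m_t = \beta_1 m_{t-1} + (1-\beta_1) \nabla L$$
$$v_t = \beta_2 v_{t-1} + (1-\beta_2) (\nabla L)^2$$
$$\hat{m}_t = \frac{m_t}{1-\beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1-\beta_2^t}$$
$$\theta_{t+1} = \theta_t - \eta \left( \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda \theta_t \right)$$

Here $\nabla L$ is the gradient of the unregularized loss and the square is elementwise; the $\lambda \theta_t$ term never enters $m_t$ or $v_t$.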
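
The update above translates almost line for line into code. The following is a minimal pure-Python sketch over lists of floats, assuming illustrative default hyperparameters; `adamw_step` is a hypothetical name, not a library function.

```python
import math

def adamw_step(theta, grad, m, v, t, lr=1e-3, wd=1e-2,
               beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one AdamW update elementwise; t is the 1-based step count."""
    new_theta, new_m, new_v = [], [], []
    for th, g, m_i, v_i in zip(theta, grad, m, v):
        # Moment estimates use the unregularized loss gradient only.
        m_i = beta1 * m_i + (1 - beta1) * g
        v_i = beta2 * v_i + (1 - beta2) * g * g
        # Bias correction for the zero-initialized moments.
        m_hat = m_i / (1 - beta1 ** t)
        v_hat = v_i / (1 - beta2 ** t)
        adam = m_hat / (math.sqrt(v_hat) + eps)
        # Decoupled decay: wd * th is added outside the adaptive rescaling.
        new_theta.append(th - lr * (adam + wd * th))
        new_m.append(m_i)
        new_v.append(v_i)
    return new_theta, new_m, new_v

theta = [1.0, -2.0, 0.5]
zeros = [0.0, 0.0, 0.0]

# With a zero loss gradient the Adam term vanishes, leaving pure decay.
decayed, _, _ = adamw_step(theta, zeros, zeros, zeros, t=1)
shrink = 1 - 1e-3 * 1e-2
print(all(math.isclose(a, shrink * b) for a, b in zip(decayed, theta)))
```

The zero-gradient check isolates the $(1 - \eta\lambda)$ factor from the key equation: when the loss contributes nothing, each weight is simply multiplied by that factor.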

Canonical Papers

Decoupled Weight Decay Regularization

Loshchilov & Hutter, 2019, ICLR

Fixing Weight Decay Regularization in Adam

Loshchilov & Hutter, 2018, arXiv
