Legacy Concept Lab
Weight Decay & AdamW: Decoupled Regularization
AdamW is the standard optimizer for LLM training—GPT, LLaMA, etc. all use it
#58 AdamW Optimization
key equation
\theta_{t+1} = (1 - \eta\lambda)\theta_t - \eta \cdot \text{Adam\_step}
Phase 3: Optimization & generalization
Concept 58 of 100
Why It Matters for Modern Models
- AdamW is the standard optimizer for LLM training—GPT, LLaMA, etc. all use it
- The distinction between L2 and weight decay is a common source of bugs in training
- Weight decay strength is one of the most important hyperparameters for generalization
What Tutorials Skip
What is still poorly explained in textbooks and papers:
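- Adam divides each gradient coordinate by a running estimate of its magnitude, so an L2 term folded into the gradient gets divided too: weights with large gradients are regularized less than weights with small gradients, breaking the intended effect
- AdamW applies weight decay directly to the weights, outside the adaptive rescaling, so every weight shrinks by the same factor 1 - \eta\lambda per step and the regularization strength is preserved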
- Weight decay = "prefer simpler models"—it keeps weights small unless data strongly supports them
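The rescaling effect described in the bullets above can be checked with a small sketch. It looks only at the first bias-corrected Adam step, where the adaptive denominator reduces to the absolute value of the total gradient. The function name `first_step_decay` and the hyperparameter values are illustrative assumptions, not part of any library.

```python
def first_step_decay(theta, data_grad, lr=1e-3, wd=1e-2, eps=1e-8, decoupled=False):
    """How much the regularizer alone shrinks `theta` on the first Adam step."""
    if decoupled:
        # AdamW-style: the decay term bypasses the adaptive denominator.
        return lr * wd * theta
    # Adam + L2: the penalty gradient wd * theta is added to the data gradient,
    # and the sum is divided by sqrt(v_hat) + eps. On step 1, after bias
    # correction, v_hat equals g**2, so the denominator is |g| + eps.
    g = data_grad + wd * theta
    return lr * (wd * theta) / (abs(g) + eps)


# Same weight value, very different data gradients.
for data_grad in (10.0, 0.001):
    l2 = first_step_decay(1.0, data_grad)
    aw = first_step_decay(1.0, data_grad, decoupled=True)
    print(f"data_grad={data_grad}: Adam+L2 shrink={l2:.3e}, AdamW shrink={aw:.3e}")
```

Under these assumptions, the Adam + L2 shrinkage depends strongly on the size of the data gradient, while the decoupled shrinkage is the same for both weights.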
Interactive Visualization
Core Math (Optional Deep Dive)
If you want intuition first, start with the key equation and the visualization. Come back here for the full walkthrough.
Key Equation
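L2 regularization adds a penalty to the loss, so the gradient gains an extra term \lambda\theta:
\mathcal{L}_{\text{reg}}(\theta) = \mathcal{L}(\theta) + \frac{\lambda}{2}\|\theta\|^2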
Weight decay directly shrinks the weights at every step:
\theta_{t+1} = (1 - \eta\lambda)\theta_t - \eta \nabla\mathcal{L}(\theta_t)
For vanilla SGD, L2 regularization with penalty \frac{\lambda}{2}\|\theta\|^2 and weight decay with the same \lambda give identical updates. For Adam, they differ!
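The SGD equivalence can be verified numerically. The sketch below assumes vanilla SGD (no momentum) and an L2 penalty of the form (wd / 2) * theta**2, whose gradient is wd * theta. Both function names are hypothetical.

```python
import math


def sgd_l2_step(theta, data_grad, lr, wd):
    # Gradient of loss + (wd / 2) * theta**2 is data_grad + wd * theta.
    return theta - lr * (data_grad + wd * theta)


def sgd_weight_decay_step(theta, data_grad, lr, wd):
    # Shrink the weight by (1 - lr * wd), then take the plain gradient step.
    return (1 - lr * wd) * theta - lr * data_grad


a = sgd_l2_step(2.0, 0.5, lr=0.1, wd=0.01)
b = sgd_weight_decay_step(2.0, 0.5, lr=0.1, wd=0.01)
print(a, b, math.isclose(a, b, rel_tol=1e-12))
```

Expanding the first update gives theta - lr * data_grad - lr * wd * theta, which is the second update term by term. The match holds only up to floating-point rounding.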
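AdamW decouples weight decay from gradient updates. The moment estimates \hat{m}_t and \hat{v}_t are built from the loss gradient only, and the decay acts on the weights directly:
\theta_{t+1} = (1 - \eta\lambda)\theta_t - \eta \cdot \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}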
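A minimal sketch of one decoupled update for a single scalar parameter, following the key equation \theta_{t+1} = (1 - \eta\lambda)\theta_t - \eta \cdot \text{Adam\_step}. The function name `adamw_step`, its signature, and the default hyperparameters are assumptions for illustration; this is not a copy of any library's implementation.

```python
import math


def adamw_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, wd=1e-2):
    """One AdamW-style update for a scalar parameter; returns (theta, m, v).

    `grad` is the gradient of the unregularized loss; `t` starts at 1.
    """
    # Moment estimates see only the loss gradient, never the decay term.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    # Bias correction for the zero-initialized moments.
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    adam_step = m_hat / (math.sqrt(v_hat) + eps)
    # Decoupled decay: shrink the weight, then apply the adaptive step.
    theta = (1 - lr * wd) * theta - lr * adam_step
    return theta, m, v


theta, m, v = 1.0, 0.0, 0.0
for t in range(1, 4):
    theta, m, v = adamw_step(theta, grad=2.0, m=m, v=v, t=t, lr=0.1, wd=0.1)
    print(t, theta)
```

In PyTorch, `torch.optim.AdamW` implements the decoupled form, whereas the `weight_decay` argument of `torch.optim.Adam` adds the L2 term to the gradient. Swapping one for the other without retuning the decay strength is one way the bug mentioned earlier can creep in.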