Adam & Adaptive Gradient Methods
Canonical Papers
- Adam: A Method for Stochastic Optimization (Kingma & Ba, 2015)
- On the Convergence of Adam and Beyond (Reddi et al., 2018)
Core Mathematics
For gradient $g_t$ at step $t$, with decay rates $\beta_1$ and $\beta_2$, Adam maintains bias-corrected exponential moving averages of the gradient and its elementwise square:

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t, \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2$$

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$
Convergence analyses show that naïve Adam can diverge on simple convex problems and motivate variants like AMSGrad.
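AMSGrad's modification, written in the same notation, keeps a running maximum of the second-moment estimate and uses it in place of $\hat{v}_t$ when scaling the step, so the per-coordinate effective step size never increases:

$$\tilde{v}_t = \max(\tilde{v}_{t-1}, v_t)$$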
Key Equation

$$\theta_t = \theta_{t-1} - \frac{\alpha\, \hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$
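A minimal NumPy sketch of one update step, assuming the formulas above (the function and variable names are illustrative, not from any particular library):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step count. Returns (theta, m, v)."""
    # Exponential moving averages of the gradient and its elementwise square
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Bias correction compensates for m and v being initialized at zero
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Per-coordinate scaled step; eps sits outside the square root, as in the update above
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```

Some implementations place $\epsilon$ inside the square root or fold the bias corrections into the learning rate; the behavior differs only when $\hat{v}_t$ is very small.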
Why It Matters for Modern Models
- Large foundation models almost universally use Adam or AdamW for pretraining and fine-tuning
- RLHF and diffusion training use Adam-style optimizers to handle noisy gradients and widely varying scales
Missing Intuition
What is still poorly explained in textbooks and papers:
- Geometric explanation of how per-coordinate scaling with $1/\sqrt{v_t}$ interacts with overparameterized networks, and why it sometimes hurts generalization relative to SGD (see the numerical sketch after this list)
- How Adam's bias correction and exponential averaging interact with curriculum learning and non-stationary objectives (e.g., RLHF)
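To make the per-coordinate scaling concrete, here is a small NumPy sketch (the gradient scales are illustrative, not from the text): two coordinates whose gradients differ by three orders of magnitude receive Adam steps of comparable size, because each coordinate is divided by its own $\sqrt{\hat{v}_t}$, while plain SGD steps inherit the raw gradient scale.

```python
import numpy as np

rng = np.random.default_rng(0)
# Coordinate 0 sees large gradients, coordinate 1 tiny ones (illustrative scales only).
grads = np.stack([rng.normal(0.0, 10.0, 1000),
                  rng.normal(0.0, 0.01, 1000)], axis=1)

lr, beta1, beta2, eps = 1e-3, 0.9, 0.999, 1e-8
m, v = np.zeros(2), np.zeros(2)
for t, g in enumerate(grads, start=1):
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    adam_step = lr * m_hat / (np.sqrt(v_hat) + eps)

sgd_step = lr * grads[-1]  # plain SGD step on the final gradient, for comparison
# Adam's ratio m_hat / sqrt(v_hat) is roughly scale-free, so both coordinates get
# steps of similar magnitude; the SGD steps differ by about three orders of magnitude.
print("Adam |step| per coordinate:", np.abs(adam_step))
print("SGD  |step| per coordinate:", np.abs(sgd_step))
```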