Optimization

Adam & Adaptive Gradient Methods

Canonical Papers

Adam: A Method for Stochastic Optimization

Kingma & Ba, 2014, ICLR

On the Convergence of Adam and Beyond

Reddi et al., 2018, ICLR

Core Mathematics

For the gradient $g_t = \nabla_\theta L_t(\theta_t)$:

$$
\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1-\beta_1)\, g_t \\
v_t &= \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2 \\
\hat m_t &= m_t / (1-\beta_1^t), \quad \hat v_t = v_t / (1-\beta_2^t) \\
\theta_{t+1} &= \theta_t - \alpha\, \frac{\hat m_t}{\sqrt{\hat v_t} + \varepsilon}
\end{aligned}
$$
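A minimal NumPy sketch of one Adam step, assuming the default hyperparameters from Kingma & Ba (2014); variable names mirror the symbols above, and `adam_step` is an illustrative helper rather than a library API.

```python
import numpy as np

def adam_step(theta, g, m, v, t, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update at (1-indexed) step t for parameters theta with gradient g."""
    m = beta1 * m + (1 - beta1) * g          # first-moment EMA  (m_t)
    v = beta2 * v + (1 - beta2) * g ** 2     # second-moment EMA (v_t)
    m_hat = m / (1 - beta1 ** t)             # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)             # bias-corrected second moment
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# usage: theta, m, v = adam_step(theta, grad_fn(theta), m, v, t)
```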

Convergence analyses (Reddi et al., 2018) show that naïve Adam can diverge even on simple convex problems; this motivates variants such as AMSGrad, sketched below.
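For contrast, a sketch of the AMSGrad fix: maintain a running maximum of the second-moment estimate so the per-coordinate effective step size never grows. The published variant omits bias correction; the extra `v_max` state is the only change relative to the Adam sketch above, and `amsgrad_step` is again an illustrative helper.

```python
import numpy as np

def amsgrad_step(theta, g, m, v, v_max, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One AMSGrad update: like Adam, but the denominator is non-decreasing."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    v_max = np.maximum(v_max, v)             # monotone second-moment estimate
    theta = theta - alpha * m / (np.sqrt(v_max) + eps)
    return theta, m, v, v_max
```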

Key Equation
$$
\theta_{t+1} = \theta_t - \alpha\, \frac{\hat m_t}{\sqrt{\hat v_t} + \varepsilon}
$$


Why It Matters for Modern Models

  • Large foundation models almost universally use Adam or AdamW for both pretraining and fine-tuning (a minimal usage sketch follows this list)
  • RLHF and diffusion-model training rely on Adam-style optimizers to handle noisy gradients and widely varying per-parameter gradient scales
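A minimal PyTorch sketch of the AdamW usage pattern referenced above. The model, batch, and hyperparameters (`lr=3e-4`, `betas=(0.9, 0.95)`, `weight_decay=0.1`) are placeholders chosen for illustration, not values taken from the papers cited here.

```python
import torch

model = torch.nn.Linear(512, 512)                       # placeholder model
optimizer = torch.optim.AdamW(model.parameters(),
                              lr=3e-4, betas=(0.9, 0.95),
                              weight_decay=0.1)          # decoupled weight decay

x = torch.randn(8, 512)                                  # placeholder batch
loss = model(x).pow(2).mean()                            # placeholder loss
loss.backward()                                          # populate .grad
optimizer.step()                                         # Adam update + weight decay
optimizer.zero_grad()
```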

Missing Intuition

What is still poorly explained in textbooks and papers:

  • A geometric account of how per-coordinate scaling by $1/\sqrt{\hat v_t}$ interacts with overparameterized networks, and why it sometimes hurts generalization relative to SGD
  • How Adam's bias correction and exponential averaging interact with curricula and non-stationary objectives (e.g., RLHF)

