Legacy Concept Lab

Learning Rate Schedules: Warmup, Decay & Cycling

The learning rate (LR) schedule is one of the highest-leverage hyperparameters: the same model can fail or succeed depending on the schedule.


Why It Matters for Modern Models

  • Warmup prevents early instability: Adam's moment estimates are unreliable in the first steps (they start at zero and have seen few gradients), so a large LR can cause divergence
  • Cosine decay with warmup is a common default for LLM pretraining (GPT, LLaMA, etc.)

What Tutorials Skip

What is still poorly explained in textbooks and papers:

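  • Why warmup helps: at random initialization, gradients are noisy and point in unreliable directions; small steps give the optimizer's moment estimates time to stabilize
  • Cosine vs. linear decay: with the same endpoints, cosine stays near the peak LR longer early on (exploration), drops faster mid-run, then flattens near the minimum (exploitation). With η_min = 0, cosine spends about 20% of steps above 90% of peak, versus 10% for linear
  • The "LR range test" estimates a good peak LR by training briefly while increasing the LR until the loss spikes; the peak is then chosen somewhat below that point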
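The range test described above can be sketched on a toy problem. This is a minimal illustration, not a reference implementation: the function name lr_range_test, its parameters (lr_start, lr_end, num_steps, blowup), and the quadratic loss are all assumptions made for the example.

```python
def lr_range_test(grad_fn, loss_fn, w0, lr_start=1e-4, lr_end=10.0,
                  num_steps=100, blowup=4.0):
    """Take one gradient step per LR value, growing the LR geometrically.

    Stops early once the loss exceeds `blowup` times the best loss seen.
    Returns a list of (lr, loss_after_step) pairs.
    """
    growth = (lr_end / lr_start) ** (1.0 / (num_steps - 1))
    w, lr = w0, lr_start
    best_loss = loss_fn(w)
    history = []
    for _ in range(num_steps):
        w = w - lr * grad_fn(w)          # plain gradient descent step
        loss = loss_fn(w)
        history.append((lr, loss))
        if loss > blowup * best_loss:    # loss spike: stop the test
            break
        best_loss = min(best_loss, loss)
        lr *= growth
    return history


# Toy problem (an assumption): loss(w) = 0.5 * curvature * w^2.
curvature = 10.0
history = lr_range_test(
    grad_fn=lambda w: curvature * w,
    loss_fn=lambda w: 0.5 * curvature * w * w,
    w0=1.0,
)
spike_lr = history[-1][0]
print(f"loss spiked at lr ~ {spike_lr:.3f} after {len(history)} steps")
```

For this quadratic, plain gradient descent diverges once the LR exceeds 2 / curvature = 0.2, so the spike should appear shortly after that value. On a real model the loss curve is noisy, and how far below the spike to set the peak LR is a judgment call.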

Interactive Visualization

Core Math (Optional Deep Dive)

If you want intuition first, start with the key equation and the visualization. Come back here for the full walkthrough.

Key Equation
\eta_t = \eta_{\min} + \frac{1}{2}(\eta_{\max} - \eta_{\min})\left(1 + \cos\left(\frac{\pi t}{T}\right)\right)
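As a quick check of the key equation, here is a minimal sketch in Python. The function name cosine_lr and the example values (1000 steps, peak 3e-4, floor 3e-5) are illustrative assumptions, not taken from any particular training recipe.

```python
import math


def cosine_lr(t: int, total_steps: int, lr_max: float, lr_min: float = 0.0) -> float:
    """Cosine-annealed learning rate at step t, for 0 <= t <= total_steps."""
    progress = t / total_steps  # runs from 0 to 1 over training
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


# Start, midpoint, and end of an illustrative 1000-step run.
for step in (0, 500, 1000):
    print(step, cosine_lr(step, total_steps=1000, lr_max=3e-4, lr_min=3e-5))
```

At t = 0 the cosine term is 1 and the LR equals the peak; at t = T it is -1 and the LR equals the minimum; at the midpoint the LR is the average of the two.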

Warmup ramps LR from 0 to peak over T_w steps:

\eta_t = \eta_{\max} \cdot \frac{t}{T_w}, \quad t \leq T_w

Cosine decay smoothly anneals:

\eta_t = \eta_{\min} + \frac{1}{2}(\eta_{\max} - \eta_{\min})\left(1 + \cos\left(\frac{t - T_w}{T - T_w}\pi\right)\right)
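The warmup and cosine formulas above combine into one piecewise schedule. The sketch below assumes 0 < T_w < T; the function and parameter names are hypothetical.

```python
import math


def warmup_cosine_lr(t: int, warmup_steps: int, total_steps: int,
                     lr_max: float, lr_min: float = 0.0) -> float:
    """Linear warmup for t <= warmup_steps, cosine decay afterwards.

    Assumes 0 < warmup_steps < total_steps and 0 <= t <= total_steps.
    """
    if t <= warmup_steps:
        # Linear ramp from 0 to lr_max.
        return lr_max * t / warmup_steps
    # Fraction of the decay phase completed, from 0 to 1.
    progress = (t - warmup_steps) / (total_steps - warmup_steps)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


# Illustrative values: 100 warmup steps out of 1000.
for step in (0, 50, 100, 550, 1000):
    print(step, warmup_cosine_lr(step, 100, 1000, lr_max=3e-4, lr_min=3e-5))
```

The two pieces agree at t = T_w, where both give the peak LR, so the schedule is continuous. With this formula the step at t = 0 has LR 0 and makes no update; a one-step offset such as (t + 1) / T_w is one way to avoid that.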

1/√t decay (classical):

\eta_t = \frac{\eta_0}{\sqrt{t}}
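For comparison, the 1/√t rule is a one-liner. The sketch below assumes steps are counted from t = 1 (the formula is undefined at t = 0); the name inverse_sqrt_lr and the base rate 0.1 are illustrative.

```python
import math


def inverse_sqrt_lr(t: int, lr0: float) -> float:
    """Classical 1/sqrt(t) decay; steps are counted from t = 1."""
    if t < 1:
        raise ValueError("t must be >= 1")
    return lr0 / math.sqrt(t)


# The LR halves by step 4 and drops tenfold by step 100.
for step in (1, 4, 100):
    print(step, inverse_sqrt_lr(step, lr0=0.1))
```

Unlike the cosine schedule, this rule needs no total step count T, but it falls steeply at the start and never reaches a fixed floor.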

Modern LLM training typically uses linear warmup followed by cosine decay; some recipes hold the LR constant at the peak in between (warmup → constant → cosine decay).
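The warmup, constant, cosine-decay shape can be sketched as a single function. The length of the constant phase (constant_steps) is a hypothetical parameter introduced here; setting it to 0 gives plain warmup plus cosine decay.

```python
import math


def three_phase_lr(t: int, warmup_steps: int, constant_steps: int,
                   total_steps: int, lr_max: float, lr_min: float = 0.0) -> float:
    """Linear warmup, then a constant plateau at lr_max, then cosine decay.

    Assumes warmup_steps > 0 and warmup_steps + constant_steps < total_steps.
    """
    decay_start = warmup_steps + constant_steps
    if t < warmup_steps:
        return lr_max * t / warmup_steps   # phase 1: linear ramp
    if t < decay_start:
        return lr_max                      # phase 2: hold at peak
    # Phase 3: cosine decay; clamp so steps past total_steps stay at lr_min.
    progress = min((t - decay_start) / (total_steps - decay_start), 1.0)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


# Illustrative split: 100 warmup, 400 constant, 500 decay steps.
schedule = [three_phase_lr(t, 100, 400, 1000, lr_max=3e-4, lr_min=3e-5)
            for t in range(1001)]
print(schedule[0], schedule[100], schedule[499], schedule[750], schedule[1000])
```

Writing the schedule as a pure function of the step index keeps it easy to plot and unit-test before plugging it into an optimizer.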

Canonical Papers

SGDR: Stochastic Gradient Descent with Warm Restarts

Loshchilov & Hutter, ICLR 2017
