Legacy Concept Lab

Learning Rate Schedules: Warmup, Decay & Cycling

The learning-rate (LR) schedule is one of the highest-leverage hyperparameters: the same model can fail or succeed depending on the schedule.

Concept 47 of 100 · Phase 3: Optimization & generalization


Why It Matters for Modern Models

The schedule is high-leverage: with architecture and data fixed, changing only the schedule can turn a diverging run into a converging one
Warmup prevents early instability: early in training Adam's moment estimates rest on only a few gradients and are unreliable, so a large LR can cause divergence
Cosine decay with warmup is a common default for LLM pretraining (GPT, LLaMA, etc.)

What is still poorly explained in textbooks and papers:

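Why warmup helps: at random initialization gradients are noisy and point in unreliable directions; small early steps give the optimizer's moment estimates time to stabilize
Cosine vs linear decay: cosine stays near the peak LR for longer (exploration) and then anneals faster toward the end (exploitation)
The "LR range test" estimates a workable peak LR by training briefly while increasing the LR until the loss spikes, then choosing a value below the spike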
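The LR range test from the last point can be sketched on a toy problem. This is a minimal illustration under stated assumptions, not a reference implementation: the function name `lr_range_test`, the exponential sweep, the one-step-per-LR loop, the `spike_factor` stopping rule, and the quadratic toy loss are all hypothetical choices made here for clarity.

```python
def lr_range_test(grad_fn, loss_fn, w0, lr_min=1e-4, lr_max=10.0,
                  num_steps=200, spike_factor=4.0):
    """Sweep the LR exponentially from lr_min to lr_max, one gradient step per LR.

    Returns (history, spike_lr). history is a list of (lr, loss) pairs;
    spike_lr is the first LR whose loss exceeds spike_factor times the best
    loss seen so far, or None if no spike occurs.
    """
    w = w0
    best_loss = loss_fn(w)
    history = []
    # Constant multiplicative increase so the sweep is uniform in log(LR).
    ratio = (lr_max / lr_min) ** (1.0 / (num_steps - 1))
    for step in range(num_steps):
        lr = lr_min * ratio ** step
        w = w - lr * grad_fn(w)          # plain gradient-descent update
        loss = loss_fn(w)
        history.append((lr, loss))
        if loss > spike_factor * best_loss:
            return history, lr           # loss has spiked: stop the sweep
        best_loss = min(best_loss, loss)
    return history, None


# Toy problem (assumption): loss(w) = 2 * w**2, so the gradient is 4 * w and
# gradient descent becomes unstable once the LR exceeds 2 / 4 = 0.5.
history, spike_lr = lr_range_test(lambda w: 4.0 * w, lambda w: 2.0 * w * w, w0=1.0)
print(f"loss spiked at lr ~ {spike_lr:.3f} after {len(history)} steps")
```

On this toy loss the spike appears shortly after the LR crosses the stability limit of 0.5. In practice the peak LR is usually chosen some margin below the spike rather than at it; how large a margin is a judgment call.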

If you want intuition first, start with the key equation and the visualization. Come back here for the full walkthrough.

Key Equation

\eta_t = \eta_{\min} + \frac{1}{2}(\eta_{\max} - \eta_{\min})\left(1 + \cos\left(\frac{\pi t}{T}\right)\right)
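
As a quick sanity check on the key equation, reading $t$ as the current step and $T$ as the total number of steps (an interpretation assumed here, since the symbols are not defined explicitly), the start, midpoint, and end evaluate to:

\eta_0 = \eta_{\min} + \frac{1}{2}(\eta_{\max} - \eta_{\min})\left(1 + \cos 0\right) = \eta_{\max}

\eta_{T/2} = \eta_{\min} + \frac{1}{2}(\eta_{\max} - \eta_{\min})\left(1 + \cos\frac{\pi}{2}\right) = \frac{\eta_{\max} + \eta_{\min}}{2}

\eta_T = \eta_{\min} + \frac{1}{2}(\eta_{\max} - \eta_{\min})\left(1 + \cos\pi\right) = \eta_{\min}

At the quarter point $t = T/4$ the factor $\frac{1}{2}\left(1 + \cos\frac{\pi}{4}\right) \approx 0.854$, whereas a linear decay over the same horizon would already be at $0.75$ of the range; at $t = 3T/4$ the two values are about $0.146$ and $0.25$. The cosine curve therefore sits above the linear one in the first half and below it in the second.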

Warmup ramps the LR linearly from 0 to the peak $\eta_{\max}$ over $T_w$ steps:

\eta_t = \eta_{\max} \cdot \frac{t}{T_w}, \quad t \leq T_w

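After warmup, cosine decay anneals smoothly from $\eta_{\max}$ to $\eta_{\min}$ over the remaining $T - T_w$ steps:

\eta_t = \eta_{\min} + \frac{1}{2}(\eta_{\max} - \eta_{\min})\left(1 + \cos\left(\frac{t - T_w}{T - T_w}\pi\right)\right), \quad T_w < t \leq T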
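
The warmup and cosine-decay formulas above can be combined into one function. This is a minimal sketch: the name `warmup_cosine_lr`, its argument names, and the choice to hold the LR at `lr_min` for steps past `total_steps` are assumptions made for illustration, and the numeric values in the usage lines are arbitrary.

```python
import math


def warmup_cosine_lr(t, total_steps, warmup_steps, lr_max, lr_min=0.0):
    """LR at step t: linear warmup for t <= warmup_steps, cosine decay afterwards."""
    if total_steps <= warmup_steps:
        raise ValueError("total_steps must be greater than warmup_steps")
    if warmup_steps > 0 and t <= warmup_steps:
        # eta_t = eta_max * t / T_w
        return lr_max * t / warmup_steps
    # Assumption: steps beyond total_steps are clamped, so the LR stays at lr_min.
    t = min(t, total_steps)
    # Fraction of the decay phase completed, in [0, 1].
    progress = (t - warmup_steps) / (total_steps - warmup_steps)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


# Illustrative values only: 1000 steps, 100 of them warmup.
for step in (0, 50, 100, 550, 1000):
    print(step, warmup_cosine_lr(step, 1000, 100, lr_max=3e-4, lr_min=3e-5))
```

The two branches meet at `t == warmup_steps`, where both formulas give `lr_max`, so the schedule is continuous at the handover.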

1/√t decay (classical), defined for steps counted from 1:

\eta_t = \frac{\eta_0}{\sqrt{t}}, \quad t \geq 1
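
For comparison, the 1/√t rule is short enough to write directly. This is a sketch with a hypothetical function name, `inverse_sqrt_lr`, and it assumes a 1-based step index because the formula is undefined at $t = 0$.

```python
import math


def inverse_sqrt_lr(t, lr0):
    """Classical 1/sqrt(t) decay; t is a 1-based step index."""
    if t < 1:
        raise ValueError("t must be >= 1 (the formula is undefined at t = 0)")
    return lr0 / math.sqrt(t)


# Illustrative values only.
for step in (1, 4, 100, 10_000):
    print(step, inverse_sqrt_lr(step, lr0=0.1))
```

The decay is slow: cutting the LR by a factor of 10 takes 100 times as many steps, and the LR never reaches a fixed floor the way the cosine schedule reaches $\eta_{\min}$ at step $T$.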

Modern LLM training typically uses warmup followed by cosine decay, in some recipes with a constant-LR plateau in between (warmup → constant → cosine decay).

Loshchilov & Hutter, 2017, ICLR
