Legacy Concept Lab
Learning Rate Schedules: Warmup, Decay & Cycling
The LR schedule is one of the highest-leverage hyperparameters: the same model can fail or succeed depending on the schedule.
#47 | LR Schedules | Optimization
Key equation:
$$\eta_t = \eta_{\min} + \frac{1}{2}(\eta_{\max} - \eta_{\min})\left(1 + \cos\left(\frac{\pi t}{T}\right)\right)$$
Phase 3: Optimization & generalization | Concept 47 of 100
Why It Matters for Modern Models
- The LR schedule is one of the highest-leverage hyperparameters: the same model can fail or succeed depending on the schedule
- Warmup prevents early instability: Adam's moment estimates are unreliable in the first steps because they are built from very few gradients, so a large LR at that point can make training diverge
- Cosine decay with warmup is a common default for LLM pretraining (GPT, LLaMA, etc.)
What Tutorials Skip
What is still poorly explained in textbooks and papers:
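- Why warmup helps: gradients at random initialization are noisy and say little about a good long-term direction; small early steps give the optimizer's moment estimates time to stabilize
- Cosine vs linear decay: for the same start and end values, cosine stays closer to the peak LR during the first half of training (exploration), then drops faster through the middle and flattens out near the minimum (exploitation)
- The "LR range test" estimates a good peak LR by training briefly while increasing the LR until the loss spikes; the peak is then usually set somewhat below the LR at which the spike occurs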
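The LR range test described in the last bullet can be sketched on a toy problem. This is a minimal illustration under stated assumptions, not a reference implementation: the function name `lr_range_test`, the geometric LR growth, and the `spike_factor` stopping rule are all choices made here for illustration, and the quadratic loss stands in for a real model.

```python
import math

def lr_range_test(loss_and_grad, w0, lr_min=1e-4, lr_max=10.0,
                  num_steps=100, spike_factor=4.0):
    """Run plain gradient descent while growing the LR geometrically.

    Stops as soon as the loss is non-finite or exceeds spike_factor times
    the best loss seen so far (an assumed, hypothetical stopping rule).
    Returns (history, lr_at_spike); lr_at_spike is None if no spike occurred.
    """
    w = w0
    best = math.inf
    history = []  # list of (lr, loss) pairs, one per step
    growth = (lr_max / lr_min) ** (1.0 / (num_steps - 1))
    lr = lr_min
    for _ in range(num_steps):
        loss, grad = loss_and_grad(w)
        history.append((lr, loss))
        if not math.isfinite(loss) or loss > spike_factor * best:
            return history, lr  # LR reached when the spike was detected
        best = min(best, loss)
        w = w - lr * grad
        lr *= growth
    return history, None

def quadratic(w):
    # Toy loss L(w) = 0.5 * a * w^2 with curvature a = 10.
    # Gradient descent on it diverges once lr > 2 / a = 0.2.
    a = 10.0
    return 0.5 * a * w * w, a * w

history, lr_spike = lr_range_test(quadratic, w0=1.0)
print(f"loss spiked near lr = {lr_spike:.3f}")
```

On this toy loss the spike is detected a little above the stability limit of 0.2, which is why the chosen peak LR is normally set below the spike point rather than at it.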
Interactive Visualization
Core Math (Optional Deep Dive)
If you want intuition first, start with the key equation and the visualization. Come back here for the full walkthrough.
Key Equation
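Warmup ramps LR from 0 to the peak $\eta_{\max}$ over $T_w$ steps; in the common linear form:
$$\eta_t = \eta_{\max} \cdot \frac{t}{T_w}, \quad 0 \le t \le T_w$$
Cosine decay smoothly anneals from $\eta_{\max}$ to $\eta_{\min}$ over $T$ steps:
$$\eta_t = \eta_{\min} + \frac{1}{2}(\eta_{\max} - \eta_{\min})\left(1 + \cos\left(\frac{\pi t}{T}\right)\right)$$
1/√t decay (classical), with base rate $\eta_0$:
$$\eta_t = \frac{\eta_0}{\sqrt{t}}$$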
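The three schedules listed above can be written as small Python functions. This is a sketch under assumptions: the function and parameter names (`linear_warmup`, `warmup_steps`, `base_lr`, and so on) are illustrative, warmup is assumed to be linear, and the 1/√t schedule is assumed to count steps from 1.

```python
import math

def linear_warmup(step, warmup_steps, peak_lr):
    """LR rises linearly from 0 at step 0 to peak_lr at step warmup_steps."""
    return peak_lr * min(step, warmup_steps) / warmup_steps

def cosine_decay(step, total_steps, max_lr, min_lr=0.0):
    """Cosine annealing from max_lr at step 0 to min_lr at step total_steps."""
    progress = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

def inverse_sqrt_decay(step, base_lr):
    """Classical 1/sqrt(t) decay; steps below 1 are clamped to 1 to avoid dividing by zero."""
    return base_lr / math.sqrt(max(step, 1))

# Sample each schedule at a few steps of a 100-step run.
for step in (0, 25, 50, 100):
    print(step,
          linear_warmup(step, warmup_steps=10, peak_lr=1.0),
          cosine_decay(step, total_steps=100, max_lr=1.0),
          inverse_sqrt_decay(step, base_lr=1.0))
```

A quarter of the way through, `cosine_decay(25, 100, 1.0)` is about 0.85, whereas a straight line from 1.0 to 0.0 would already be at 0.75. That gap is the "more time at high LR" behaviour of cosine decay.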
Modern LLM training typically uses: warmup → constant → cosine decay.
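The three-phase shape in the sentence above can be sketched as one piecewise function. The name `warmup_constant_cosine`, its parameters, and the linear form of the warmup are assumptions made for illustration; real training codebases differ in how they split the phases.

```python
import math

def warmup_constant_cosine(step, warmup_steps, constant_steps, decay_steps,
                           peak_lr, min_lr=0.0):
    """Piecewise schedule: linear warmup, constant plateau, then cosine decay.

    decay_steps must be positive; steps past the end stay at min_lr.
    """
    if step < warmup_steps:
        # Phase 1: linear ramp from 0 up to peak_lr.
        return peak_lr * step / warmup_steps
    if step < warmup_steps + constant_steps:
        # Phase 2: hold the peak LR.
        return peak_lr
    # Phase 3: cosine anneal from peak_lr down to min_lr.
    t = min(step - warmup_steps - constant_steps, decay_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * t / decay_steps))

# Hypothetical 100-step run: 10 warmup, 20 constant, 70 decay steps.
schedule = [warmup_constant_cosine(s, 10, 20, 70, peak_lr=1e-3, min_lr=1e-5)
            for s in range(101)]
print(schedule[0], schedule[10], schedule[29], schedule[100])
```

Setting `constant_steps=0` recovers the plain warmup plus cosine schedule. To plug this into a framework scheduler that expects a multiplicative factor, such as PyTorch's `LambdaLR`, one option is to call it with `peak_lr=1.0` and set the optimizer's base LR to the intended peak.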