Legacy Concept Lab
Gradient Clipping & Explosion Prevention
Essential for training LLMs—without clipping, gradients explode on certain batches
#59 | Grad Clip | Optimization
key equation
\tilde{g} = \min(1, c / \|g\|) \cdot g
Phase 3: Optimization & generalization | Concept 59 of 100
Why It Matters for Modern Models
- Essential for training LLMs: without clipping, a single outlier batch can produce a gradient large enough to derail the run
- GPT-3 clipped the global gradient norm at 1.0; a clipping threshold is a standard hyperparameter in most LLM training recipes
- Exploding gradients also explain why very deep networks (100+ layers) require careful architectural choices
What Tutorials Skip
What is still poorly explained in textbooks and papers:
- Norm clipping preserves gradient direction while bounding step size, so you still go the right way; per-coordinate value clipping can change the direction
- Bad batches (outliers) cause gradient spikes; clipping prevents single batches from destabilizing training
- Gradient norm is a useful diagnostic: spikes often precede training instabilities
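The diagnostic in the last bullet can be sketched in a few lines. This is a minimal illustration, not a standard recipe: the function name `find_spikes`, the window of 5 steps, the factor of 3, and the sample norm history are all assumptions chosen for the example.

```python
from statistics import median

def find_spikes(norms, window=5, factor=3.0):
    """Return the step indices whose gradient norm exceeds `factor` times
    the median of the preceding `window` norms.
    `window` and `factor` are illustrative values, not tuned settings."""
    spikes = []
    for i in range(window, len(norms)):
        baseline = median(norms[i - window:i])
        if norms[i] > factor * baseline:
            spikes.append(i)
    return spikes

# Hypothetical per-step gradient norms with one outlier batch at step 7.
norm_history = [1.0, 1.1, 0.9, 1.0, 1.2, 1.0, 0.95, 9.0, 1.1, 1.0]
print(find_spikes(norm_history))  # [7]
```

A median baseline is used here so that one spike does not inflate the reference level for the steps that follow it.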
Interactive Visualization
Core Math (Optional Deep Dive)
If you want intuition first, start with the key equation and the visualization. Come back here for the full walkthrough.
Key Equation
Gradient norm clipping: scale the gradient down if its norm exceeds the threshold c:
\tilde{g} = \min(1, c / \|g\|) \cdot g
Value clipping (per-coordinate): clamp each coordinate g_i to the interval [-c, c]:
\tilde{g}_i = \max(-c, \min(c, g_i))
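The two clipping rules can be compared side by side. The sketch below uses plain Python lists instead of a framework's tensors; the helper names `clip_by_norm`, `clip_by_value`, and `cosine` are hypothetical and chosen for illustration.

```python
import math

def clip_by_norm(grad, c):
    """Rescale the whole gradient so its L2 norm is at most c."""
    norm = math.sqrt(sum(x * x for x in grad))
    scale = min(1.0, c / norm) if norm > 0 else 1.0
    return [scale * x for x in grad]

def clip_by_value(grad, c):
    """Clamp each coordinate independently to [-c, c]."""
    return [max(-c, min(c, x)) for x in grad]

def cosine(u, v):
    """Cosine similarity between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

g = [3.0, 4.0, 0.0]                # L2 norm is 5
by_norm = clip_by_norm(g, 1.0)     # approximately [0.6, 0.8, 0.0]
by_value = clip_by_value(g, 1.0)   # [1.0, 1.0, 0.0]

print(cosine(g, by_norm))   # very close to 1.0: direction preserved
print(cosine(g, by_value))  # below 1.0: direction changed
```

In this example norm clipping shrinks the vector along its original direction, while value clipping flattens the larger coordinates and tilts the update.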
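Why gradients explode: In deep networks, gradients are products of Jacobians:
\frac{\partial \mathcal{L}}{\partial h_0} = \frac{\partial \mathcal{L}}{\partial h_L} \cdot \frac{\partial h_L}{\partial h_{L-1}} \cdots \frac{\partial h_1}{\partial h_0}
If \|\partial h_l / \partial h_{l-1}\| > 1 consistently across layers, gradients can grow exponentially with depth.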
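A toy calculation makes the exponential growth concrete. The sketch below assumes every layer has the same 2x2 Jacobian, a rotation multiplied by a constant `scale`, so each backward step multiplies the gradient norm by exactly `scale`. Real networks have layer-dependent Jacobians; the function name `backprop_norm` and the chosen angle are illustrative only.

```python
import math

def backprop_norm(scale, depth, angle=0.3):
    """Norm of a unit gradient after it is passed back through `depth`
    layers, each with Jacobian J = scale * R(angle), R a 2x2 rotation."""
    c, s = math.cos(angle), math.sin(angle)
    g = [1.0, 0.0]
    for _ in range(depth):
        # One backward step: g <- J^T g
        g = [scale * (c * g[0] + s * g[1]),
             scale * (-s * g[0] + c * g[1])]
    return math.hypot(g[0], g[1])

for scale in (0.9, 1.0, 1.1):
    print(scale, backprop_norm(scale, depth=50))
# The norm equals scale ** depth up to rounding:
# it shrinks for 0.9, stays near 1 for 1.0, and grows past 100 for 1.1.
```

Because a rotation does not change length, the only source of growth or decay here is `scale`, which isolates the effect of the Jacobian norm on the product.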