Legacy Concept Lab

Gradient Clipping & Explosion Prevention

Essential for training LLMs—without clipping, gradients explode on certain batches


Why It Matters for Modern Models

  • Clipping is a standard safeguard in LLM training: without it, gradients on certain batches can explode and destabilize the run
  • GPT-3 clipped the global gradient norm at 1.0; the clipping threshold is a standard hyperparameter in most LLM training setups
  • The same exploding-gradient mechanism explains why very deep networks (100+ layers) require careful architectural choices
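
A minimal sketch of global-norm clipping, the variant the GPT-3 setting refers to: all parameter gradients are treated as one long vector and scaled by a single shared factor. The parameter names (`embed`, `head`), the helper `clip_by_global_norm`, and the numeric values are hypothetical and chosen only for illustration; real frameworks ship their own utilities for this.

```python
import math

def clip_by_global_norm(grads, max_norm=1.0):
    """Scale all gradients by one shared factor so their joint L2 norm is <= max_norm.

    `grads` maps a (hypothetical) parameter name to a flat list of floats.
    Returns the clipped gradients and the norm measured before clipping.
    """
    # Global norm: L2 norm of every gradient entry, as if concatenated.
    total_sq = sum(x * x for g in grads.values() for x in g)
    global_norm = math.sqrt(total_sq)
    # Equivalent to min(1, c / ||g||); the branch also avoids dividing by zero.
    scale = 1.0 if global_norm <= max_norm else max_norm / global_norm
    clipped = {name: [scale * x for x in g] for name, g in grads.items()}
    return clipped, global_norm

# Illustrative gradients for two parameter groups; joint norm is 5.0.
grads = {"embed": [3.0, 0.0], "head": [0.0, 4.0]}
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
print(norm_before)  # norm before clipping
print(clipped)      # every entry scaled by the same factor
```

Because one factor is shared across all parameters, the relative size of each parameter group's gradient is unchanged; only the overall step length shrinks.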

What Tutorials Skip

What is still poorly explained in textbooks and papers:

  • Norm clipping preserves gradient direction while bounding step size: you still move the right way, just not as far (per-coordinate value clipping, by contrast, can change the direction)
  • Bad batches (outliers) cause gradient spikes; clipping prevents single batches from destabilizing training
  • Gradient norm is a useful diagnostic: spikes often precede training instabilities
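
To make the diagnostic point concrete, here is a small sketch that flags spikes in a logged gradient-norm history. The rule used (norm exceeds a multiple of the median over the preceding steps), the function name `find_spikes`, the `window` and `factor` values, and the sample data are all assumptions for illustration, not an established criterion.

```python
import statistics

def find_spikes(norm_history, window=5, factor=3.0):
    """Return step indices whose gradient norm exceeds `factor` times the
    median of the preceding `window` steps. Both thresholds are illustrative."""
    spikes = []
    for step in range(window, len(norm_history)):
        baseline = statistics.median(norm_history[step - window:step])
        if norm_history[step] > factor * baseline:
            spikes.append(step)
    return spikes

# Made-up per-step gradient norms with one outlier batch at step 6.
norms = [1.0, 1.1, 0.9, 1.0, 1.2, 1.1, 9.5, 1.0, 1.1]
print(find_spikes(norms))  # → [6]
```

A median baseline is used here so that a single spike does not inflate the reference level for the steps that follow it.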

Interactive Visualization

Core Math (Optional Deep Dive)

If you want intuition first, start with the key equation and the visualization. Come back here for the full walkthrough.

Key Equation
$$\tilde{g} = \min(1, c / \|g\|) \cdot g$$

Gradient norm clipping: scale the gradient down whenever its norm exceeds the threshold $c$:

$$\tilde{g} = \begin{cases} g & \text{if } \|g\| \leq c \\ c \cdot \frac{g}{\|g\|} & \text{if } \|g\| > c \end{cases}$$
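
The two cases translate directly into code. This is a minimal sketch for a single gradient vector stored as a plain list of floats; the helper names `l2_norm` and `clip_by_norm` and the example vector are illustrative assumptions.

```python
import math

def l2_norm(g):
    """Euclidean norm of a gradient given as a list of floats."""
    return math.sqrt(sum(x * x for x in g))

def clip_by_norm(g, c):
    """Return min(1, c / ||g||) * g."""
    norm = l2_norm(g)
    if norm <= c:
        # First case: the gradient is already within the threshold.
        return list(g)
    # Second case: rescale so the result has norm c and the same direction.
    return [c * x / norm for x in g]

g = [6.0, 8.0]                    # ||g|| = 10
clipped = clip_by_norm(g, c=1.0)
print(clipped, l2_norm(clipped))  # components keep the 6:8 ratio, norm is about 1
```

Checking `norm <= c` first also covers the zero gradient, so the division is never reached with a zero norm.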

Value clipping (per-coordinate): $\tilde{g}_i = \text{clip}(g_i, -c, c)$
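
A short sketch of value clipping, with a cosine-similarity check showing that it can rotate the gradient when only some coordinates exceed the threshold. The function names and the example vector are assumptions for illustration.

```python
import math

def clip_by_value(g, c):
    """Clamp each coordinate of g independently to the interval [-c, c]."""
    return [max(-c, min(c, x)) for x in g]

def cosine(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

g = [6.0, 0.5]                   # dominated by the first coordinate
clipped = clip_by_value(g, c=1.0)
print(clipped)                   # → [1.0, 0.5]
print(cosine(g, clipped))        # below 1: the direction has changed
```

Only the first coordinate is clamped, so the ratio between the two components shifts from 12:1 to 2:1.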

Why gradients explode: In deep networks, gradients are products of Jacobians:

$$\frac{\partial L}{\partial W_1} = \frac{\partial L}{\partial h_L} \prod_{l=2}^{L} \frac{\partial h_l}{\partial h_{l-1}} \frac{\partial h_1}{\partial W_1}$$

If $\|\frac{\partial h_l}{\partial h_{l-1}}\| > 1$ consistently across layers, gradients can grow exponentially with depth.
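
A toy illustration of that exponential growth, under the strong simplifying assumption that every layer Jacobian acts as the same scalar gain. Real Jacobians are matrices that vary by layer and input, so this only sketches the trend; `backprop_norm` and the gain values are hypothetical.

```python
def backprop_norm(per_layer_gain, depth, upstream=1.0):
    """Gradient magnitude after multiplying by the Jacobian gains for
    l = 2..depth, assuming each layer contributes the same scalar gain."""
    norm = upstream
    for _ in range(2, depth + 1):
        norm *= per_layer_gain
    return norm

for gain in (0.9, 1.0, 1.1):
    # Gains below 1 shrink the gradient, gains above 1 blow it up.
    print(gain, backprop_norm(gain, depth=100))
```

With 100 layers, a gain only 10% above 1 already amplifies the gradient by more than four orders of magnitude, while a gain 10% below 1 shrinks it by a similar factor.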

Canonical Papers

On the difficulty of training Recurrent Neural Networks
Pascanu, Mikolov, Bengio. ICML, 2013.

Language Models are Few-Shot Learners
Brown et al. NeurIPS, 2020.
