Legacy Concept Lab

Gradient Clipping & Explosion Prevention

Essential for training LLMs—without clipping, gradients explode on certain batches

Concept 59 of 100 | Phase 3: Optimization & generalization

Why It Matters for Modern Models

Without clipping, gradients can explode on certain batches and derail an LLM training run
GPT-3 clipped the global gradient norm at 1.0; clipping is a standard hyperparameter in LLM training
Explains why very deep networks (100+ layers) need careful architectural choices, such as residual connections and normalization layers

What is still poorly explained in textbooks and papers:

Norm clipping preserves gradient direction while bounding step size, so the update still points the right way; per-coordinate value clipping does not guarantee this
Bad batches (outliers) cause gradient spikes; clipping prevents single batches from destabilizing training
Gradient norm is a useful diagnostic: spikes often precede training instabilities
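To make the diagnostic idea concrete, here is a minimal sketch of spike detection on a logged sequence of gradient norms. The function name `find_spikes`, the window of 5 steps, and the factor of 3 are illustrative assumptions, not values from any particular training recipe, and the norms are made-up numbers.

```python
import statistics

def find_spikes(grad_norms, window=5, factor=3.0):
    """Return indices whose norm exceeds `factor` times the median
    of the preceding `window` norms (hypothetical heuristic)."""
    spikes = []
    for i in range(window, len(grad_norms)):
        baseline = statistics.median(grad_norms[i - window:i])
        if grad_norms[i] > factor * baseline:
            spikes.append(i)
    return spikes

# Made-up per-step gradient norms with one outlier batch at step 6.
norms = [1.0, 1.1, 0.9, 1.0, 1.2, 1.0, 9.5, 1.1, 1.0, 0.9]
print(find_spikes(norms))  # prints [6]
```

A median baseline is used here because a single outlier barely moves it, so one bad batch does not mask the next one.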

If you want intuition first, start with the key equation and the visualization. Come back here for the full walkthrough.

Key Equation

$$\tilde{g} = \min(1, c / \|g\|) \cdot g$$

Gradient norm clipping: rescale the gradient $g$ when its norm exceeds the threshold $c > 0$:

$$\tilde{g} = \begin{cases} g & \text{if } \|g\| \leq c \\ c \cdot \frac{g}{\|g\|} & \text{if } \|g\| > c \end{cases}$$

Value clipping (per-coordinate): $\tilde{g}_i = \text{clip}(g_i, -c, c)$
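The two clipping rules can be sketched in a few lines of plain Python. This is an illustration on a flat list of floats, not a framework implementation; the helper names (`l2_norm`, `clip_by_norm`, `clip_by_value`, `cosine`) are chosen here for the example. The cosine similarity at the end shows that norm clipping keeps the direction, while value clipping can change it.

```python
import math

def l2_norm(g):
    """Euclidean norm of a gradient stored as a flat list of floats."""
    return math.sqrt(sum(x * x for x in g))

def clip_by_norm(g, c):
    """Norm clipping: rescale g so that its L2 norm is at most c."""
    norm = l2_norm(g)
    if norm <= c:
        return list(g)          # already within the threshold
    scale = c / norm            # equals min(1, c / ||g||) in this branch
    return [scale * x for x in g]

def clip_by_value(g, c):
    """Value clipping: clamp each coordinate to [-c, c] independently."""
    return [max(-c, min(c, x)) for x in g]

def cosine(u, v):
    """Cosine similarity between two nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (l2_norm(u) * l2_norm(v))

g = [3.0, 4.0]                  # ||g|| = 5
c = 1.0
g_norm = clip_by_norm(g, c)     # roughly [0.6, 0.8], norm 1
g_val = clip_by_value(g, c)     # [1.0, 1.0]

print(cosine(g, g_norm))        # 1 up to rounding: same direction
print(cosine(g, g_val))         # below 1: direction has changed
```

In a real model the norm is usually taken over all parameters together (a global norm), so every tensor is scaled by the same factor.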

Why gradients explode: In deep networks, gradients are products of Jacobians:

$$\frac{\partial L}{\partial W_1} = \frac{\partial L}{\partial h_L} \prod_{l=2}^{L} \frac{\partial h_l}{\partial h_{l-1}} \frac{\partial h_1}{\partial W_1}$$

If the layer Jacobian norms $\|\frac{\partial h_l}{\partial h_{l-1}}\|$ stay above 1 across layers, gradients can grow exponentially with depth; if they stay below 1, gradients shrink exponentially instead.
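A scalar toy case shows how quickly the product compounds. The sketch below assumes every layer has the same scalar Jacobian (a strong simplification; real Jacobians are matrices that vary by layer), and `backprop_gain` is a name made up for this example.

```python
def backprop_gain(per_layer_gain, depth):
    """Multiplier applied to a gradient after passing back through
    `depth` layers that each scale it by `per_layer_gain`."""
    gain = 1.0
    for _ in range(depth):
        gain *= per_layer_gain
    return gain

for per_layer in (0.9, 1.0, 1.1):
    print(per_layer, backprop_gain(per_layer, 100))
```

With 100 layers, a per-layer gain of 1.1 multiplies the gradient by more than 10,000, while 0.9 shrinks it by a factor of more than 10,000.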

Pascanu, Mikolov, Bengio (2013), ICML

Brown et al. (2020), NeurIPS
