Legacy Concept Lab
Residual Connections & Skip Connections
Residuals enable training 100+ layer networks by preventing vanishing gradients
#41 | Residuals | Core Training
Key equation: y = x + F(x)
Phase 2: Architecture fundamentals | Concept 41 of 100
Why It Matters for Modern Models
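- Residual connections make networks with 100+ layers trainable: the identity path lets gradients reach early layers without being attenuated by every intermediate transform
- The "residual stream" view is foundational to mechanistic interpretability: each attention and MLP block reads from a shared memory and additively writes back to it
- Logit lens and activation patching rely on residuals: without the shared additive stream, intermediate layers would have no stable, common representation to probe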
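The shared-memory picture can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not a transformer: `attn_like`, `mlp_like`, the `unembed` readout vector, and all numeric values are hypothetical stand-ins. It shows only that additive writes keep every intermediate state in the same vector space, so one fixed readout can be applied at any depth.

```python
# Toy residual stream: every component reads the current stream and adds its
# output back in. All component functions here are made up for illustration.

def attn_like(stream):
    # Hypothetical "attention" write: a scaled copy of the stream's mean.
    mean = sum(stream) / len(stream)
    return [0.1 * mean for _ in stream]

def mlp_like(stream):
    # Hypothetical "MLP" write: a small elementwise ReLU-style term.
    return [0.05 * max(v, 0.0) for v in stream]

def run_residual_stream(x0, components):
    """Return the stream state after each component, plus each individual write."""
    stream = list(x0)
    snapshots = [list(stream)]  # state before any component has run
    writes = []
    for component in components:
        delta = component(stream)                        # read
        stream = [s + d for s, d in zip(stream, delta)]  # additive write
        writes.append(delta)
        snapshots.append(list(stream))
    return snapshots, writes

x0 = [1.0, -2.0, 0.5, 3.0]
components = [attn_like, mlp_like, attn_like, mlp_like]
snapshots, writes = run_residual_stream(x0, components)

# The final stream is the input plus the sum of every write.
reconstructed = [x + sum(w[i] for w in writes) for i, x in enumerate(x0)]

# Logit-lens-style probe: the same (hypothetical) readout works on every snapshot
# because all of them live in the same space.
unembed = [0.5, -0.25, 1.0, 0.1]
lens = [sum(u * s for u, s in zip(unembed, snap)) for snap in snapshots]
```

Because each write is a plain addition, a single component's contribution can also be removed or swapped by editing its entry in `writes` and re-summing. That is the basic move behind activation patching.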
What Tutorials Skip
What is still poorly explained in textbooks and papers:
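- Residual networks are "perturbations of identity": each layer learns a small correction to its input rather than a full transform
- Deep networks with residuals behave like ensembles of shallower networks (unraveled view): a stack of n residual blocks expands into 2^n paths, each passing through a different subset of blocks
- Pre-norm vs post-norm changes where gradients flow: post-norm normalizes the sum x + F(x), so the skip path itself passes through the normalization, while pre-norm normalizes only the input to F and leaves the identity path untouched. Most modern transformers use pre-norm for training stability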
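The unraveled view can be checked numerically in the simplest possible setting. The sketch below assumes scalar, linear blocks F_i(x) = a_i * x, with hypothetical gain values chosen only for illustration. In that case the stacked network and the sum over all paths agree exactly. With nonlinear blocks the path picture still applies, but it is no longer a simple product expansion.

```python
from itertools import combinations
from math import prod

# Hypothetical per-block gains: block i computes F_i(x) = gains[i] * x.
# Small values match the "perturbation of identity" picture.
gains = [0.3, -0.2, 0.5]

def residual_stack(x, gains):
    """Apply y = x + a * x once per block, in order."""
    for a in gains:
        x = x + a * x
    return x

def unraveled_paths(x, gains):
    """Sum over all 2^n paths; each path goes through one subset of the blocks."""
    total = 0.0
    n_paths = 0
    for k in range(len(gains) + 1):
        for subset in combinations(gains, k):
            # Blocks outside the subset are bypassed via the identity connection.
            # prod(()) is 1, which is the pure skip path.
            total += prod(subset) * x
            n_paths += 1
    return total, n_paths

x = 2.0
stacked = residual_stack(x, gains)
path_sum, n_paths = unraveled_paths(x, gains)
```

Here `n_paths` is 2 ** 3 = 8: one pure-identity path, three paths through a single block, three through two blocks, and one through all three. When the gains are small, the short paths dominate the sum, which is the sense in which the deep network behaves like an ensemble of shallower ones.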
Interactive Visualization
Core Math (Optional Deep Dive)
If you want intuition first, start with the key equation and the visualization. Come back here for the full walkthrough.
Key Equation
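A residual block transforms its input by learning a perturbation F(x) that is added to the identity:
$$y = x + F(x)$$
In transformers, stacking these blocks creates the residual stream. Writing $x_l$ for the stream entering layer $l$ and $F_l$ for that layer's attention or MLP sublayer:
$$x_{l+1} = x_l + F_l(x_l), \qquad x_L = x_0 + \sum_{l=0}^{L-1} F_l(x_l)$$
Gradient flow through L layers follows from the chain rule applied to the same recurrence (factors for later layers multiply on the left):
$$\frac{\partial x_L}{\partial x_0} = \prod_{l=0}^{L-1} \left( I + \frac{\partial F_l}{\partial x_l} \right)$$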
The identity term ensures gradients always have a direct path backward.
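To make the effect of the identity term concrete, here is a minimal scalar sketch. It assumes every layer has the same small local derivative F_l' = 0.01 across 100 layers. Both numbers are arbitrary illustrative choices, and the function names are hypothetical. It compares the end-to-end derivative with and without the skip connection.

```python
from math import prod

def plain_gradient(local_derivs):
    """d x_L / d x_0 for x_{l+1} = F_l(x_l): product of the local derivatives."""
    return prod(local_derivs)

def residual_gradient(local_derivs):
    """d x_L / d x_0 for x_{l+1} = x_l + F_l(x_l): product of (1 + F_l')."""
    return prod(1.0 + d for d in local_derivs)

# Assumed for illustration: 100 layers, each with local derivative 0.01.
local_derivs = [0.01] * 100

g_plain = plain_gradient(local_derivs)        # 0.01 ** 100, vanishingly small
g_residual = residual_gradient(local_derivs)  # 1.01 ** 100, order 1
```

In this setup the plain chain collapses to a value near 1e-200, while the residual chain stays of order 1. The identity term does not bound the gradient on its own: if the local derivatives are consistently large or close to -1, the residual product can still grow or shrink, which is one reason normalization placement matters.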