Legacy Concept Lab

Residual Connections & Skip Connections

Residual connections make 100+ layer networks trainable by mitigating vanishing gradients

Concept 41 of 100 | Core Training | Phase 2


Key equation: y = x + F(x)

Phase 2: Architecture fundamentals

Why It Matters for Modern Models

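Residual connections make networks with 100+ layers trainable: the identity path carries gradients back to early layers even when a block's own Jacobian is small
The "residual stream" view is foundational to mechanistic interpretability: each attention head and MLP reads from a shared vector and adds its output back to it, so the stream acts as a shared memory
The logit lens depends on residuals, because it decodes intermediate layers with the final unembedding and that only makes sense if every layer shares one representation; activation patching is also easier to interpret when each component's contribution is an additive write to that stream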

What is still poorly explained in textbooks and papers:

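Residual networks are "perturbations of identity": each layer learns a small correction to its input rather than a full transform, so a block whose branch outputs zero is exactly the identity
Deep networks with residuals behave like ensembles of shallower networks (the unraveled view): each block can be either skipped or traversed, so n blocks give 2^n paths of varying length
Pre-norm vs post-norm changes where gradients flow: pre-norm normalizes inside the branch, y = x + F(Norm(x)), leaving the identity path untouched, while post-norm normalizes after the addition, y = Norm(x + F(x)). Most modern transformers use pre-norm for training stability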
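
The unraveled view can be checked by hand in a toy setting. The sketch below assumes scalar, linear residual blocks y = x + a * x, which is a simplification chosen only so that the path decomposition is exact; the names `residual_stack`, `unraveled_sum`, and the values in `gains` are hypothetical. Real blocks are nonlinear, so there the decomposition is an interpretation rather than an identity.

```python
from itertools import combinations
from math import prod, isclose


def residual_stack(x, gains):
    """Apply scalar residual blocks y = x + a * x one after another."""
    for a in gains:
        x = x + a * x  # identity path plus the block's correction
    return x


def unraveled_sum(x, gains):
    """Sum the contribution of every path through the stack.

    A path either skips a block (factor 1) or passes through it
    (factor a), so n blocks give 2**n paths in total.
    """
    total = 0.0
    n_paths = 0
    for k in range(len(gains) + 1):
        for subset in combinations(gains, k):
            total += prod(subset) * x  # prod(()) == 1 is the all-skip path
            n_paths += 1
    return total, n_paths


gains = [0.1, -0.05, 0.2]  # hypothetical small corrections
x = 2.0
stacked = residual_stack(x, gains)
summed, n_paths = unraveled_sum(x, gains)
print(n_paths, isclose(stacked, summed))
```

Because every gain is small, paths that pass through many blocks contribute little, which is one way to read the claim that short paths dominate.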

If you want intuition first, start with the key equation and the visualization. Come back here for the full walkthrough.

Key Equation

y = x + F(x)

A residual block transforms its input $x$ by adding a learned perturbation $F$ with weights $\{W_i\}$:

y = x + F(x, \{W_i\})
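
A minimal sketch of this block, assuming vectors are plain Python lists and that $F$ preserves the shape of $x$ so the addition is defined. The branch functions `small_correction` and `zero_branch` are hypothetical stand-ins for a learned $F$, not anything taken from He et al.

```python
def residual_block(x, f):
    """Return y = x + F(x) for a vector x given as a list of floats."""
    fx = f(x)
    if len(fx) != len(x):
        raise ValueError("F must preserve the shape of x for the skip addition")
    return [xi + fi for xi, fi in zip(x, fx)]


def small_correction(x):
    """Hypothetical F: a tiny elementwise perturbation."""
    return [0.01 * xi * xi for xi in x]


def zero_branch(x):
    """Hypothetical F that outputs zero, making the block the identity."""
    return [0.0 for _ in x]


x = [1.0, -2.0, 3.0]
y_identity = residual_block(x, zero_branch)
y_perturbed = residual_block(x, small_correction)
print(y_identity)
print(y_perturbed)
```

With `zero_branch` the block returns its input unchanged, which is why a residual block can start out as the identity and only gradually learn a correction.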

In transformers, this creates the residual stream:

h^{(l+1)} = h^{(l)} + \text{Attn}(h^{(l)}) + \text{MLP}(h^{(l)} + \text{Attn}(h^{(l)}))
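
The layer update above can be sketched with toy components to show the "shared memory" reading: the final stream equals the initial embedding plus the sum of every write. This is only an illustration under loud assumptions: `toy_attn` and `toy_mlp` are hypothetical stand-ins, the stream is a single vector rather than one vector per token, and no normalization is applied.

```python
def add(u, v):
    """Elementwise sum of two equal-length vectors."""
    return [a + b for a, b in zip(u, v)]


def toy_attn(h):
    """Hypothetical stand-in for attention: writes a scaled mean to every slot."""
    mean = sum(h) / len(h)
    return [0.1 * mean for _ in h]


def toy_mlp(h):
    """Hypothetical stand-in for the MLP: a scaled ReLU."""
    return [0.1 * max(v, 0.0) for v in h]


def transformer_layer(h, writes):
    """One layer: h + Attn(h) + MLP(h + Attn(h)), logging each write."""
    attn_out = toy_attn(h)
    mid = add(h, attn_out)   # h + Attn(h)
    mlp_out = toy_mlp(mid)   # MLP(h + Attn(h))
    writes.extend([attn_out, mlp_out])
    return add(mid, mlp_out)


h0 = [1.0, -1.0, 2.0]
writes = []
h = h0
for _ in range(3):
    h = transformer_layer(h, writes)

# Rebuild the final stream from the embedding and the logged writes alone.
rebuilt = h0
for w in writes:
    rebuilt = add(rebuilt, w)

print(len(writes), h, rebuilt)
```

Since every component only adds to the stream, each write can be inspected or replaced separately.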

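Gradient flow through $L$ layers, writing $\mathcal{L}$ for the loss and $h^{(l)} = h^{(l-1)} + F^{(l)}(h^{(l-1)})$:

\frac{\partial \mathcal{L}}{\partial h^{(0)}} = \frac{\partial \mathcal{L}}{\partial h^{(L)}} \cdot \prod_{l=L}^{1} \left( I + \frac{\partial F^{(l)}}{\partial h^{(l-1)}} \right)

The factors are ordered from $l = L$ on the left to $l = 1$ on the right. Expanding the product gives $I + \sum_{l=1}^{L} \frac{\partial F^{(l)}}{\partial h^{(l-1)}}$ plus higher-order cross terms, so the sum alone is only a first-order approximation.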

The identity term $I$ ensures gradients always have a direct path backward.
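
A small numeric sketch of the effect, assuming a scalar network whose branch is 0.5 * tanh(h); the functions `branch`, `branch_grad`, and `input_gradient` are hypothetical and chosen only to make the chain rule easy to follow. It compares the gradient of the last activation with respect to the input for a plain stack and a residual stack of the same depth.

```python
import math


def branch(h, w=0.5):
    """Hypothetical branch F(h) = w * tanh(h)."""
    return w * math.tanh(h)


def branch_grad(h, w=0.5):
    """Derivative of the branch: w * (1 - tanh(h)^2)."""
    return w * (1.0 - math.tanh(h) ** 2)


def input_gradient(h0, depth, residual):
    """Return d h_depth / d h_0 via the chain rule, one factor per layer."""
    h, grad = h0, 1.0
    for _ in range(depth):
        local = branch_grad(h)
        if residual:
            grad *= 1.0 + local  # identity term plus the branch derivative
            h = h + branch(h)
        else:
            grad *= local        # branch derivative only
            h = branch(h)
    return grad


plain = input_gradient(0.3, depth=50, residual=False)
resid = input_gradient(0.3, depth=50, residual=True)
print(f"plain: {plain:.3e}  residual: {resid:.3e}")
```

In this toy every plain factor is at most 0.5, so the product collapses with depth, while every residual factor is at least 1. That lower bound comes from the branch derivative being non-negative here; in general the identity term provides a direct path but does not by itself bound the gradient.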

He et al., 2016, CVPR
