Legacy Concept Lab

Residual Connections & Skip Connections

Residuals enable training 100+ layer networks by preventing vanishing gradients

Concept 41 of 100 · Core Training · Phase 2: Architecture fundamentals
Key equation: y = x + F(x)
Migrated: view the updated version in /domains. This /foundations page is legacy during migration.

Why It Matters for Modern Models

  • Residual connections make networks with 100+ layers trainable by mitigating vanishing gradients
  • The "residual stream" view is foundational to mechanistic interpretability—each component writes to a shared memory
  • Without residuals, the logit lens and activation patching would lose their footing: there would be no stable representation, shared across layers, to probe or intervene on

What Tutorials Skip

What is still poorly explained in textbooks and papers:

  • Residual networks are "perturbations of identity"—each layer learns to make small corrections rather than full transforms
  • Deep networks with residuals behave like ensembles of shallower networks (unraveled view)
  • Pre-norm vs post-norm changes where gradients flow: pre-norm keeps the skip path free of normalization, and most modern transformers use it for training stability
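
The pre-norm vs post-norm distinction in the last bullet can be made concrete with a minimal sketch. Everything here is an illustrative assumption: `sublayer` is a hypothetical stand-in for attention or an MLP, and `layer_norm` omits the learned scale and shift. The point is only where the normalization sits relative to the skip path.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned scale/shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def sublayer(x):
    """Hypothetical stand-in for attention or an MLP; output is bounded by 0.5."""
    return [0.5 * math.tanh(v) for v in x]

def pre_norm_block(x):
    # x + F(LN(x)): the skip path carries x forward untouched.
    return [xi + fi for xi, fi in zip(x, sublayer(layer_norm(x)))]

def post_norm_block(x):
    # LN(x + F(x)): the normalization sits on the skip path itself.
    return layer_norm([xi + fi for xi, fi in zip(x, sublayer(x))])

x = [10.0, -4.0, 3.0, 7.0]
pre, post = x, x
for _ in range(4):
    pre = pre_norm_block(pre)
    post = post_norm_block(post)

print("input:    ", x)
print("pre-norm: ", [round(v, 3) for v in pre])
print("post-norm:", [round(v, 3) for v in post])
```

In this toy setup the pre-norm output stays close to the input, because each block only adds a bounded correction, while the post-norm output is rescaled at every block, so the original signal no longer passes through unchanged.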

Interactive Visualization

Core Math (Optional Deep Dive)

If you want intuition first, start with the key equation and the visualization. Come back here for the full walkthrough.

Key Equation
$$y = x + F(x)$$

A residual block transforms its input $x$ by learning a perturbation:

$$y = x + F(x, \{W_i\})$$
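
As a minimal sketch of this equation, the block below assumes a particular form for the learned branch, $F(x) = W_2 \tanh(W_1 x)$. That form and the weight values are illustrative choices, not taken from the source. With the output weights set to zero the block is exactly the identity, and with small weights it is a small correction to the input.

```python
import math

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def residual_block(x, W1, W2):
    """Compute y = x + F(x), with the assumed form F(x) = W2 @ tanh(W1 @ x)."""
    hidden = [math.tanh(v) for v in matvec(W1, x)]
    fx = matvec(W2, hidden)
    return [xi + fi for xi, fi in zip(x, fx)]

x = [1.0, -2.0, 0.5]
W1 = [[0.1, 0.0, -0.2], [0.3, 0.1, 0.0]]                 # 2x3 input weights
W2_zero = [[0.0, 0.0] for _ in range(3)]                 # 3x2 zeros: F(x) = 0
W2_small = [[0.01, -0.02], [0.0, 0.03], [0.02, 0.01]]    # 3x2 small weights

y_identity = residual_block(x, W1, W2_zero)     # block reduces to the identity
y_perturbed = residual_block(x, W1, W2_small)   # input plus a small correction

print(y_identity)
print(y_perturbed)
```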

In transformers, this creates the residual stream:

$$h^{(l+1)} = h^{(l)} + \text{Attn}(h^{(l)}) + \text{MLP}(h^{(l)} + \text{Attn}(h^{(l)}))$$
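
One way to read this update is that every sublayer adds ("writes") a vector to a running sum. The sketch below uses hypothetical stand-ins, `toy_attn` and `toy_mlp`, in place of real attention and MLP sublayers, and omits normalization as the equation above does. It records each write and checks that the final state is the initial state plus the sum of all writes.

```python
def toy_attn(h):
    """Hypothetical stand-in for attention: pulls each entry toward the mean."""
    mean = sum(h) / len(h)
    return [0.1 * (mean - v) for v in h]

def toy_mlp(h):
    """Hypothetical stand-in for an MLP: a scaled ReLU."""
    return [0.05 * max(v, 0.0) for v in h]

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def layer(h, writes):
    """One layer: h + Attn(h) + MLP(h + Attn(h)), recording both writes."""
    a = toy_attn(h)        # Attn(h^{(l)})
    mid = add(h, a)        # h^{(l)} + Attn(h^{(l)})
    m = toy_mlp(mid)       # MLP(h^{(l)} + Attn(h^{(l)}))
    writes.extend([a, m])
    return add(mid, m)     # h^{(l+1)}

h0 = [1.0, -1.0, 2.0, 0.0]
h, writes = h0, []
for _ in range(3):
    h = layer(h, writes)

# The final state is the initial state plus every sublayer's contribution.
reconstructed = h0
for w in writes:
    reconstructed = add(reconstructed, w)

print(h)
print(reconstructed)
```

Because the final state decomposes into a sum of per-component contributions, each contribution can be inspected or replaced on its own.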

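Gradient flow through $L$ layers, writing $\mathcal{L}$ for the loss and $F^{(l)}$ for the residual branch of layer $l$. Unrolling the recurrence gives $h^{(L)} = h^{(0)} + \sum_{l=1}^{L} F^{(l)}(h^{(l-1)})$, so:

$$\frac{\partial \mathcal{L}}{\partial h^{(0)}} = \frac{\partial \mathcal{L}}{\partial h^{(L)}} \cdot \left( I + \frac{\partial}{\partial h^{(0)}} \sum_{l=1}^{L} F^{(l)}(h^{(l-1)}) \right)$$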

The identity term $I$ ensures gradients always have a direct path backward.
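
The effect of the identity term can be checked numerically on a toy case. The sketch below assumes a scalar network with $F^{(l)}(h) = w_l \tanh(h)$ and small alternating weights. These are illustrative choices, not a claim about any real architecture. It multiplies the per-layer Jacobian factors with and without the skip path.

```python
import math

def forward_with_grad(h0, weights, residual):
    """Run a scalar toy network and return (h_L, d h_L / d h_0)."""
    h, grad = h0, 1.0
    for w in weights:
        t = math.tanh(h)
        local = w * (1.0 - t * t)   # dF/dh for the assumed F(h) = w * tanh(h)
        if residual:
            h = h + w * t           # h_l = h_{l-1} + F(h_{l-1})
            grad *= 1.0 + local     # per-layer factor: identity plus dF/dh
        else:
            h = w * t               # plain layer, no skip path
            grad *= local           # per-layer factor: dF/dh only
    return h, grad

# 50 layers with small weights of alternating sign (an arbitrary choice).
weights = [0.05 if l % 2 == 0 else -0.05 for l in range(50)]

_, plain_grad = forward_with_grad(1.0, weights, residual=False)
_, residual_grad = forward_with_grad(1.0, weights, residual=True)

print(f"plain gradient:    {plain_grad:.3e}")
print(f"residual gradient: {residual_grad:.3e}")
```

In the plain network every factor has magnitude at most 0.05, so the product over 50 layers collapses toward zero. With the skip path every factor lies between 0.95 and 1.05, so the gradient stays of order one.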

Canonical Papers

Deep Residual Learning for Image Recognition

He et al., 2016, CVPR
