Legacy Concept Lab

Backpropagation & Automatic Differentiation

Concept 35 of 100 · Optimization · Phase 3

Why It Matters for Modern Models

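Every modern neural network is trained with backprop; it is the core algorithm that makes gradient-based learning practical.
Backprop must keep the activations from the forward pass until the backward pass has used them. This memory cost is why activation checkpointing exists and why large models rely on gradient accumulation to reach bigger effective batch sizes.
The forward/backward asymmetry explains why inference is cheap but training is expensive: inference needs only the forward pass and can discard activations as it goes, while training adds a backward pass and has to keep them.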

What is still poorly explained in textbooks and papers:

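Reverse-mode is the efficient choice when outputs << inputs (typical in ML: one scalar loss, millions of parameters). A single backward pass yields the gradient with respect to every parameter, whereas forward-mode would require one pass per parameter.
In PyTorch the computation graph is built dynamically during the forward pass, and each recorded operation keeps references to the tensors its backward step needs. Under torch.no_grad() nothing is recorded, so those activations can be freed immediately; this is why it saves memory, not just compute.
Vanishing/exploding gradients arise from repeated multiplication of Jacobians. Residual connections mitigate vanishing by adding the identity to each layer's Jacobian, so the product always contains a direct path that is not scaled down.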
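The last point can be checked numerically. The sketch below is only an illustration under assumed choices that are not in the source: a scalar chain of 50 tanh layers sharing one weight of 0.5, and a hypothetical helper `grad_through_chain`. It multiplies the per-layer Jacobians with and without a skip path.

```python
import math

def grad_through_chain(depth, weight, x0, residual):
    """Return dh_depth/dh_0 for a scalar chain of tanh layers.

    Plain layer:    h_next = tanh(w * h)      -> local Jacobian w * (1 - tanh^2)
    Residual layer: h_next = h + tanh(w * h)  -> local Jacobian 1 + w * (1 - tanh^2)
    """
    h = x0
    jacobian_product = 1.0
    for _ in range(depth):
        t = math.tanh(weight * h)
        local = weight * (1.0 - t * t)   # derivative of tanh(w * h) w.r.t. h
        if residual:
            local += 1.0                 # the skip path contributes the identity
            h = h + t
        else:
            h = t
        jacobian_product *= local
    return jacobian_product

# Assumed toy settings: 50 layers, shared weight 0.5, input 1.0.
plain = grad_through_chain(depth=50, weight=0.5, x0=1.0, residual=False)
skip = grad_through_chain(depth=50, weight=0.5, x0=1.0, residual=True)
print(f"plain chain:    {plain:.3e}")
print(f"residual chain: {skip:.3e}")
```

In this toy setting every plain factor is at most 0.5, so the product collapses toward zero, while every residual factor is at least 1. The identity term does not by itself prevent the product from growing, so this illustrates mitigation of vanishing gradients rather than a general guarantee.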

If you want intuition first, start with the key equation and the visualization. Come back here for the full walkthrough.

Key Equation

$$\delta_\ell = \delta_{\ell+1} \cdot \frac{\partial h_{\ell+1}}{\partial h_\ell}$$

The chain rule computes gradients efficiently via reverse-mode autodiff:

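For a composite function $f = f_L \circ f_{L-1} \circ \cdots \circ f_1$ with intermediate activations $h_\ell = f_\ell(h_{\ell-1}; \theta_\ell)$ and input $h_0$ (note that $L$ is overloaded: it is the depth in $f_L$ and $h_L$, and the scalar loss in $\partial L$):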

$$\frac{\partial L}{\partial \theta_\ell} = \frac{\partial L}{\partial h_L} \cdot \frac{\partial h_L}{\partial h_{L-1}} \cdots \frac{\partial h_{\ell+1}}{\partial h_\ell} \cdot \frac{\partial h_\ell}{\partial \theta_\ell}$$
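As a worked instance of this product, consider a depth-two scalar chain. The specific layers and loss here are assumptions chosen for illustration, not taken from the source: $h_1 = \tanh(\theta_1 x)$, $h_2 = \theta_2 h_1$, and a squared-error loss against a target $y$.

```latex
% Illustrative two-layer scalar chain (assumed, not from the source):
%   h_1 = \tanh(\theta_1 x), \quad h_2 = \theta_2 h_1, \quad L = \tfrac{1}{2}(h_2 - y)^2
\begin{aligned}
\frac{\partial L}{\partial \theta_2}
  &= \frac{\partial L}{\partial h_2} \cdot \frac{\partial h_2}{\partial \theta_2}
   = (h_2 - y)\, h_1, \\
\frac{\partial L}{\partial \theta_1}
  &= \frac{\partial L}{\partial h_2} \cdot \frac{\partial h_2}{\partial h_1}
     \cdot \frac{\partial h_1}{\partial \theta_1}
   = (h_2 - y)\, \theta_2\, (1 - h_1^2)\, x.
\end{aligned}
```

Both gradients begin with the same factor $\partial L / \partial h_2 = h_2 - y$. Computing that shared prefix once and reusing it for every earlier layer is what keeps the cost of all the gradients together close to the cost of a single one.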

The backward pass propagates sensitivities:

$$\delta_\ell = \frac{\partial L}{\partial h_\ell} = \delta_{\ell+1} \cdot \frac{\partial h_{\ell+1}}{\partial h_\ell}$$
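The recursion can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the scalar tanh layers, the squared-error loss, and the helper names `forward`, `loss_fn`, and `backward` are all assumptions made for the example. The result is compared against central finite differences as an independent check.

```python
import math

def forward(thetas, x):
    """Run h_l = tanh(theta_l * h_{l-1}) and keep every activation."""
    activations = [x]                    # activations[0] is the input h_0
    for theta in thetas:
        activations.append(math.tanh(theta * activations[-1]))
    return activations

def loss_fn(thetas, x, y):
    """Assumed loss for the sketch: 0.5 * (h_L - y)^2."""
    return 0.5 * (forward(thetas, x)[-1] - y) ** 2

def backward(thetas, activations, y):
    """Propagate delta_l = dLoss/dh_l from the last layer back to the first."""
    grads = [0.0] * len(thetas)
    delta = activations[-1] - y          # delta at the output, dLoss/dh_L
    for l in range(len(thetas), 0, -1):
        h_prev, h_curr = activations[l - 1], activations[l]
        dtanh = 1.0 - h_curr * h_curr    # tanh'(z) written via the stored output
        grads[l - 1] = delta * dtanh * h_prev   # dLoss/dtheta_l
        delta = delta * dtanh * thetas[l - 1]   # delta_{l-1} = delta_l * dh_l/dh_{l-1}
    return grads

# Assumed toy parameters and data.
thetas = [0.8, -1.2, 0.5, 1.5]
x, y = 0.7, 0.3
grads = backward(thetas, forward(thetas, x), y)

# Central finite differences, one pair of forward passes per parameter.
eps = 1e-6
numeric = []
for i in range(len(thetas)):
    plus = thetas[:i] + [thetas[i] + eps] + thetas[i + 1:]
    minus = thetas[:i] + [thetas[i] - eps] + thetas[i + 1:]
    numeric.append((loss_fn(plus, x, y) - loss_fn(minus, x, y)) / (2 * eps))

max_err = max(abs(a - n) for a, n in zip(grads, numeric))
print("backprop gradients:", grads)
print("max abs difference vs finite differences:", max_err)
```

The backward loop visits each layer once and carries a single running `delta`, whereas the finite-difference check needs two forward passes per parameter.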

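Computational cost: one forward pass plus one backward pass. The backward pass costs a small constant multiple of the forward pass (roughly 2× for dense layers, since each layer needs gradients with respect to both its input and its parameters), so a training step is about 3× a forward pass, i.e. O(forward).
Memory cost: must store all intermediate activations $h_1, \ldots, h_L$ until the backward pass has consumed them, so activation memory grows linearly with depth.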
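The memory cost can be traded for extra compute by recomputing activations, which is the idea behind activation checkpointing. The sketch below is a simplified illustration under assumed choices (scalar tanh layers, a squared-error loss, hypothetical helper names, and a checkpoint every 4 layers); it is not how any particular framework implements the technique. It stores only a few activations, recomputes each segment during the backward pass, and compares the result with full storage.

```python
import math

def layer(theta, h):
    return math.tanh(theta * h)

def backward_segment(thetas, acts, delta, start, grads):
    """Backprop through the layers whose activations are in `acts`.

    acts[j] holds h_{start+j}; `delta` enters as dLoss/dh at the end of the
    segment and is returned as dLoss/dh_start.
    """
    for j in range(len(acts) - 1, 0, -1):
        dtanh = 1.0 - acts[j] * acts[j]
        idx = start + j - 1              # position of this layer's theta
        grads[idx] = delta * dtanh * acts[j - 1]
        delta = delta * dtanh * thetas[idx]
    return delta

def grad_full(thetas, x, y):
    """Baseline: keep every activation, then run one backward sweep."""
    acts = [x]
    for t in thetas:
        acts.append(layer(t, acts[-1]))
    grads = [0.0] * len(thetas)
    backward_segment(thetas, acts, acts[-1] - y, 0, grads)
    return grads, len(acts)

def grad_checkpointed(thetas, x, y, every):
    """Keep only h_0, h_every, h_2*every, ... and recompute the rest."""
    checkpoints = {0: x}
    h = x
    for i, t in enumerate(thetas, start=1):
        h = layer(t, h)
        if i % every == 0 and i < len(thetas):
            checkpoints[i] = h
    delta = h - y                        # dLoss/dh_L for 0.5 * (h_L - y)^2
    grads = [0.0] * len(thetas)
    peak, end = 0, len(thetas)
    for start in sorted(checkpoints, reverse=True):
        acts = [checkpoints[start]]      # recompute this segment's activations
        for t in thetas[start:end]:
            acts.append(layer(t, acts[-1]))
        peak = max(peak, len(checkpoints) + len(acts) - 1)
        delta = backward_segment(thetas, acts, delta, start, grads)
        end = start
    return grads, peak

# Assumed toy parameters and data: 12 layers, checkpoint every 4.
thetas = [0.9, -1.1, 0.7, 1.3, -0.8, 1.2, 0.6, -1.4, 1.0, 0.5, -0.9, 1.1]
x, y = 0.7, 0.3
full_grads, full_stored = grad_full(thetas, x, y)
ckpt_grads, ckpt_peak = grad_checkpointed(thetas, x, y, every=4)
max_diff = max(abs(a - b) for a, b in zip(full_grads, ckpt_grads))
print("activations held, full storage:", full_stored)
print("peak activations held, checkpointed:", ckpt_peak)
print("max gradient difference:", max_diff)
```

The gradients match because the recomputed activations are the same values the forward pass produced. The price is roughly one additional forward pass, since each segment is run a second time during the backward sweep.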

Rumelhart, Hinton & Williams (1986), Nature

Baydin et al. (2018), JMLR
