Legacy Concept Lab
Backpropagation & Automatic Differentiation
Every modern neural network is trained via backprop—it is the fundamental algorithm enabling gradient-based learning
Concept 35 of 100 · Phase 3: Optimization & generalization · Tags: Backprop, Optimization
Key equation:
$$\delta_\ell = \delta_{\ell+1} \cdot \frac{\partial h_{\ell+1}}{\partial h_\ell}$$
Why It Matters for Modern Models
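- Backprop must keep forward activations until the backward pass consumes them; this memory cost is why activation checkpointing exists and why large models often rely on gradient accumulation
- The forward/backward asymmetry explains why inference is cheap relative to training: inference needs only a forward pass and can discard activations as it goes, while training adds the backward pass and the stored activations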
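As a rough illustration of the gradient accumulation point, the sketch below uses a hypothetical one-parameter linear model `y = w * x` with mean squared error, written in plain Python. It suggests that summing suitably weighted micro-batch gradients reproduces the full-batch gradient, so only one micro-batch of activations has to be live at a time. The function names and the data are assumptions made up for this example.

```python
def mse_grad(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to w over one batch."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def accumulated_grad(w, xs, ys, micro_batch_size):
    """Same gradient, computed one micro-batch at a time."""
    n = len(xs)
    total = 0.0
    for start in range(0, n, micro_batch_size):
        mx = xs[start:start + micro_batch_size]
        my = ys[start:start + micro_batch_size]
        # Weight each micro-batch mean by its share of the full batch.
        total += mse_grad(w, mx, my) * len(mx) / n
    return total


# Illustrative data only.
xs = [0.5, -1.0, 2.0, 1.5, -0.25, 3.0]
ys = [1.0, -2.5, 3.5, 3.0, 0.0, 6.5]

full = mse_grad(0.8, xs, ys)
accum = accumulated_grad(0.8, xs, ys, micro_batch_size=2)
print(full, accum)  # the two values should agree up to rounding
```

The weighting by `len(mx) / n` matters when the last micro-batch is smaller than the others; without it the accumulated result would be a mean of means rather than the full-batch mean.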
What Tutorials Skip
What is still poorly explained in textbooks and papers:
- Reverse mode is the efficient choice when outputs << parameters (typical in ML: one scalar loss, many parameters); forward mode would require one pass per parameter
- PyTorch builds the computation graph dynamically during the forward pass and saves the tensors needed for backward; this is why torch.no_grad() saves memory, not just compute
- Vanishing/exploding gradients arise from repeated multiplication of Jacobians; residual connections mitigate vanishing gradients by adding an identity term to each layer's Jacobian
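The Jacobian-product point can be sketched with a scalar chain, where each "Jacobian" is just a number. This is a minimal illustration, assuming a hypothetical layer `w * tanh(h)` with `w = 0.5` and a depth of 30; the function name and parameters are invented for the example and are not taken from any library.

```python
import math


def chain_gradient(depth, w, h0, residual):
    """Return d h_depth / d h_0 for a scalar chain of tanh layers.

    plain:    h_next = w * tanh(h)
    residual: h_next = h + w * tanh(h)
    """
    h = h0
    grad = 1.0
    for _ in range(depth):
        local = w * (1.0 - math.tanh(h) ** 2)  # derivative of w * tanh(h)
        if residual:
            local += 1.0  # the identity path adds 1 to the layer Jacobian
            h = h + w * math.tanh(h)
        else:
            h = w * math.tanh(h)
        grad *= local  # chain rule: multiply the per-layer Jacobians
    return grad


plain = chain_gradient(depth=30, w=0.5, h0=0.3, residual=False)
skip = chain_gradient(depth=30, w=0.5, h0=0.3, residual=True)
print(f"plain: {plain:.3e}  residual: {skip:.3e}")
```

In the plain chain every factor is at most 0.5, so the product shrinks geometrically with depth. With the identity term every factor is at least 1, so the gradient cannot vanish in this toy setting, although nothing here prevents it from growing.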
Interactive Visualization
Core Math (Optional Deep Dive)
If you want intuition first, start with the key equation and the visualization. Come back here for the full walkthrough.
Key Equation
The chain rule computes gradients efficiently via reverse-mode autodiff:
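For a composite function $f = f_L \circ f_{L-1} \circ \cdots \circ f_1$ with activations $h_\ell = f_\ell(h_{\ell-1})$ and a scalar loss $\mathcal{L}$ computed from $h_L$:
$$\frac{\partial \mathcal{L}}{\partial h_0} = \frac{\partial \mathcal{L}}{\partial h_L} \cdot \frac{\partial h_L}{\partial h_{L-1}} \cdots \frac{\partial h_1}{\partial h_0}$$
The backward pass propagates sensitivities $\delta_\ell = \partial \mathcal{L} / \partial h_\ell$ from the output toward the input:
$$\delta_\ell = \delta_{\ell+1} \cdot \frac{\partial h_{\ell+1}}{\partial h_\ell}$$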
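The key-equation recurrence, delta_l = delta_{l+1} * (d h_{l+1} / d h_l), can be sketched by hand on a small scalar chain. The three layers below (scale by 2, sine, square) are hypothetical choices for illustration, and the result is checked against a central finite difference rather than taken on trust.

```python
import math

# Hypothetical chain: h1 = 2*h0, h2 = sin(h1), h3 = h2**2, loss = h3.
# Each entry is (layer function, derivative of the layer w.r.t. its input).
layers = [
    (lambda h: 2.0 * h, lambda h: 2.0),
    (lambda h: math.sin(h), lambda h: math.cos(h)),
    (lambda h: h * h, lambda h: 2.0 * h),
]


def forward(h0):
    """Run the chain and keep every activation for the backward pass."""
    activations = [h0]
    for f, _ in layers:
        activations.append(f(activations[-1]))
    return activations


def backward(activations):
    """Apply the delta recurrence from the output down to the input."""
    delta = 1.0  # d loss / d h_L, since loss = h_L
    deltas = [delta]
    # Pair each layer with the activation it received as input.
    for (_, df), h_in in zip(reversed(layers), reversed(activations[:-1])):
        delta = delta * df(h_in)
        deltas.append(delta)
    return list(reversed(deltas))  # deltas[l] = d loss / d h_l


acts = forward(0.7)
deltas = backward(acts)

# Central finite difference as an independent check.
eps = 1e-6
numeric = (forward(0.7 + eps)[-1] - forward(0.7 - eps)[-1]) / (2 * eps)
print(deltas[0], numeric)
```

Note that `backward` needs the input of every layer, which is why `forward` returns the whole list of activations instead of only the final value.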
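Computational cost: one forward pass + one backward pass, a small constant multiple of the forward cost. For dense layers the backward pass costs roughly 2× the forward pass (one matrix product for the input gradient, one for the weight gradient), so a training step costs roughly 3× a forward pass.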
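To contrast this with forward mode, the sketch below differentiates a toy loss `(sum_i w_i * x_i)^2` two ways: forward mode with a minimal hand-written dual-number class, and reverse mode written out by hand. The `Dual` class and both function names are assumptions for this example, not a library API. The pass counts are simply tallied by the code.

```python
class Dual:
    """Minimal dual number (value, tangent) for forward-mode differentiation."""

    def __init__(self, val, tan=0.0):
        self.val, self.tan = val, tan

    def __add__(self, other):
        return Dual(self.val + other.val, self.tan + other.tan)

    def __mul__(self, other):
        # Product rule carried alongside the value.
        return Dual(self.val * other.val,
                    self.val * other.tan + self.tan * other.val)


def loss_dual(ws, xs):
    """loss = (sum_i w_i * x_i)^2, evaluated on Dual numbers."""
    s = Dual(0.0)
    for w, x in zip(ws, xs):
        s = s + w * Dual(x)
    return s * s


def forward_mode_grad(ws, xs):
    """One forward pass per parameter: seed each tangent in turn."""
    grads, passes = [], 0
    for i in range(len(ws)):
        seeded = [Dual(w, 1.0 if j == i else 0.0) for j, w in enumerate(ws)]
        grads.append(loss_dual(seeded, xs).tan)
        passes += 1
    return grads, passes


def reverse_mode_grad(ws, xs):
    """One forward pass plus one backward pass yields every partial."""
    s = sum(w * x for w, x in zip(ws, xs))  # forward pass
    d_s = 2.0 * s                           # backward: d loss / d s
    return [d_s * x for x in xs], 2         # d loss / d w_i = d_s * x_i


ws = [0.1, -0.2, 0.3, 0.4]
xs = [1.0, 2.0, 3.0, 4.0]
fwd, fwd_passes = forward_mode_grad(ws, xs)
rev, rev_passes = reverse_mode_grad(ws, xs)
print(fwd_passes, rev_passes)
```

Both routes give the same gradient, but the number of forward-mode passes grows with the number of parameters while the reverse-mode count stays fixed.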
Memory cost: must store all intermediate activations $h_\ell$ until the backward pass consumes them, so activation memory grows linearly with depth.
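Activation checkpointing trades that memory for extra compute. The sketch below is a simplified scalar version, assuming a hypothetical layer `tanh(1.5 * h)` and fixed-size segments: it keeps only every `segment`-th layer input and recomputes the rest during the backward pass. It is not how any particular framework implements checkpointing, only an illustration of the idea.

```python
import math


def f(h):
    """Hypothetical layer used for illustration."""
    return math.tanh(1.5 * h)


def df(h):
    """Derivative of f with respect to its input."""
    return 1.5 * (1.0 - math.tanh(1.5 * h) ** 2)


def grad_store_all(h0, depth):
    """Baseline: keep every layer input, then run the backward recurrence."""
    inputs = []
    h = h0
    for _ in range(depth):
        inputs.append(h)
        h = f(h)
    delta = 1.0
    for h_in in reversed(inputs):
        delta *= df(h_in)
    return delta, len(inputs)


def grad_checkpointed(h0, depth, segment):
    """Keep only every `segment`-th layer input; recompute the rest on demand."""
    checkpoints = {}
    h = h0
    for i in range(depth):
        if i % segment == 0:
            checkpoints[i] = h
        h = f(h)
    delta = 1.0
    peak = len(checkpoints)
    # Walk the segments from last to first.
    for start in sorted(checkpoints, reverse=True):
        end = min(start + segment, depth)
        # Recompute this segment's layer inputs from its checkpoint.
        seg_inputs = []
        h = checkpoints[start]
        for _ in range(start, end):
            seg_inputs.append(h)
            h = f(h)
        peak = max(peak, len(checkpoints) + len(seg_inputs))
        for h_in in reversed(seg_inputs):
            delta *= df(h_in)
    return delta, peak


g_all, stored_all = grad_store_all(0.2, depth=16)
g_ckpt, stored_ckpt = grad_checkpointed(0.2, depth=16, segment=4)
print(g_all, g_ckpt, stored_all, stored_ckpt)
```

The checkpointed version runs each layer's forward computation a second time, which is the price paid for holding fewer values at once.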