Legacy Concept Lab
Batch Normalization
Made very deep CNNs practical to train: the original ResNets are hard to train without it
#61 · BatchNorm · Core Training
key equation
\hat{x} = \frac{x - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}
Phase 3: Optimization & generalization
Concept 61 of 100
Why It Matters for Modern Models
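- Made very deep CNNs practical to train: the original ResNets are hard to train without it
- Allows higher learning rates: normalization keeps activations in a stable range
- The train/test discrepancy (batch statistics during training vs. running statistics at inference) causes subtle bugs, for example when a model is evaluated while still in training mode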
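The "stable range" point can be illustrated with a minimal sketch in plain Python. It assumes scalar activations and a hypothetical `batch_norm` helper (not a library function), and it leaves out the learned scale and shift. Multiplying every activation by 100, as a large weight update might, leaves the normalized output almost unchanged:

```python
import math

def batch_norm(xs, eps=1e-5):
    """Normalize scalar activations with the batch's own mean and variance."""
    m = len(xs)
    mu = sum(xs) / m
    var = sum((x - mu) ** 2 for x in xs) / m  # biased variance, as in the key equation
    return [(x - mu) / math.sqrt(var + eps) for x in xs]

activations = [0.5, -1.2, 3.0, 0.7]
scaled = [100.0 * a for a in activations]  # same batch, 100x larger magnitude

out_small = batch_norm(activations)
out_large = batch_norm(scaled)

# The two columns agree up to the tiny effect of eps.
for a, b in zip(out_small, out_large):
    print(f"{a:+.4f}  {b:+.4f}")
```

The next layer sees roughly zero-mean, unit-variance inputs in both cases, which is part of why larger step sizes are less likely to blow up the activations.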
What Tutorials Skip
What is still poorly explained in textbooks and papers:
- The original "internal covariate shift" explanation is likely wrong; later analysis suggests BatchNorm works largely by smoothing the loss landscape
- BatchNorm couples examples in a batch: each example's gradient depends on its batchmates
- Small batch sizes → noisy statistics → training instability. That is one reason LLMs use LayerNorm instead, which normalizes within each example rather than across the batch
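The coupling and small-batch points above can be checked numerically. The sketch below is plain Python under stated assumptions: scalar activations, a hypothetical `batch_norm` helper, and inputs drawn from a standard normal for the noise experiment. It shows the forward pass only; because an example's output depends on its batchmates, its gradient does too.

```python
import math
import random

def batch_norm(xs, eps=1e-5):
    """Normalize scalar activations with the batch's own mean and variance."""
    m = len(xs)
    mu = sum(xs) / m
    var = sum((x - mu) ** 2 for x in xs) / m
    return [(x - mu) / math.sqrt(var + eps) for x in xs]

# Coupling: the first example is 1.0 in both batches; only a batchmate changes.
batch_a = [1.0, 2.0, 3.0, 4.0]
batch_b = [1.0, 2.0, 3.0, 40.0]
first_a = batch_norm(batch_a)[0]
first_b = batch_norm(batch_b)[0]
print(f"same input, different batchmates: {first_a:+.3f} vs {first_b:+.3f}")

def spread_of_batch_mean(batch_size, trials=2000, seed=0):
    """Standard deviation of the batch mean across many random N(0, 1) batches."""
    rng = random.Random(seed)
    means = []
    for _ in range(trials):
        batch = [rng.gauss(0.0, 1.0) for _ in range(batch_size)]
        means.append(sum(batch) / batch_size)
    center = sum(means) / trials
    return math.sqrt(sum((v - center) ** 2 for v in means) / trials)

# Noise: the smaller the batch, the more the estimated mean jitters step to step.
noise_small = spread_of_batch_mean(2)
noise_large = spread_of_batch_mean(64)
print(f"spread of batch mean: size 2 -> {noise_small:.3f}, size 64 -> {noise_large:.3f}")
```

The spread shrinks roughly as one over the square root of the batch size, so very small batches feed each layer a noticeably different normalization at every step.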
Interactive Visualization
Core Math (Optional Deep Dive)
If you want intuition first, start with the key equation and the visualization. Come back here for the full walkthrough.
Key Equation
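BatchNorm normalizes across the batch dimension:
\hat{x} = \frac{x - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}
where \mu_B = \frac{1}{m} \sum_{i=1}^{m} x_i and \sigma_B^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_B)^2 are the mean and variance over the m examples in the batch.
Scale and shift with learned parameters:
y = \gamma \hat{x} + \beta
At inference: use running averages of \mu_B and \sigma_B^2 from training.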
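The training and inference paths can be put side by side in a short sketch. This is plain Python for a single scalar feature, not a framework API: the class name `BatchNorm1D`, the momentum of 0.1, the initial running statistics, and the use of the biased variance in the running estimate are all assumptions for illustration, and frameworks differ on these details.

```python
import math

class BatchNorm1D:
    """Batch normalization for one scalar feature (illustrative only)."""

    def __init__(self, momentum=0.1, eps=1e-5):
        self.gamma = 1.0         # learned scale, held fixed in this sketch
        self.beta = 0.0          # learned shift, held fixed in this sketch
        self.running_mean = 0.0  # assumed initial values
        self.running_var = 1.0
        self.momentum = momentum
        self.eps = eps

    def forward(self, xs, training):
        if training:
            m = len(xs)
            mu = sum(xs) / m
            var = sum((x - mu) ** 2 for x in xs) / m
            # Exponential moving average of the batch statistics.
            self.running_mean += self.momentum * (mu - self.running_mean)
            self.running_var += self.momentum * (var - self.running_var)
        else:
            # Inference: ignore the current batch, use the stored averages.
            mu, var = self.running_mean, self.running_var
        scale = self.gamma / math.sqrt(var + self.eps)
        return [scale * (x - mu) + self.beta for x in xs]

bn = BatchNorm1D()
batch = [10.0, 12.0, 14.0, 16.0]

train_out = bn.forward(batch, training=True)   # batch mean 13, batch variance 5
early_eval = bn.forward(batch, training=False)  # running stats after a single update

# Keep training on the same batch so the running averages settle.
for _ in range(200):
    bn.forward(batch, training=True)
settled_eval = bn.forward(batch, training=False)

print("train      :", [round(v, 3) for v in train_out])
print("early eval :", [round(v, 3) for v in early_eval])
print("settled    :", [round(v, 3) for v in settled_eval])
```

Right after the first step the running averages are still close to their initial values, so the inference-mode output is far from the training-mode output for the same inputs. Once the averages have converged the two modes agree, which is the train/test discrepancy in miniature.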