Legacy Concept Lab

Batch Normalization

Made training very deep CNNs practical; the original ResNets rely on it

Concept 61 of 100 | Core Training | Phase 3



Phase 3: Optimization & generalization

Why It Matters for Modern Models

What is still poorly explained in textbooks and papers:

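The original "internal covariate shift" explanation (Ioffe & Szegedy, 2015) is likely wrong: Santurkar et al. (2018) argue that BatchNorm helps mainly by smoothing the loss landscape
BatchNorm couples the examples in a batch: because $\mu_B$ and $\sigma_B^2$ are computed over all of them, each example's output and gradient depend on its batchmates
Small batch sizes give noisy estimates of $\mu_B$ and $\sigma_B^2$, which can destabilize training. This is one reason LLMs use LayerNorm instead: it normalizes each example over its own features, so it does not depend on the batch size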
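
The coupling and small-batch points above can be checked numerically. The following is a minimal NumPy sketch, not taken from either paper; the helper `normalize`, the toy batches, and the batch sizes 2 and 256 are illustrative assumptions.

```python
import numpy as np

def normalize(batch, eps=1e-5):
    """Normalize a 1-D batch of activations with its own mean and variance."""
    return (batch - batch.mean()) / np.sqrt(batch.var() + eps)

# Coupling: only example 1 differs between the two batches,
# yet the normalized value of example 0 changes as well.
batch_a = np.array([1.0, 2.0, 3.0, 4.0])
batch_b = np.array([1.0, 10.0, 3.0, 4.0])
out_a = normalize(batch_a)
out_b = normalize(batch_b)
coupled = not np.isclose(out_a[0], out_b[0])

# Noise: the batch mean fluctuates far more for small batches.
rng = np.random.default_rng(0)
data = rng.normal(loc=0.0, scale=1.0, size=(2000, 256))
spread_small = data[:, :2].mean(axis=1).std()  # 2000 batches of size 2
spread_large = data.mean(axis=1).std()         # 2000 batches of size 256

print(coupled, spread_small, spread_large)
```

For independent samples the spread of the batch mean shrinks roughly as one over the square root of the batch size, which is why the size-2 estimate is so much noisier here.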

If you want intuition first, start with the key equation and the visualization. Come back here for the full walkthrough.

Key Equation

$$\hat{x} = \frac{x - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$$

BatchNorm normalizes across the batch dimension:

$$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$$

where $m$ is the batch size and, for each feature, $\mu_B = \frac{1}{m}\sum_{i=1}^m x_i$ and $\sigma_B^2 = \frac{1}{m}\sum_{i=1}^m (x_i - \mu_B)^2$.

Scale and shift with learned parameters:

$$y_i = \gamma \hat{x}_i + \beta$$
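
The training-mode computation (batch statistics, normalization, then scale and shift) can be sketched in NumPy as follows. The function name `batchnorm_train`, the `(m, features)` input layout, and the default `eps` are assumptions made for illustration.

```python
import numpy as np

def batchnorm_train(x, gamma, beta, eps=1e-5):
    """Training-mode BatchNorm for x of shape (m, features)."""
    mu = x.mean(axis=0)                    # per-feature batch mean, mu_B
    var = x.var(axis=0)                    # biased variance (divides by m), sigma_B^2
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalize each feature
    return gamma * x_hat + beta            # learned scale and shift

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(64, 4))
gamma = np.full(4, 2.0)
beta = np.full(4, -1.0)
y = batchnorm_train(x, gamma, beta)

# Each output feature now has mean close to beta and standard deviation close to gamma.
print(y.mean(axis=0), y.std(axis=0))
```

Whatever the input's mean and scale, the output statistics are set by $\beta$ and $\gamma$ alone, which is what lets the network learn the distribution it wants for each feature.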

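At inference, use running averages of $\mu_B$ and $\sigma_B^2$ accumulated during training instead of the current batch's statistics, so the output for an example no longer depends on its batchmates.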
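
One way to keep such running averages is an exponential moving average updated on every training step. The sketch below assumes that scheme; the class name `BatchNorm1D`, the `momentum` value, and the initial running statistics are illustrative choices that the text above does not specify, and real libraries differ in details.

```python
import numpy as np

class BatchNorm1D:
    """Sketch of BatchNorm with running statistics (assumed EMA update)."""

    def __init__(self, num_features, momentum=0.1, eps=1e-5):
        self.gamma = np.ones(num_features)
        self.beta = np.zeros(num_features)
        self.running_mean = np.zeros(num_features)
        self.running_var = np.ones(num_features)
        self.momentum = momentum  # hypothetical hyperparameter for this sketch
        self.eps = eps

    def __call__(self, x, training):
        if training:
            mu, var = x.mean(axis=0), x.var(axis=0)
            # Blend the current batch statistics into the running averages.
            self.running_mean = (1 - self.momentum) * self.running_mean + self.momentum * mu
            self.running_var = (1 - self.momentum) * self.running_var + self.momentum * var
        else:
            # Inference: fixed statistics, independent of the current batch.
            mu, var = self.running_mean, self.running_var
        x_hat = (x - mu) / np.sqrt(var + self.eps)
        return self.gamma * x_hat + self.beta

rng = np.random.default_rng(0)
bn = BatchNorm1D(num_features=3)
for _ in range(200):
    bn(rng.normal(loc=5.0, scale=2.0, size=(32, 3)), training=True)

# A single example can be normalized at inference; no batch is required.
single = np.array([[5.0, 5.0, 5.0]])
out = bn(single, training=False)
print(bn.running_mean, out)
```

Because the inference path uses fixed statistics, a batch of one works, whereas in training mode a single example would have zero variance and normalize to zero.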

Ioffe & Szegedy, 2015, ICML

Santurkar et al., 2018, NeurIPS
