Legacy Concept Lab

Batch Normalization

Made very deep CNNs practical to train; the original ResNets rely on it


Why It Matters for Modern Models

  • Made very deep CNNs practical to train; the original ResNets rely on it
  • Allows higher learning rates: normalization keeps activations in a stable range
  • The train/test discrepancy (batch statistics during training vs. running statistics at inference) causes subtle bugs

What Tutorials Skip

What is still poorly explained in textbooks and papers:

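  • The original "internal covariate shift" explanation is likely wrong: Santurkar et al. (2018) argue that BatchNorm helps mainly by smoothing the loss landscape
  • BatchNorm couples examples in a batch: each example's output, and therefore its gradient, depends on its batchmates
  • Small batch sizes give noisy statistics, which destabilizes training. That is one reason LLMs use LayerNorm, which normalizes within each example, instead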
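The last two points can be checked numerically. The sketch below is illustrative only: `batchnorm` and `spread_of_batch_means` are hypothetical helper names, activations are plain scalars, and the Gaussian inputs are an assumption made for the demonstration. It shows that the same input gets a different output depending on its batchmates, and that the batch mean is much noisier for small batches.

```python
import math
import random

def batchnorm(batch, eps=1e-5):
    """Normalize scalar activations using the batch's own mean and variance."""
    m = len(batch)
    mu = sum(batch) / m
    var = sum((x - mu) ** 2 for x in batch) / m
    return [(x - mu) / math.sqrt(var + eps) for x in batch]

# Coupling: the activation 1.0 is first in both batches, but its
# normalized value depends on the other examples in the batch.
out_a = batchnorm([1.0, 2.0, 3.0])[0]    # below the batch mean, so negative
out_b = batchnorm([1.0, 0.0, -1.0])[0]   # above the batch mean, so positive
print(round(out_a, 2), round(out_b, 2))

def spread_of_batch_means(batch_size, trials=500, seed=0):
    """Standard deviation of the batch mean over many random batches.

    Inputs are drawn from a unit Gaussian (an assumption for illustration).
    """
    rng = random.Random(seed)
    means = []
    for _ in range(trials):
        batch = [rng.gauss(0.0, 1.0) for _ in range(batch_size)]
        means.append(sum(batch) / batch_size)
    center = sum(means) / trials
    return math.sqrt(sum((v - center) ** 2 for v in means) / trials)

# Noise: the estimate of the mean fluctuates far more at batch size 2.
print(spread_of_batch_means(2), spread_of_batch_means(64))
```

The spread shrinks roughly as one over the square root of the batch size, which is why the statistics become unreliable when only a few examples fit in a batch.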

Interactive Visualization

Core Math (Optional Deep Dive)

If you want intuition first, start with the key equation and the visualization. Come back here for the full walkthrough.

Key Equation
$$\hat{x} = \frac{x - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$$

BatchNorm normalizes each feature across the batch dimension:

$$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$$

where $m$ is the batch size, $\mu_B = \frac{1}{m}\sum_{i=1}^m x_i$, $\sigma_B^2 = \frac{1}{m}\sum_{i=1}^m (x_i - \mu_B)^2$, and $\epsilon$ is a small constant for numerical stability.
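As a rough sketch of the normalization step, the code below computes the batch mean and the biased (divide by m) batch variance for a single feature, then normalizes. The function names `batch_stats` and `normalize` and the default `eps=1e-5` are assumptions chosen for illustration, not part of any library.

```python
import math

def batch_stats(xs):
    """Return the batch mean and the biased (divide-by-m) batch variance."""
    m = len(xs)
    mu = sum(xs) / m
    var = sum((x - mu) ** 2 for x in xs) / m
    return mu, var

def normalize(xs, eps=1e-5):
    """Apply x_hat_i = (x_i - mu_B) / sqrt(var_B + eps) to each activation."""
    mu, var = batch_stats(xs)
    return [(x - mu) / math.sqrt(var + eps) for x in xs]

xs = [2.0, 4.0, 6.0, 8.0]          # one feature, batch of m = 4
x_hat = normalize(xs)

# After normalization the batch has mean ~0 and variance ~1
# (slightly below 1 because eps is added under the square root).
print(batch_stats(x_hat))
```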

Scale and shift with learned parameters:

$$y_i = \gamma \hat{x}_i + \beta$$
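A minimal sketch of the full training-time forward pass for one feature, with the scale and shift applied after normalization. `batchnorm_forward` is a hypothetical name, and `gamma` and `beta` are plain floats here, whereas in a real network they would be learned per feature. The second call illustrates why the learned parameters matter: choosing gamma as the batch standard deviation and beta as the batch mean undoes the normalization, so the layer can still represent the identity.

```python
import math

def batchnorm_forward(xs, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize with batch statistics, then apply y_i = gamma * x_hat_i + beta."""
    m = len(xs)
    mu = sum(xs) / m
    var = sum((x - mu) ** 2 for x in xs) / m
    x_hat = [(x - mu) / math.sqrt(var + eps) for x in xs]
    return [gamma * xh + beta for xh in x_hat]

xs = [2.0, 4.0, 6.0, 8.0]           # batch mean 5.0, batch variance 5.0

# Default parameters: output is just the normalized activations.
y_default = batchnorm_forward(xs)

# gamma = sqrt(var + eps) and beta = mu recover the original inputs.
y_identity = batchnorm_forward(xs, gamma=math.sqrt(5.0 + 1e-5), beta=5.0)
print(y_identity)
```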

At inference, use running averages of $\mu$ and $\sigma^2$ accumulated during training instead of the current batch's statistics.
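The train/inference switch might look like the sketch below. `BatchNorm1D` is a hypothetical single-feature class, not a library API. The exponential moving average with `momentum=0.1` is an assumed update rule, and the running variance here tracks the biased batch variance for simplicity; real frameworks differ in these details.

```python
import math

class BatchNorm1D:
    """Single-feature BatchNorm with running statistics (illustrative sketch)."""

    def __init__(self, momentum=0.1, eps=1e-5):
        self.gamma, self.beta = 1.0, 0.0
        self.running_mean, self.running_var = 0.0, 1.0
        self.momentum, self.eps = momentum, eps
        self.training = True

    def __call__(self, xs):
        if self.training:
            # Training: use this batch's statistics and update the running averages.
            m = len(xs)
            mu = sum(xs) / m
            var = sum((x - mu) ** 2 for x in xs) / m
            k = self.momentum
            self.running_mean = (1 - k) * self.running_mean + k * mu
            self.running_var = (1 - k) * self.running_var + k * var
        else:
            # Inference: ignore the batch and use the stored running averages.
            mu, var = self.running_mean, self.running_var
        scale = self.gamma / math.sqrt(var + self.eps)
        return [scale * (x - mu) + self.beta for x in xs]

bn = BatchNorm1D()
for _ in range(200):                 # running stats converge toward mean 5, var 5
    bn([2.0, 4.0, 6.0, 8.0])

bn.training = False
eval_out = bn([8.0])                 # uses running stats: about (8 - 5) / sqrt(5)

bn.training = True
train_out = bn([8.0])                # batch of one: mean is 8, variance is 0, output is 0
print(eval_out, train_out)
```

The same input produces different outputs in the two modes, which is the kind of subtle bug that appears when a model is evaluated without switching out of training mode.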

Canonical Papers

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

Ioffe & Szegedy, 2015, ICML

How Does Batch Normalization Help Optimization?

Santurkar et al., 2018, NeurIPS
