Legacy Concept Lab

Layer Normalization & RMSNorm

LayerNorm, or a variant such as RMSNorm, appears in nearly every transformer, where it stabilizes training by controlling activation scales.

Concept 54 of 100 | Core Training | Phase 2: Architecture fundamentals

Why It Matters for Modern Models

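  • Nearly every transformer normalizes activations with LayerNorm or a close variant, which keeps activation scales under control and stabilizes training
  • Pre-norm vs post-norm placement (normalizing before each sublayer vs after the residual add) affects gradient flow and training stability
  • RMSNorm skips the mean subtraction, saving compute at similar quality; it is used in modern LLMs such as LLaMA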
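
The placement difference in the second bullet can be sketched in a few lines. This is a minimal illustration in plain Python, not a real transformer block: `toy_sublayer` is a hypothetical stand-in for attention or an MLP, and the normalization here omits γ and β for brevity.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a list of floats to roughly zero mean and unit variance."""
    d = len(x)
    mu = sum(x) / d
    sigma = math.sqrt(sum((v - mu) ** 2 for v in x) / d)
    return [(v - mu) / (sigma + eps) for v in x]

def toy_sublayer(x):
    """Hypothetical stand-in for attention or an MLP: doubles each value and adds 1."""
    return [2.0 * v + 1.0 for v in x]

def post_norm_block(x, sublayer):
    # Post-norm: add the residual first, then normalize the sum.
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

def pre_norm_block(x, sublayer):
    # Pre-norm: normalize only the sublayer input; the residual path is untouched.
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]

x = [1.0, 2.0, 3.0, 4.0]
post = post_norm_block(x, toy_sublayer)  # output is re-normalized, so its mean is ~0
pre = pre_norm_block(x, toy_sublayer)    # output keeps the scale and offset of x
```

In the pre-norm version the input `x` reaches the output through a plain addition, so there is an unnormalized identity path from input to output. That path is the usual explanation for why gradients propagate more easily through deep pre-norm stacks.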

What Tutorials Skip

What is still poorly explained in textbooks and papers:

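  • LayerNorm makes a layer robust to scale: multiplying the weights that feed it by a positive constant leaves its output unchanged (up to the effect of ε)
  • The learned γ, β parameters restore a per-feature scale and offset, letting the network partially "undo" normalization where needed
  • Pre-norm (normalizing the input of each attention/MLP sublayer) tends to train more stably in deep networks than post-norm (normalizing after the residual add)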
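
The scale-robustness claim in the first bullet is easy to check numerically. The sketch below uses plain Python with made-up weights and inputs (`W`, `x`, and the helper `linear` are illustrative assumptions, not taken from any library), and it leaves out γ and β since they do not affect the argument.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a list of floats across its entries (no gamma/beta)."""
    d = len(x)
    mu = sum(x) / d
    sigma = math.sqrt(sum((v - mu) ** 2 for v in x) / d)
    return [(v - mu) / (sigma + eps) for v in x]

def linear(W, x):
    """Matrix-vector product for a weight matrix stored as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

# Arbitrary illustrative weights (4 outputs, 3 inputs) and input vector.
W = [[0.5, -1.0, 2.0],
     [1.5, 0.25, -0.75],
     [-2.0, 1.0, 0.5],
     [0.1, 0.2, 0.3]]
x = [1.0, -2.0, 3.0]

base = layer_norm(linear(W, x))

# Scale every weight by 10: pre-activations grow 10x, but mu and sigma grow
# by the same factor, so the normalized output is unchanged up to eps.
W_scaled = [[10.0 * w for w in row] for row in W]
scaled = layer_norm(linear(W_scaled, x))

max_diff = max(abs(a - b) for a, b in zip(base, scaled))
```

`max_diff` is not exactly zero because ε is added to σ without being scaled, but it is tiny compared with the outputs themselves. A negative constant would flip the sign of the output, which is why the invariance holds for positive scaling only.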

Interactive Visualization

Core Math (Optional Deep Dive)

If you want intuition first, start with the key equation and the visualization. Come back here for the full walkthrough.

Key Equation
$$\text{LN}(x) = \gamma \odot \frac{x - \mu}{\sigma}$$

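LayerNorm normalizes across the d features of each example independently. The full form adds a small constant ε for numerical stability and a learned shift β: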

$$\text{LN}(x) = \gamma \odot \frac{x - \mu}{\sigma + \epsilon} + \beta$$

where $\mu = \frac{1}{d}\sum_i x_i$ and $\sigma = \sqrt{\frac{1}{d}\sum_i (x_i - \mu)^2}$.
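
The formula above translates almost line for line into code. The following is a minimal plain-Python sketch for a single example, written to mirror the equation rather than to be fast; the function name and the default `eps=1e-5` are assumptions for illustration. It follows this page's convention of adding ε to σ, whereas some libraries (PyTorch's `torch.nn.LayerNorm`, for example) add ε to the variance inside the square root.

```python
import math

def layer_norm(x, gamma, beta, eps=1e-5):
    """LayerNorm over the features of one example: gamma * (x - mu) / (sigma + eps) + beta."""
    d = len(x)
    mu = sum(x) / d                                         # mean over features
    sigma = math.sqrt(sum((xi - mu) ** 2 for xi in x) / d)  # population std over features
    return [g * (xi - mu) / (sigma + eps) + b
            for xi, g, b in zip(x, gamma, beta)]

x = [2.0, 4.0, 6.0, 8.0]

# With gamma = 1 and beta = 0 the output has mean ~0 and variance ~1.
y = layer_norm(x, gamma=[1.0] * 4, beta=[0.0] * 4)

# A non-trivial gamma and beta rescale and shift the normalized features.
y_affine = layer_norm(x, gamma=[2.0] * 4, beta=[1.0] * 4)
```

Here `y` comes out at roughly `[-1.34, -0.45, 0.45, 1.34]`, and `y_affine` has mean 1 and standard deviation close to 2, which is the per-feature scale and offset that γ and β reintroduce.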

RMSNorm (used in LLaMA, etc.) skips mean centering:

$$\text{RMSNorm}(x) = \gamma \odot \frac{x}{\text{RMS}(x) + \epsilon}, \quad \text{RMS}(x) = \sqrt{\frac{1}{d}\sum_i x_i^2}$$
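
As a companion to the formula, here is a minimal plain-Python sketch of RMSNorm for a single example. The function name and the default `eps=1e-5` are illustrative assumptions, and the placement of ε follows the equation on this page; real implementations may place it inside the square root instead.

```python
import math

def rms_norm(x, gamma, eps=1e-5):
    """RMSNorm over the features of one example: gamma * x / (RMS(x) + eps)."""
    rms = math.sqrt(sum(xi * xi for xi in x) / len(x))  # no mean subtraction
    return [g * xi / (rms + eps) for xi, g in zip(x, gamma)]

gamma = [1.0] * 4

# Input with a non-zero mean: the output has RMS ~1 but keeps a non-zero mean,
# because nothing is subtracted before scaling.
y = rms_norm([2.0, 4.0, 6.0, 8.0], gamma)
y_rms = math.sqrt(sum(v * v for v in y) / len(y))
y_mean = sum(y) / len(y)

# Input that already has zero mean: RMS(x) equals sigma, so RMSNorm matches
# LayerNorm with beta = 0 on this input.
y_centered = rms_norm([-3.0, -1.0, 1.0, 3.0], gamma)
```

The saving over LayerNorm is the skipped mean computation and subtraction, and there is no β to learn or apply in this form.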

Canonical Papers

Layer Normalization

Ba, Kiros, Hinton (2016), arXiv

Root Mean Square Layer Normalization

Zhang & Sennrich (2019), NeurIPS
