Legacy Concept Lab

Layer Normalization & RMSNorm

LayerNorm, or a close variant such as RMSNorm, appears in nearly every transformer; it stabilizes training by controlling activation scales.

Concept 54 of 100 · Core Training · Phase 2: Architecture fundamentals

Why It Matters for Modern Models

LayerNorm or a close variant appears in nearly every transformer; it stabilizes training by controlling activation scales.
Pre-norm vs. post-norm placement affects gradient flow and training stability.
RMSNorm saves compute by skipping the mean subtraction, with similar quality; it is used in recent LLMs such as LLaMA.
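
The placement difference can be made concrete with a small sketch. This is an illustration under stated assumptions, not a real transformer block: `sublayer` is a hypothetical stand-in for attention or an MLP, and `layer_norm` omits the learned parameters.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a list of floats across its features (no gamma/beta)."""
    d = len(x)
    mu = sum(x) / d
    sigma = math.sqrt(sum((v - mu) ** 2 for v in x) / d)
    return [(v - mu) / (sigma + eps) for v in x]

def sublayer(x):
    """Hypothetical stand-in for attention or an MLP: a fixed elementwise map."""
    return [2.0 * v + 1.0 for v in x]

def post_norm_block(x):
    # Post-norm: add the residual first, then normalize the sum.
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

def pre_norm_block(x):
    # Pre-norm: normalize only the sublayer input; the residual path is untouched.
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]

x = [1.0, 2.0, 3.0, 4.0]
print("post-norm:", post_norm_block(x))
print("pre-norm: ", pre_norm_block(x))
```

In the pre-norm variant the input `x` reaches the output through a plain addition, while in the post-norm variant every path from input to output passes through the normalization.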

What is still poorly explained in textbooks and papers:

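LayerNorm makes a network robust to scale: multiplying the weights and bias of the linear layer that feeds it by a positive constant leaves the output unchanged, up to the small effect of ε.
The learned γ and β parameters let the network rescale and shift each feature, so it can "undo" normalization where needed.
Pre-norm (normalizing before attention/MLP, inside the residual branch) is more stable for deep networks because the residual path stays an identity.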
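
The scale-robustness claim can be checked numerically. The sketch below assumes a bias-free linear layer feeding directly into LayerNorm without γ or β; the matrix `W`, input `x`, and constant `c` are arbitrary illustrative values.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a list of floats across its features (no gamma/beta)."""
    d = len(x)
    mu = sum(x) / d
    sigma = math.sqrt(sum((v - mu) ** 2 for v in x) / d)
    return [(v - mu) / (sigma + eps) for v in x]

def matvec(W, x):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

W = [[0.5, -1.0, 2.0],
     [1.5, 0.25, -0.5],
     [-2.0, 1.0, 0.75],
     [0.1, 0.2, 0.3]]
x = [1.0, -2.0, 0.5]
c = 10.0
W_scaled = [[c * w for w in row] for row in W]

out = layer_norm(matvec(W, x))
out_scaled = layer_norm(matvec(W_scaled, x))

# The pre-normalization activations differ by a factor of c,
# but the normalized outputs agree up to the effect of eps.
max_diff = max(abs(a - b) for a, b in zip(out, out_scaled))
print("max difference after LayerNorm:", max_diff)
```

The match is exact only when `eps` is zero; with a nonzero `eps` the two outputs differ by a tiny amount that shrinks as the activations grow relative to `eps`.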

If you want intuition first, start with the key equation and the visualization. Come back here for the full walkthrough.

Key Equation

\text{LN}(x) = \gamma \odot \frac{x - \mu}{\sigma}

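LayerNorm normalizes across the features of each example. Written out in full, with a small constant $\epsilon$ for numerical stability and a learned shift $\beta$: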

\text{LN}(x) = \gamma \odot \frac{x - \mu}{\sigma + \epsilon} + \beta

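where $\mu = \frac{1}{d}\sum_i x_i$ and $\sigma = \sqrt{\frac{1}{d}\sum_i (x_i - \mu)^2}$ are the mean and standard deviation over the $d$ features of a single example, $\gamma$ and $\beta$ are learned per-feature scale and shift vectors, and $\epsilon$ is a small constant that prevents division by zero.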
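
A minimal sketch of this formula for a single example, using plain Python lists. The function name, the default `eps`, and the sample values are assumptions for illustration. The code follows the equation as written here, with ε added to σ; common library implementations such as PyTorch's `LayerNorm` instead place ε inside the square root, which gives nearly identical results.

```python
import math

def layer_norm(x, gamma, beta, eps=1e-5):
    """LN(x) = gamma * (x - mu) / (sigma + eps) + beta for one example."""
    d = len(x)
    mu = sum(x) / d                                          # mean over features
    sigma = math.sqrt(sum((xi - mu) ** 2 for xi in x) / d)   # std over features
    return [g * (xi - mu) / (sigma + eps) + b
            for xi, g, b in zip(x, gamma, beta)]

x = [2.0, 4.0, 6.0, 8.0]

# With gamma = 1 and beta = 0 the output has roughly zero mean and unit variance.
y = layer_norm(x, gamma=[1.0] * 4, beta=[0.0] * 4)
y_mean = sum(y) / len(y)
y_std = math.sqrt(sum((v - y_mean) ** 2 for v in y) / len(y))
print(y, y_mean, y_std)

# A different gamma and beta rescale and shift each normalized feature.
z = layer_norm(x, gamma=[2.0] * 4, beta=[1.0] * 4)
print(z)
```

With every entry of γ set to 2 and β to 1, each output is the corresponding entry of `y` doubled and shifted by 1, which is the sense in which the learned parameters can restore a scale and offset after normalization.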

RMSNorm (used in LLaMA and other models) skips mean centering and drops the shift $\beta$:

\text{RMSNorm}(x) = \gamma \odot \frac{x}{\text{RMS}(x) + \epsilon}, \quad \text{RMS}(x) = \sqrt{\frac{1}{d}\sum_i x_i^2}
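
The same kind of sketch for RMSNorm, again for a single example with plain Python lists. The function name, default `eps`, and inputs are illustrative assumptions, and ε is added to the RMS as in the equation above.

```python
import math

def rms_norm(x, gamma, eps=1e-5):
    """RMSNorm(x) = gamma * x / (RMS(x) + eps) for one example."""
    d = len(x)
    rms = math.sqrt(sum(xi * xi for xi in x) / d)   # no mean subtraction
    return [g * xi / (rms + eps) for xi, g in zip(x, gamma)]

ones = [1.0] * 4

# Input with a nonzero mean: the output has unit RMS, but its mean is not removed.
y = rms_norm([2.0, 4.0, 6.0, 8.0], ones)
y_rms = math.sqrt(sum(v * v for v in y) / len(y))
y_mean = sum(y) / len(y)
print(y, y_rms, y_mean)

# Input that already has zero mean: RMS(x) equals the standard deviation,
# so the result matches LayerNorm with gamma = 1 and beta = 0.
centered = [-3.0, -1.0, 1.0, 3.0]
y_centered = rms_norm(centered, ones)
print(y_centered)
```

Compared with LayerNorm, this needs one reduction over the features instead of two, which is where the compute saving comes from.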

Ba, Kiros, Hinton (2016), arXiv

Zhang & Sennrich (2019), NeurIPS
