Legacy Concept Lab
Layer Normalization & RMSNorm
Nearly every transformer uses LayerNorm or a variant such as RMSNorm; it stabilizes training by controlling activation scales
#54 · LayerNorm · Core Training
key equation
$$\text{LN}(x) = \gamma \odot \frac{x - \mu}{\sigma}$$
Phase 2: Architecture fundamentals
Concept 54 of 100
Why It Matters for Modern Models
- Nearly every transformer normalizes activations with LayerNorm or a variant, which stabilizes training by keeping activation scales under control
- Pre-norm vs post-norm placement affects gradient flow and training stability
- RMSNorm saves compute by skipping the mean subtraction, with similar quality; it is used in modern efficient LLMs such as LLaMA
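The difference between the two placements is easiest to see side by side. The sketch below is a minimal illustration in plain Python, not a real transformer block: `sublayer` is a hypothetical stand-in for attention or an MLP, and `layer_norm` omits the learned γ and β for brevity.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one feature vector to zero mean and (nearly) unit variance."""
    d = len(x)
    mu = sum(x) / d
    var = sum((v - mu) ** 2 for v in x) / d
    return [(v - mu) / math.sqrt(var + eps) for v in x]

def sublayer(x):
    """Hypothetical stand-in for attention or an MLP (a fixed elementwise map)."""
    return [2.0 * v + 1.0 for v in x]

def post_norm_block(x):
    # Post-norm: add the residual first, then normalize the sum.
    y = [a + b for a, b in zip(x, sublayer(x))]
    return layer_norm(y)

def pre_norm_block(x):
    # Pre-norm: normalize only the sublayer input; the residual path
    # carries x through unchanged.
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]

x = [1.0, 2.0, 4.0]
post_out = post_norm_block(x)
pre_out = pre_norm_block(x)
```

In the pre-norm block the input `x` reaches the output through a plain addition, so gradients have an unnormalized identity path through every layer. In the post-norm block every signal, including the residual, passes through the normalization.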
What Tutorials Skip
What is still poorly explained in textbooks and papers:
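- LayerNorm makes the network invariant to the scale of its input: multiplying the pre-normalization activations by a positive constant (for example, by scaling the weights and bias of the preceding linear layer) leaves the output unchanged, up to the small effect of ε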
- The learned γ, β parameters let the network "undo" normalization where needed
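- Pre-norm (normalize the input to each attention/MLP sublayer, inside the residual branch) is more stable for deep networks than post-norm (normalize after the residual addition)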
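The first two points can be checked numerically. The following is a small sketch in plain Python with made-up input values. It assumes the usual formulation with ε inside the square root, and it sets γ and β by hand rather than learning them.

```python
import math

def layer_norm(x, gamma=None, beta=None, eps=1e-5):
    """LayerNorm over one feature vector; gamma/beta default to identity."""
    d = len(x)
    mu = sum(x) / d
    sigma = math.sqrt(sum((v - mu) ** 2 for v in x) / d + eps)
    gamma = gamma if gamma is not None else [1.0] * d
    beta = beta if beta is not None else [0.0] * d
    return [g * (v - mu) / sigma + b for v, g, b in zip(x, gamma, beta)]

x = [0.5, -1.0, 3.0, 2.5]

# Scale invariance: multiplying the input by a positive constant
# changes the output only through eps.
scaled = [10.0 * v for v in x]
max_diff = max(abs(a - b) for a, b in zip(layer_norm(x), layer_norm(scaled)))

# "Undoing" normalization: for this one input, gamma = sigma and beta = mu
# reproduce x exactly (up to floating-point rounding).
d = len(x)
mu = sum(x) / d
sigma = math.sqrt(sum((v - mu) ** 2 for v in x) / d + 1e-5)
recovered = layer_norm(x, gamma=[sigma] * d, beta=[mu] * d)
```

Note that γ and β are fixed per feature after training, while μ and σ vary per example, so in practice the network can only undo normalization approximately, for the statistics it typically sees.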
Interactive Visualization
Core Math (Optional Deep Dive)
If you want intuition first, start with the key equation and the visualization. Come back here for the full walkthrough.
Key Equation
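LayerNorm normalizes across the features of each example, then applies a learned per-feature scale γ and shift β (the short form of the key equation omits β):
$$\text{LN}(x) = \gamma \odot \frac{x - \mu}{\sigma} + \beta$$
where $\mu = \frac{1}{d}\sum_{i=1}^{d} x_i$, $\sigma = \sqrt{\frac{1}{d}\sum_{i=1}^{d}(x_i - \mu)^2 + \epsilon}$, $d$ is the number of features, and $\epsilon$ is a small constant for numerical stability.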
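As a rough illustration of "across features for each example", the sketch below applies LayerNorm row by row to a tiny batch. It is plain Python for readability and assumes ε sits inside the square root; the batch values, `gamma`, and `beta` are made up, and a real implementation would use a tensor library.

```python
import math

def layer_norm(x, gamma, beta, eps=1e-5):
    """Apply LayerNorm to one example x (a list of d features)."""
    d = len(x)
    mu = sum(x) / d                                             # mean over features
    sigma = math.sqrt(sum((v - mu) ** 2 for v in x) / d + eps)  # std over features
    return [g * (v - mu) / sigma + b for v, g, b in zip(x, gamma, beta)]

# Two examples with very different scales; each is normalized independently,
# using only its own mean and standard deviation.
batch = [[1.0, 2.0, 3.0, 4.0],
         [10.0, 0.0, -10.0, 0.0]]
gamma = [1.0] * 4   # identity scale
beta = [0.0] * 4    # zero shift

normed = [layer_norm(row, gamma, beta) for row in batch]
row_means = [sum(row) / len(row) for row in normed]
row_vars = [sum(v * v for v in row) / len(row) for row in normed]
```

With identity γ and zero β, every row of `normed` has mean close to 0 and variance close to 1, regardless of the scale of the original row. No statistics are shared across the batch, which is what distinguishes LayerNorm from BatchNorm.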
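RMSNorm (used in LLaMA, among others) skips mean centering and the shift β, dividing by the root mean square of the features instead:
$$\text{RMSNorm}(x) = \gamma \odot \frac{x}{\sqrt{\frac{1}{d}\sum_{i=1}^{d} x_i^2 + \epsilon}}$$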
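A minimal plain-Python sketch of RMSNorm, assuming the common formulation in which the input is divided by the root mean square of its features (with ε inside the square root) and scaled by a learned γ, with no shift term. The input values are made up.

```python
import math

def rms_norm(x, gamma, eps=1e-6):
    """RMSNorm over one feature vector: rescale by the root mean square, no centering."""
    d = len(x)
    rms = math.sqrt(sum(v * v for v in x) / d + eps)
    return [g * v / rms for v, g in zip(x, gamma)]

x = [3.0, 4.0, 3.0, 4.0]           # mean square = (9 + 16 + 9 + 16) / 4 = 12.5
y = rms_norm(x, gamma=[1.0] * 4)

mean_sq = sum(v * v for v in y) / len(y)   # close to 1: the scale is normalized
mean_y = sum(y) / len(y)                   # not 0: the mean is left untouched
```

Compared with LayerNorm, this saves the mean computation and the subtraction. The output has unit root mean square, but its mean is whatever the input's mean was after rescaling.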