Legacy Concept Lab

Weight Initialization: Xavier, He & µP

Bad initialization → vanishing/exploding activations → training fails immediately

Concept 48 of 100 · Optimization · Phase 3

Why It Matters for Modern Models

  • Bad initialization → vanishing/exploding activations → training fails immediately
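  • µP is how labs scale hyperparameters: tune on a small proxy model, then transfer the settings to full-scale training
  • Connects to NTK theory: at infinite width under the NTK parameterization, the tangent kernel is deterministic at initialization and stays fixed during training, so training reduces to kernel gradient descent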

What Tutorials Skip

What is still poorly explained in textbooks and papers:

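  • The "1/√n" scaling (weight standard deviation 1/√n_in, i.e. variance 1/n_in) keeps variance constant through linear layers: Var(output) ≈ Var(input)
  • ReLU zeroes about half of the pre-activations, which halves the second moment of the signal, so He init doubles the weight variance to 2/n_in to compensate
  • µP insight: per-layer learning rates must scale with layer width (shrinking as 1/width for hidden weight matrices under Adam) so that update magnitudes stay constant as the model grows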
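
The first two points can be checked numerically. The following is a minimal sketch, not a reference implementation: it pushes one random input through a stack of ReLU layers using only the Python standard library, and the function name, width, depth, and seed are all illustrative assumptions.

```python
import math
import random


def relu_stack_second_moment(width, depth, weight_var_numerator, seed=0):
    """Return the mean squared activation after `depth` ReLU layers.

    Weights are drawn i.i.d. from N(0, weight_var_numerator / width).
    The layer sizes and the seed are illustrative assumptions.
    """
    rng = random.Random(seed)
    std = math.sqrt(weight_var_numerator / width)
    # Unit-variance Gaussian input vector.
    x = [rng.gauss(0.0, 1.0) for _ in range(width)]
    for _ in range(depth):
        # Each output unit is ReLU(sum_i W_ji * x_i) with freshly sampled weights.
        x = [
            max(0.0, sum(rng.gauss(0.0, std) * xi for xi in x))
            for _ in range(width)
        ]
    return sum(v * v for v in x) / width


# Variance 1/n_in: the signal is roughly halved at every ReLU layer.
naive = relu_stack_second_moment(width=128, depth=10, weight_var_numerator=1.0)
# Variance 2/n_in (He): the signal scale is roughly preserved.
he = relu_stack_second_moment(width=128, depth=10, weight_var_numerator=2.0)

print("1/n_in init:", naive)
print("2/n_in init:", he)
```

Because both runs reuse the same seed, they differ only through the weight scale, so the ratio of the two printed values is 2^depth up to floating-point rounding. That is the factor of 2 per layer that He initialization is meant to restore.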

Interactive Visualization

Core Math (Optional Deep Dive)

If you want intuition first, start with the key equation and the visualization. Come back here for the full walkthrough.

Key Equation
$$W \sim \mathcal{N}(0, 2/n_{in})$$

Xavier/Glorot (for tanh/sigmoid):

$$W \sim \mathcal{U}\left(-\sqrt{\frac{6}{n_{in} + n_{out}}}, \sqrt{\frac{6}{n_{in} + n_{out}}}\right)$$
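
As a quick check on where the 6 comes from: a uniform distribution on [-a, a] has variance a²/3, so this bound corresponds to a weight variance of 2/(n_in + n_out). That value sits between the forward-pass condition Var(W) = 1/n_in and the backward-pass condition Var(W) = 1/n_out. The symbol a is introduced here only for the derivation.

```latex
\mathrm{Var}\big[\mathcal{U}(-a, a)\big] = \frac{a^2}{3},
\qquad
a = \sqrt{\frac{6}{n_{in} + n_{out}}}
\;\Longrightarrow\;
\mathrm{Var}(W) = \frac{1}{3} \cdot \frac{6}{n_{in} + n_{out}} = \frac{2}{n_{in} + n_{out}}
```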

He/Kaiming (for ReLU):

$$W \sim \mathcal{N}\left(0, \frac{2}{n_{in}}\right)$$
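
The factor of 2 follows from a short variance argument, sketched here under simplifying assumptions: weights are zero-mean, i.i.d., and independent of the inputs, biases are zero, and each pre-activation is symmetric about zero. The symbols z^{(l)} (pre-activation of layer l) and x^{(l)} = ReLU(z^{(l)}) are notation introduced for this sketch.

```latex
\begin{aligned}
z^{(l)}_j &= \sum_{i=1}^{n_{in}} W_{ji}\, x^{(l-1)}_i
  &&\Rightarrow\quad
  \mathrm{Var}\big(z^{(l)}\big) = n_{in}\, \mathrm{Var}(W)\, \mathbb{E}\big[(x^{(l-1)})^2\big] \\
x^{(l-1)} &= \max\big(0, z^{(l-1)}\big)
  &&\Rightarrow\quad
  \mathbb{E}\big[(x^{(l-1)})^2\big] = \tfrac{1}{2}\, \mathrm{Var}\big(z^{(l-1)}\big) \\
\mathrm{Var}\big(z^{(l)}\big) &= \tfrac{1}{2}\, n_{in}\, \mathrm{Var}(W)\, \mathrm{Var}\big(z^{(l-1)}\big)
  &&\Rightarrow\quad
  \mathrm{Var}(W) = \frac{2}{n_{in}} \ \text{ keeps } \ \mathrm{Var}\big(z^{(l)}\big) = \mathrm{Var}\big(z^{(l-1)}\big)
\end{aligned}
```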

µP (Maximal Update Parameterization) scales both the initialization and the learning rate with width. For a hidden weight matrix trained with Adam, where n_in and n_out both grow with width:

$$W \sim \mathcal{N}(0, 1/n_{in}), \quad \eta_W = \eta_{base} / n_{out}$$

This enables hyperparameter transfer: tune on a small model, then reuse the same base hyperparameters at larger width.
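
The transfer recipe can be sketched for one square hidden layer. This is an illustration under stated assumptions, not the parameterization from any particular library: the function name, `base_width`, and `base_lr` are hypothetical. The layer is assumed square (n_in = n_out = width), the optimizer is assumed Adam-style, and the 1/width learning-rate rule is normalized so that it returns `base_lr` at the proxy width. Input and output layers follow different rules and are not covered here.

```python
import math


def mup_hidden_layer_settings(width, base_width, base_lr):
    """Return (init_std, lr) for a square hidden layer of the given width.

    Assumptions (hypothetical helper, for illustration only):
      * n_in == n_out == width
      * Adam-style optimizer
      * base_lr was tuned on a proxy model of width base_width
    """
    init_std = math.sqrt(1.0 / width)   # W ~ N(0, 1/n_in)
    lr = base_lr * base_width / width   # learning rate proportional to 1/width
    return init_std, lr


# Tune once at width 256, then reuse the same base_lr at larger widths.
for width in (256, 1024, 4096):
    std, lr = mup_hidden_layer_settings(width, base_width=256, base_lr=1e-3)
    print(f"width={width:5d}  init_std={std:.5f}  lr={lr:.2e}")
```

The design choice is that only `base_lr` is tuned; the width-dependent factor is applied mechanically, which is what lets the proxy model's setting carry over.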

Canonical Papers

Understanding the difficulty of training deep feedforward neural networks

Glorot & Bengio, 2010, AISTATS

Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer

Yang et al., 2022, NeurIPS
