Legacy Concept Lab

Weight Initialization: Xavier, He & µP

Bad initialization → vanishing/exploding activations → training fails immediately

Concept 48 of 100 | Optimization | Phase 3

Why It Matters for Modern Models

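Bad initialization makes activations and gradients vanish or explode as depth grows, so training stalls or diverges from the very first steps
µP is how labs scale hyperparameters: tune on a small proxy model, then apply the same settings to full-scale training
Connects to NTK theory: at infinite width under the NTK scaling, the tangent kernel is deterministic at initialization and stays fixed during training, so training reduces to kernel regression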

What is still poorly explained in textbooks and papers:

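The "1/√n" scaling (weight standard deviation 1/√n_in, i.e. variance 1/n_in) keeps variance constant through layers: Var(output) ≈ Var(input)
ReLU zeroes roughly half the pre-activations, which halves the second moment of the signal, so He init doubles the weight variance (2/n_in instead of 1/n_in) to compensate
µP insight: the learning rate must also scale with layer width (for hidden weights under Adam, inversely with width) so that each update changes activations by a width-independent amount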
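The first two points above can be checked numerically. The following is a minimal NumPy sketch, not a reference implementation: it pushes random inputs through a stack of ReLU layers and records the mean squared activation after each layer. The function name `second_moment_by_depth` and the width, depth, and batch values are illustrative assumptions.

```python
import numpy as np

def second_moment_by_depth(weight_var_scale, width=512, depth=20, batch=256, seed=0):
    """Push random inputs through a ReLU MLP and record mean(h**2) after each layer.

    Weights are drawn from N(0, weight_var_scale / width), so
    weight_var_scale=1 gives the plain 1/n rule and 2 gives the He rule.
    """
    rng = np.random.default_rng(seed)
    h = rng.standard_normal((batch, width))  # inputs with unit second moment
    moments = []
    for _ in range(depth):
        W = rng.standard_normal((width, width)) * np.sqrt(weight_var_scale / width)
        h = np.maximum(h @ W, 0.0)  # linear layer followed by ReLU
        moments.append(float(np.mean(h ** 2)))
    return moments

lecun_like = second_moment_by_depth(1.0)  # Var(W) = 1/n_in
he_like = second_moment_by_depth(2.0)     # Var(W) = 2/n_in
print(f"1/n init, second moment after layer 20: {lecun_like[-1]:.2e}")
print(f"2/n init, second moment after layer 20: {he_like[-1]:.2e}")
```

With the 1/n rule each ReLU layer roughly halves the second moment, so the signal shrinks geometrically with depth; with the 2/n rule it stays on the order of 1.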

If you want intuition first, start with the key equation and the visualization. Come back here for the full walkthrough.

Key Equation

W \sim \mathcal{N}(0, 2/n_{in})

Xavier/Glorot (for tanh/sigmoid):

W \sim \mathcal{U}\left(-\sqrt{\frac{6}{n_{in} + n_{out}}}, \sqrt{\frac{6}{n_{in} + n_{out}}}\right)
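As a worked step showing where the 6 comes from: a uniform distribution on [-a, a] has variance a²/3, so with the bound above

\mathrm{Var}(W) = \frac{a^2}{3} = \frac{1}{3} \cdot \frac{6}{n_{in} + n_{out}} = \frac{2}{n_{in} + n_{out}}

This is one over the average of fan-in and fan-out, a compromise between 1/n_{in} (which preserves variance in the forward pass) and 1/n_{out} (which preserves it in the backward pass).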

He/Kaiming (for ReLU):

W \sim \mathcal{N}\left(0, \frac{2}{n_{in}}\right)
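The two rules above can be written as small samplers. This is a sketch in NumPy under the assumption that weights are stored with shape (n_in, n_out); the function names `xavier_uniform` and `he_normal` are chosen here for illustration and are not tied to any particular library.

```python
import numpy as np

def xavier_uniform(n_in, n_out, rng):
    """Sample an (n_in, n_out) matrix from U(-a, a) with a = sqrt(6 / (n_in + n_out))."""
    a = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-a, a, size=(n_in, n_out))

def he_normal(n_in, n_out, rng):
    """Sample an (n_in, n_out) matrix from N(0, 2 / n_in)."""
    return rng.standard_normal((n_in, n_out)) * np.sqrt(2.0 / n_in)

rng = np.random.default_rng(0)
W_x = xavier_uniform(400, 600, rng)
W_h = he_normal(400, 600, rng)

# Empirical variances should sit close to the targets from the formulas.
print("Xavier empirical var:", W_x.var(), "target:", 2.0 / (400 + 600))
print("He empirical var:    ", W_h.var(), "target:", 2.0 / 400)
```

Comparing the empirical variance with the target is a cheap sanity check when writing a custom initializer.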

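µP (Maximal Update Parameterization): Scales init AND learning rate by width. For hidden weight matrices trained with Adam (input and output layers, and plain SGD, follow different rules):

W \sim \mathcal{N}(0, 1/n_{in}), \quad \eta_W \propto \eta_{base} / n_{in}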

This enables hyperparameter transfer: tune on a small model, then reuse the same base hyperparameters on a large one.
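To make the width scaling concrete, here is a hedged sketch for a single square hidden layer (n_in = n_out = width). It covers only hidden weight matrices; input and output layers are not shown. The helper `mup_hidden_layer`, the `base_width` normalization (so the tuned learning rate is recovered at the proxy width), and the numeric values are assumptions for illustration.

```python
import numpy as np

def mup_hidden_layer(width, base_lr, base_width, rng):
    """Return (W, lr) for a square hidden layer under a muP-style width rule.

    Assumptions: init variance 1/width, and a per-layer learning rate of
    base_lr * base_width / width, which equals base_lr at the proxy width
    and shrinks inversely with width beyond it.
    """
    W = rng.standard_normal((width, width)) * np.sqrt(1.0 / width)
    lr = base_lr * base_width / width
    return W, lr

rng = np.random.default_rng(0)
base_lr, base_width = 1e-3, 256  # hypothetical values tuned on a small proxy
for width in (256, 512, 1024):
    W, lr = mup_hidden_layer(width, base_lr, base_width, rng)
    print(f"width={width:5d}  init std={W.std():.4f}  lr={lr:.2e}")
```

In a real training setup the returned learning rate would be assigned to that layer's parameter group in the optimizer, while the base value stays whatever was tuned on the proxy.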

Glorot & Bengio, 2010, AISTATS

Yang et al., 2022, NeurIPS

Explore this concept from different angles — like a mathematician would.