Legacy Concept Lab
Weight Initialization: Xavier, He & µP
Bad initialization → vanishing/exploding activations → training fails immediately
#48 Init Optimization
key equation
W \sim \mathcal{N}(0, 2/n_{in})
Phase 3: Optimization & generalization
Concept 48 of 100
Why It Matters for Modern Models
- Bad initialization makes activations (and gradients) vanish or explode with depth, so training stalls or diverges from the very first steps
- µP is how labs scale hyperparameters: tune on a small proxy model, then transfer the settings to the full-scale training run
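- Connects to NTK theory: at infinite width with proper scaling, the neural tangent kernel becomes deterministic at initialization and stays fixed during training, so the training dynamics reduce to kernel gradient descent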
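The first point can be checked numerically. The following is a minimal sketch, assuming a stack of purely linear layers with no biases and Gaussian weights of standard deviation gain/√width; the function names, the width of 128 and the depth of 20 are illustrative choices, not taken from the lesson.

```python
import math
import random


def rms(values):
    """Root-mean-square of a list of floats."""
    return math.sqrt(sum(v * v for v in values) / len(values))


def final_activation_rms(gain, width=128, depth=20, seed=0):
    """Push one random input through `depth` linear layers.

    Each weight is drawn from N(0, (gain / sqrt(width))**2), so the expected
    variance of the signal is multiplied by gain**2 at every layer.
    """
    rng = random.Random(seed)
    std = gain / math.sqrt(width)
    x = [rng.gauss(0.0, 1.0) for _ in range(width)]  # input with RMS near 1
    for _ in range(depth):
        # One freshly sampled weight row per output unit.
        x = [sum(rng.gauss(0.0, std) * xj for xj in x) for _ in range(width)]
    return rms(x)


if __name__ == "__main__":
    for gain in (0.5, 1.0, 2.0):
        print(f"gain={gain}: final RMS = {final_activation_rms(gain):.3e}")
```

With a gain below 1 the signal shrinks geometrically with depth, with a gain above 1 it grows geometrically, and only a gain near 1 keeps it at a usable scale.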
What Tutorials Skip
What is still poorly explained in textbooks and papers:
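- The "1/√n" scaling (weight standard deviation 1/√n for a layer with fan-in n) keeps variance constant through layers: Var(output) ≈ Var(input)
- ReLU zeroes roughly half the pre-activations, which halves the signal's second moment, so He init uses 2× the weight variance to compensate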
- µP insight: the learning rate should scale with layer width (for hidden weight matrices trained with Adam, inversely: ∝ 1/n) to keep update magnitudes constant
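The first two points above can be verified with a small experiment. This is a sketch under stated assumptions: standard-normal pre-activations for the Monte Carlo estimate, and a bias-free ReLU stack with Gaussian weights of variance var_numerator/width. All names and sizes here are hypothetical.

```python
import math
import random


def relu(values):
    return [v if v > 0.0 else 0.0 for v in values]


def relu_second_moment_ratio(n_samples=100_000, seed=0):
    """Estimate E[relu(z)^2] / E[z^2] for z ~ N(0, 1); theory says 1/2."""
    rng = random.Random(seed)
    z = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
    return sum(v * v for v in relu(z)) / sum(v * v for v in z)


def last_preactivation_rms(var_numerator, width=128, depth=10, seed=0):
    """RMS of the final pre-activation in a ReLU stack.

    Weights are drawn from N(0, var_numerator / width):
    var_numerator=1 is the plain 1/n rule, var_numerator=2 is He init.
    """
    rng = random.Random(seed)
    std = math.sqrt(var_numerator / width)
    a = [rng.gauss(0.0, 1.0) for _ in range(width)]  # input with RMS near 1
    z = a
    for _ in range(depth):
        z = [sum(rng.gauss(0.0, std) * aj for aj in a) for _ in range(width)]
        a = relu(z)  # zeroes about half the units before the next layer
    return math.sqrt(sum(v * v for v in z) / width)


if __name__ == "__main__":
    print("ReLU second-moment ratio:", relu_second_moment_ratio())
    print("1/n rule, final RMS:", last_preactivation_rms(1.0))
    print("2/n rule, final RMS:", last_preactivation_rms(2.0))
```

Under the plain 1/n rule each ReLU layer loses about half of the second moment, so the signal decays with depth; the factor of 2 in He init cancels that loss.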
Interactive Visualization
Core Math (Optional Deep Dive)
If you want intuition first, start with the key equation and the visualization. Come back here for the full walkthrough.
Key Equation
Xavier/Glorot (for tanh/sigmoid):
W \sim \mathcal{N}(0, 2/(n_{in} + n_{out}))
He/Kaiming (for ReLU):
W \sim \mathcal{N}(0, 2/n_{in})
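The two schemes can be sketched as samplers, assuming the standard Gaussian forms Var(W) = 2/(n_in + n_out) for Xavier/Glorot and Var(W) = 2/n_in for He/Kaiming (the latter matches the key equation). The helper names and the list-of-lists weight layout are illustrative, not from any particular library.

```python
import math
import random


def xavier_normal(n_in, n_out, rng):
    """Glorot/Xavier normal: Var(W) = 2 / (n_in + n_out)."""
    std = math.sqrt(2.0 / (n_in + n_out))
    return [[rng.gauss(0.0, std) for _ in range(n_in)] for _ in range(n_out)]


def he_normal(n_in, n_out, rng):
    """He/Kaiming normal: Var(W) = 2 / n_in."""
    std = math.sqrt(2.0 / n_in)
    return [[rng.gauss(0.0, std) for _ in range(n_in)] for _ in range(n_out)]


def empirical_variance(matrix):
    """Variance of all entries, assuming a zero mean."""
    entries = [w for row in matrix for w in row]
    return sum(w * w for w in entries) / len(entries)


if __name__ == "__main__":
    rng = random.Random(0)
    n_in, n_out = 300, 100
    print("Xavier target:", 2.0 / (n_in + n_out),
          "measured:", empirical_variance(xavier_normal(n_in, n_out, rng)))
    print("He target:", 2.0 / n_in,
          "measured:", empirical_variance(he_normal(n_in, n_out, rng)))
```

Xavier averages fan-in and fan-out to balance the forward and backward passes, while He depends on fan-in only and carries the extra factor that offsets ReLU.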
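µP (Maximal Update Parameterization): Scales init AND learning rate by width. For hidden weight matrices trained with Adam, a commonly quoted form is:
\mathrm{Var}(W) \propto 1/n_{in}, \quad \eta \propto 1/n_{in}
This enables hyperparameter transfer: tune on a small model, then scale to a large one.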
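Why the learning rate has to shrink with width can be illustrated with a toy calculation. This sketch assumes a sign-based (Adam-like) step on a single output unit and a 1/width learning-rate rule for hidden layers; it is not the full µP recipe, which treats input and output layers differently. The names scaled_lr and preactivation_shift are hypothetical.

```python
import math
import random


def scaled_lr(base_lr, base_width, width):
    """Hidden-layer learning rate under the assumed 1/width rule."""
    return base_lr * base_width / width


def preactivation_shift(width, lr, seed=0):
    """Size of the change in one output unit after a sign-based step.

    Toy model: each weight moves by lr in the direction that increases the
    output, i.e. delta_w[j] = lr * sign(x[j]), so the output changes by
    lr * sum(|x[j]|), which grows linearly with width.
    """
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(width)]
    delta_w = [lr * math.copysign(1.0, xj) for xj in x]
    return abs(sum(dw * xj for dw, xj in zip(delta_w, x)))


if __name__ == "__main__":
    base_lr, base_width = 1e-3, 256  # pretend these were tuned on a proxy
    for width in (256, 1024, 4096):
        fixed = preactivation_shift(width, base_lr)
        scaled = preactivation_shift(width, scaled_lr(base_lr, base_width, width))
        print(f"width={width}: fixed lr shift={fixed:.4f}, scaled lr shift={scaled:.4f}")
```

With a fixed learning rate the shift in the output grows in proportion to width, whereas the 1/width rule keeps it roughly constant, which is the sense in which a rate tuned on the small proxy remains meaningful at the larger width.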