Legacy Concept Lab
Dropout: Stochastic Regularization
Classic regularizer that prevents co-adaptation of features
#52 · Dropout · Optimization
key equation
\tilde{h} = h \odot m, \quad m \sim \text{Bernoulli}(1-p)
Phase 10: Mathematical foundations & information geometry
Concept 52 of 100
Why It Matters for Modern Models
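- Dropout is a classic regularizer: randomly dropping units during training discourages features from co-adapting, that is, from being useful only in combination with specific other units
- Foundation for understanding other stochastic regularizers, such as DropPath and the closely related stochastic depth
- Modern LLMs often use little or no dropout, so knowing when it helps and when it hurts is practical knowledge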
What Tutorials Skip
What is still poorly explained in textbooks and papers:
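- Dropout injects multiplicative noise, so the noise scale grows with activation magnitude (with inverted dropout, the standard deviation is |h| \sqrt{p/(1-p)}); this implicitly favors features that stay useful when other units are missing
- Dropout rates can differ per layer: a higher rate in the final layers often helps
- At large scale with abundant data, dropout can hurt: data diversity already provides most of the regularization, so the ensemble benefit adds little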
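The first point above, that dropout noise scales with activation magnitude, can be checked numerically. The following is a minimal NumPy sketch, not a reference implementation: the helper name `inverted_dropout`, the drop rate `p = 0.5`, and the three example activations are assumptions chosen for illustration. It compares the empirical standard deviation of dropped-out activations with the prediction |h| \sqrt{p/(1-p)}.

```python
import numpy as np

def inverted_dropout(h, p, rng):
    """Zero each entry with probability p and rescale survivors by 1/(1-p).

    Hypothetical helper for illustration; assumes 0 <= p < 1.
    """
    mask = rng.random(h.shape) >= p   # True with probability 1 - p
    return h * mask / (1.0 - p)

rng = np.random.default_rng(0)
p = 0.5
h = np.array([0.1, 1.0, 10.0])        # activations of very different magnitude

# Apply many independent dropout masks to the same activation vector.
n_draws = 100_000
samples = inverted_dropout(np.tile(h, (n_draws, 1)), p, rng)

empirical_std = samples.std(axis=0)
predicted_std = np.abs(h) * np.sqrt(p / (1.0 - p))

print("empirical std:", empirical_std)
print("predicted std:", predicted_std)
```

The two printed vectors should agree closely, and both grow linearly with |h|: a large activation receives proportionally larger noise than a small one.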
Interactive Visualization
Core Math (Optional Deep Dive)
If you want intuition first, start with the key equation and the visualization. Come back here for the full walkthrough.
Key Equation
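Dropout randomly zeros activations during training, dropping each unit independently with probability p:
\tilde{h} = h \odot m, \quad m \sim \text{Bernoulli}(1-p)
At test time, use all units and scale by the keep probability 1-p so that the expected activation matches training:
h_{\text{test}} = (1-p) \, h
Or use inverted dropout (scale during training), which leaves the test-time forward pass unchanged:
\tilde{h} = \frac{h \odot m}{1-p}, \quad h_{\text{test}} = h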
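The two conventions described here (scaling at test time versus inverted dropout) can be sketched side by side. This is an illustrative NumPy sketch under stated assumptions: the function names `dropout_standard` and `dropout_inverted`, the `training` flag, and the rate `p = 0.2` are hypothetical and do not come from any particular library.

```python
import numpy as np

def dropout_standard(h, p, rng=None, training=True):
    """Original convention: mask in training, scale by keep probability at test time."""
    if training:
        mask = rng.random(h.shape) >= p   # keep each unit with probability 1 - p
        return h * mask
    return h * (1.0 - p)

def dropout_inverted(h, p, rng=None, training=True):
    """Inverted convention: mask and rescale in training, identity at test time.

    Assumes 0 <= p < 1 so that the division is well defined.
    """
    if training:
        mask = rng.random(h.shape) >= p
        return h * mask / (1.0 - p)
    return h

rng = np.random.default_rng(0)
p = 0.2
h = np.ones(200_000)   # constant activations make the averages easy to read

train_std = dropout_standard(h, p, rng, training=True)
test_std = dropout_standard(h, p, training=False)
train_inv = dropout_inverted(h, p, rng, training=True)
test_inv = dropout_inverted(h, p, training=False)

print("standard: train mean", train_std.mean(), "test mean", test_std.mean())
print("inverted: train mean", train_inv.mean(), "test mean", test_inv.mean())
```

In both conventions the training-time mean matches the test-time mean, which is the point of the scaling. The inverted form is usually more convenient because inference code needs no knowledge of p.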
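Dropout approximates ensemble averaging over sub-networks: each mask m selects one of up to 2^n weight-sharing sub-networks (for n droppable units), and the scaled test-time pass is a cheap approximation to averaging their predictions.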
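The ensemble view can be made concrete for the simplest case. The sketch below assumes a single linear unit with n = 4 inputs, hypothetical random values for `h` and `w`, and p = 0.3. It enumerates all 2^n masks, weights each sub-network's output by the probability of its mask, and compares the result with one pass through the full network using scaled activations. For a linear unit the two agree exactly; with nonlinearities between layers the scaled pass is only an approximation.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n, p = 4, 0.3
h = rng.normal(size=n)   # hypothetical input activations
w = rng.normal(size=n)   # hypothetical weights of one linear unit

# Average the output over every possible dropout mask (2**n sub-networks),
# weighting each mask by its Bernoulli probability.
ensemble_mean = 0.0
n_subnetworks = 0
for bits in itertools.product([0, 1], repeat=n):
    m = np.array(bits, dtype=float)
    prob = np.prod(np.where(m == 1, 1.0 - p, p))   # probability of this mask
    ensemble_mean += prob * (w @ (h * m))
    n_subnetworks += 1

# Single deterministic pass with activations scaled by the keep probability.
scaled_output = w @ (h * (1.0 - p))

print("sub-networks enumerated:", n_subnetworks)
print("ensemble mean:", ensemble_mean)
print("scaled pass:  ", scaled_output)
```

Enumeration is feasible only for tiny n, which is why the single scaled pass is used in practice.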