Legacy Concept Lab

Dropout: Stochastic Regularization

Classic regularizer that prevents co-adaptation of features

Concept 52 of 100 · Optimization · Phase 10



Phase 10: Mathematical foundations & information geometry

Why It Matters for Modern Models

Classic regularizer that prevents co-adaptation of features
Foundation for understanding other stochastic regularizers, such as DropPath and stochastic depth
Modern LLMs often use little or no dropout, so knowing when it helps and when it hurts is practical knowledge

What is still poorly explained in textbooks and papers:

Dropout noise is multiplicative, so its size scales with the activation magnitude; this implicitly favors robust features
Dropout rates can differ per layer: a higher rate in the final layers often helps
At large scale with lots of data, dropout can hurt: the ensemble benefit is dominated by data diversity

If you want intuition first, start with the key equation and the visualization. Come back here for the full walkthrough.

Key Equation

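Dropout randomly zeros activations during training. Each mask entry $m_i$ is drawn independently, so a unit is dropped with probability $p$ and kept with probability $1-p$: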

\tilde{h} = h \odot m, \quad m_i \sim \text{Bernoulli}(1-p)
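
A short derivation, using only the mask definition above, makes the mean and the noise scale explicit. Since $m_i \in \{0, 1\}$ with $P(m_i = 1) = 1-p$:

\mathbb{E}[\tilde{h}_i] = (1-p)\, h_i, \qquad \mathrm{Var}[\tilde{h}_i] = p(1-p)\, h_i^2

The standard deviation $|h_i| \sqrt{p(1-p)}$ grows linearly with the activation magnitude, and the mean shrinks by the keep probability $1-p$.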

At test time no units are dropped. Instead, scale the activations by the keep probability so they match their training-time expectation:

h_{\text{test}} = (1-p) \cdot h

Or use inverted dropout, which rescales during training so that $\mathbb{E}[\tilde{h}] = h$ and the test-time forward pass needs no scaling:

\tilde{h} = \frac{h \odot m}{1-p}
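
The two variants above can be compared with a minimal NumPy sketch. The function names `dropout_standard` and `dropout_inverted`, the example vector `h`, and the rate `p = 0.3` are illustrative assumptions, not part of any library. The sketch checks empirically that the training-time mean matches what each variant uses at test time.

```python
import numpy as np

def dropout_standard(h, p, rng, training=True):
    """Standard dropout: mask during training, scale by (1 - p) at test time."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    if training:
        mask = rng.random(h.shape) >= p  # each unit kept with probability 1 - p
        return h * mask
    return (1.0 - p) * h

def dropout_inverted(h, p, rng, training=True):
    """Inverted dropout: mask and rescale during training, identity at test time."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    if training:
        mask = rng.random(h.shape) >= p
        return h * mask / (1.0 - p)
    return h

rng = np.random.default_rng(0)
h = np.array([0.5, -1.0, 2.0, 3.0])  # hypothetical activations
p = 0.3                              # hypothetical drop probability

# Apply an independent mask to many copies of h, then average over the copies.
H = np.tile(h, (100_000, 1))
standard_train_mean = dropout_standard(H, p, rng).mean(axis=0)
inverted_train_mean = dropout_inverted(H, p, rng).mean(axis=0)

standard_test = dropout_standard(h, p, rng, training=False)
inverted_test = dropout_inverted(h, p, rng, training=False)

print("standard: train mean", standard_train_mean, "test", standard_test)
print("inverted: train mean", inverted_train_mean, "test", inverted_test)
```

In both cases the averaged training output should sit close to the test-time output; the only difference is where the factor $1-p$ is applied.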

Dropout approximates ensemble averaging over the $2^n$ weight-sharing sub-networks obtained by dropping subsets of the $n$ units.
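
For a very small layer, the ensemble can be enumerated directly. The sketch below is an illustration under stated assumptions: the helper `ensemble_average`, the weights `w`, the activations `h`, and `p = 0.3` are all hypothetical. It averages a single output unit over all $2^n$ masks, weighting each mask by its probability, and compares the result with the test-time scaling rule. For a linear output the two agree exactly; with a nonlinearity such as `tanh` the scaling rule is only an approximation.

```python
import itertools
import numpy as np

def ensemble_average(h, w, p, f=lambda z: z):
    """Average f(w . (h * m)) over all 2^n binary masks m, weighted by P(m)."""
    n = h.size
    total = 0.0
    for bits in itertools.product([0, 1], repeat=n):
        m = np.array(bits, dtype=float)
        kept = int(m.sum())
        prob = (1.0 - p) ** kept * p ** (n - kept)  # probability of this mask
        total += prob * f(np.dot(w, h * m))
    return total

h = np.array([0.5, -1.0, 2.0, 3.0])  # hypothetical activations
w = np.array([1.0, 0.2, -0.7, 0.4])  # hypothetical output weights
p = 0.3                              # hypothetical drop probability

# Linear output: the exact ensemble mean equals the (1 - p)-scaled forward pass.
linear_ensemble = ensemble_average(h, w, p)
linear_scaled = np.dot(w, (1.0 - p) * h)

# Nonlinear output: the scaled forward pass only approximates the ensemble mean.
tanh_ensemble = ensemble_average(h, w, p, f=np.tanh)
tanh_scaled = np.tanh(np.dot(w, (1.0 - p) * h))

print("linear:", linear_ensemble, "vs scaled:", linear_scaled)
print("tanh:  ", tanh_ensemble, "vs scaled:", tanh_scaled)
```

The enumeration costs $2^n$ forward passes, which is why the single scaled pass is used in practice.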

Srivastava et al., 2014, JMLR
