Legacy Concept Lab

Grokking: Delayed Generalization


Concept 45 of 100 · Theory · Phase 8: Scaling, theory & multimodal




Why It Matters for Modern Models

Challenges classical learning theory: generalization can happen AFTER interpolation, not before
Connects to double descent and neural scaling: more training compute can unlock generalization that looks out of reach from the early loss curves
Explains why small algorithmic tasks sometimes need surprisingly long training
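Modular addition is a commonly used example of the kind of small algorithmic task meant here. The sketch below builds such a dataset and splits it into train and test sets. The function name, the modulus 97, the 50% split, and the seed are illustrative assumptions, not details taken from this page.

```python
import random

def modular_addition_split(p=97, train_fraction=0.5, seed=0):
    """Build every (a, b, (a + b) % p) triple and split into train/test.

    p, train_fraction and seed are illustrative choices for this sketch.
    """
    examples = [(a, b, (a + b) % p) for a in range(p) for b in range(p)]
    rng = random.Random(seed)
    rng.shuffle(examples)  # shuffle in place so the split is random but reproducible
    n_train = int(train_fraction * len(examples))
    return examples[:n_train], examples[n_train:]

train, test = modular_addition_split(p=97, train_fraction=0.5)
print(len(train), len(test))
```

Because the full table has only p × p entries, a model can memorize the training half outright, which is what makes the later jump in test performance visible.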

What is still poorly explained in textbooks and papers:

Grokking suggests that memorization and generalization rely on different circuits, and that training eventually finds the generalizing one
Weight decay is key: it slowly shrinks the memorizing solution until the generalizing solution wins
Phase transitions in loss curves suggest discrete "algorithm discovery" rather than smooth learning


Key Equation

$$\text{Test loss}(t) \xrightarrow{\text{grokking}} \text{Train loss}(t) \text{ at } t \gg t_{\text{memorization}}$$

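Grokking is the phenomenon where a model suddenly generalizes long after it has memorized the training data.

Training dynamics show three phases: the training loss first falls to near zero while the test loss stays high (memorization), both curves then change little for a long stretch (plateau), and finally the test loss drops sharply toward the training loss (generalization).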
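As a rough illustration, logged loss curves can be labeled step by step by comparing train and test loss against a threshold. The label names, the threshold of 0.1, and the loss values below are assumptions made up for this sketch, not measured results.

```python
def label_phases(train_loss, test_loss, threshold=0.1):
    """Label each logged step by comparing both losses to a threshold.

    'fitting'     : train loss still above the threshold
    'memorized'   : train loss at or below it, test loss still above it
    'generalized' : both losses at or below it
    """
    labels = []
    for tr, te in zip(train_loss, test_loss):
        if tr > threshold:
            labels.append("fitting")
        elif te > threshold:
            labels.append("memorized")
        else:
            labels.append("generalized")
    return labels

def first_index(labels, name):
    """Index of the first step carrying the given label, or None."""
    return labels.index(name) if name in labels else None

# Synthetic curves, invented for illustration only.
train_loss = [2.0, 0.5, 0.05, 0.01, 0.01, 0.01, 0.01, 0.01]
test_loss = [2.0, 1.8, 1.9, 2.0, 1.9, 1.5, 0.08, 0.02]

labels = label_phases(train_loss, test_loss)
t_mem = first_index(labels, "memorized")
t_grok = first_index(labels, "generalized")
print(labels)
print("memorized at step", t_mem, "generalized at step", t_grok)
```

The gap between `t_grok` and `t_mem` is a simple proxy for how delayed the generalization is. In a real run the steps would be far apart and a single fixed threshold may be too crude.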

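The transition happens at a critical point where the generalization gap scales with the squared parameter norm relative to the training-set size (reading $\theta$ as the model parameters and $n$ as the number of training examples):

$$\text{Generalization gap} \approx O\left(\frac{\|\theta\|^2}{n}\right)$$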
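To make the scaling concrete, the quantity inside the bound can be computed directly. This is a minimal sketch that assumes $\theta$ is a flat list of parameter values and $n$ is the training-set size. The function name and the toy numbers are hypothetical.

```python
def norm_over_n(params, n_train):
    """Return ||theta||^2 / n for a flat list of parameter values."""
    squared_norm = sum(w * w for w in params)
    return squared_norm / n_train

theta = [3.0, -4.0]    # toy parameters with ||theta||^2 = 25
smaller = [1.5, -2.0]  # same direction, half the norm

print(norm_over_n(theta, n_train=100))    # 0.25
print(norm_over_n(smaller, n_train=100))  # 0.0625
```

Halving the parameter norm quarters the quantity, and doubling the number of training examples halves it. This shows only how the expression scales. It does not estimate the actual gap for any model.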

Weight decay strength $\lambda$ controls when grokking occurs: stronger decay tends to produce earlier grokking.
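The direction of this effect can be seen in a deliberately simplified toy: apply only the decoupled weight decay update $w \leftarrow (1 - \eta\lambda)\,w$ and count the steps until a weight shrinks to a fraction of its starting size. The sketch ignores the loss gradient entirely, so it says nothing about real grokking times. The function name, learning rate, and target ratio are assumptions.

```python
def steps_to_shrink(lr, weight_decay, target_ratio=0.1):
    """Count updates w <- (1 - lr * weight_decay) * w until |w| falls to
    target_ratio of its starting value. The loss gradient is ignored."""
    factor = 1.0 - lr * weight_decay
    if not 0.0 < factor < 1.0:
        raise ValueError("need 0 < lr * weight_decay < 1")
    w, steps = 1.0, 0
    while abs(w) > target_ratio:
        w *= factor
        steps += 1
    return steps

# Larger lambda means fewer steps to reach the same shrinkage.
for lam in (0.01, 0.1, 1.0):
    print(lam, steps_to_shrink(lr=1e-3, weight_decay=lam))
```

In this toy the step count scales roughly as $1/(\eta\lambda)$, which is one way to read the claim that stronger decay brings the transition forward.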

Power et al., 2022, ICLR
