Legacy Concept Lab

Grokking: Delayed Generalization

Challenges classical learning theory: generalization can happen AFTER interpolation, not before

Concept 45 of 100 · Theory · Phase 8: Scaling, theory & multimodal

Key equation: $\text{Test loss}(t) \xrightarrow{\text{grokking}} \text{Train loss}(t) \text{ at } t \gg t_{\text{memorization}}$

Why It Matters for Modern Models

  • Challenges classical learning theory: generalization can happen AFTER interpolation, not before
  • Connects to double descent and neural scaling: more compute can unlock generalization that appears "impossible"
  • Explains why small algorithmic tasks sometimes need surprisingly long training
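As a concrete picture of a "small algorithmic task", here is a minimal sketch of a modular-addition dataset, in the spirit of the small algorithmic datasets named in the canonical paper's title. The function name `modular_addition_split`, the modulus `p=97`, and the 50% training fraction are illustrative assumptions, not values taken from this page.

```python
import random


def modular_addition_split(p=97, train_fraction=0.5, seed=0):
    """Enumerate every (a, b, (a + b) % p) triple and split it into train/test.

    p, train_fraction, and seed are illustrative choices (assumptions).
    """
    examples = [(a, b, (a + b) % p) for a in range(p) for b in range(p)]
    rng = random.Random(seed)
    rng.shuffle(examples)  # random split, so test pairs are never seen in training
    cut = int(train_fraction * len(examples))
    return examples[:cut], examples[cut:]


train, test = modular_addition_split()
print(len(train), len(test))  # the full table has p * p = 9409 entries
```

The whole input space is only p² pairs, so a model can memorize the training half quickly. Whether it also fills in the held-out half is the generalization question that grokking is about.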

What Tutorials Skip

What is still poorly explained in textbooks and papers:

  • Grokking suggests that memorization and generalization rely on different circuits inside the network; with enough training, optimization eventually settles on the generalizing one
  • Weight decay is key: it slowly shrinks the weights of the memorizing solution until the generalizing solution wins
  • Abrupt phase transitions in the test-loss curve suggest discrete "algorithm discovery" rather than smooth, incremental learning

Interactive Visualization

Core Math (Optional Deep Dive)

If you want intuition first, start with the key equation and the visualization. Come back here for the full walkthrough.

Key Equation
$\text{Test loss}(t) \xrightarrow{\text{grokking}} \text{Train loss}(t) \text{ at } t \gg t_{\text{memorization}}$

Grokking is the phenomenon where a model suddenly generalizes long after it has memorized the training data.

Training dynamics show three phases:

  1. Memorization: Training loss → 0, test loss high
  2. Plateau: Both losses stable for many steps
  3. Grokking: Test loss suddenly drops to match training
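The three phases above can be read off logged loss curves. The sketch below is one hedged way to do it: it finds the first step where training loss falls below a threshold (memorization) and the first step where test loss does (grokking), then reports the gap between them. The helper names, the threshold of 0.1, and the synthetic curves are all assumptions made for illustration.

```python
def first_step_below(losses, threshold):
    """Return the index of the first logged loss below threshold, or None."""
    for step, loss in enumerate(losses):
        if loss < threshold:
            return step
    return None


def grokking_delay(train_losses, test_losses, threshold=0.1):
    """Return (t_memorization, t_generalization, delay), or None if either is missing."""
    t_mem = first_step_below(train_losses, threshold)
    t_gen = first_step_below(test_losses, threshold)
    if t_mem is None or t_gen is None:
        return None
    return t_mem, t_gen, t_gen - t_mem


# Synthetic curves (made up for illustration): train loss collapses early,
# test loss stays high through a plateau, then drops late.
train_curve = [2.0, 0.5, 0.05, 0.01, 0.01, 0.01, 0.01, 0.01]
test_curve = [2.0, 1.9, 2.1, 2.2, 2.2, 2.1, 0.4, 0.02]

print(grokking_delay(train_curve, test_curve))  # (2, 7, 5)
```

On real runs the indices would be logging steps rather than raw optimizer steps, and a delay that is large relative to the memorization step is the signature the key equation describes.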

The transition happens at a critical point where:

$\text{Generalization gap} \approx O\left(\frac{\|\theta\|^2}{n}\right)$

Weight decay strength $\lambda$ controls when grokking occurs: stronger decay tends to produce earlier grokking.
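To see why a larger $\lambda$ can shorten the wait, consider an idealized case: with decoupled weight decay (AdamW-style, an assumption here), a weight that receives no gradient is multiplied by $(1 - \eta\lambda)$ at every step, where $\eta$ is the learning rate. The sketch below computes how many steps that takes to shrink such a weight by a given factor. It ignores gradients entirely, so it illustrates a timescale rather than predicting grokking time; the function name and the example values are hypothetical.

```python
import math


def steps_to_shrink(lr, weight_decay, factor=2.0):
    """Steps for a gradient-free weight to shrink by `factor` under decoupled decay.

    Each step multiplies the weight by (1 - lr * weight_decay), so after t steps
    it is scaled by (1 - lr * weight_decay) ** t.
    """
    per_step = 1.0 - lr * weight_decay
    if not 0.0 < per_step < 1.0:
        raise ValueError("need 0 < lr * weight_decay < 1")
    return math.ceil(math.log(factor) / -math.log(per_step))


# Illustrative values only: the same learning rate with three decay strengths.
for wd in (0.01, 0.1, 1.0):
    print(wd, steps_to_shrink(lr=1e-3, weight_decay=wd))
```

For small $\eta\lambda$ the step count is roughly $\ln(\text{factor}) / (\eta\lambda)$, so a tenfold increase in decay cuts this timescale roughly tenfold.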

Canonical Papers

Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets

Power et al., 2022, ICLR
