Legacy Concept Lab
Grokking: Delayed Generalization
Challenges classical learning theory: generalization can happen AFTER interpolation, not before
#45 Grokking Theory
key equation
\text{Test loss}(t) \xrightarrow{\text{grokking}} \text{Train loss}(t) \text{ at } t \gg t_{\text{memorization}}
Phase 8: Scaling, theory & multimodal
Concept 45 of 100
Why It Matters for Modern Models
- Challenges classical learning theory: generalization can happen AFTER interpolation, not before
- Connects to double descent and neural scaling: more compute can unlock generalization that appears "impossible"
- Explains why small algorithmic tasks sometimes need surprisingly long training
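As a concrete picture of a "small algorithmic task", here is a minimal sketch of a modular addition dataset, the kind of task commonly used in grokking experiments. The function name `make_modular_addition_split`, the modulus `p=97`, and the 30% training fraction are illustrative assumptions, not values taken from this page.

```python
import random

def make_modular_addition_split(p=97, train_fraction=0.3, seed=0):
    """Build every (a, b) -> (a + b) mod p example and split them at random.

    Hypothetical helper for illustration; p and train_fraction are assumed values.
    """
    examples = [((a, b), (a + b) % p) for a in range(p) for b in range(p)]
    rng = random.Random(seed)
    rng.shuffle(examples)
    n_train = int(train_fraction * len(examples))
    return examples[:n_train], examples[n_train:]

train, test = make_modular_addition_split()
print(len(train), len(test))
```

The full table has only p * p entries, so a network can memorize the training split quickly. Whether it has learned the rule shows up only on the held-out pairs.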
What Tutorials Skip
What is still poorly explained in textbooks and papers:
- Grokking suggests that memorization and generalization rely on different circuits: training fits the memorizing one first and only later settles on the generalizing one
- Weight decay is key: it slowly shrinks the weights of the memorizing solution until the generalizing solution wins
- Phase transitions in loss curves suggest discrete "algorithm discovery" rather than smooth learning
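The weight decay point can be illustrated with a deliberately tiny toy, sketched below under strong assumptions. It is a two-weight linear model, not a neural network, and `run_toy` and all constants are made up for illustration. The single training example constrains only `w1`, so a large stray `w2` fits the training point perfectly while hurting the test point. Weight decay is the only force acting on `w2`.

```python
def run_toy(lr=0.1, weight_decay=0.01, steps=10000):
    """Gradient descent with weight decay on a 2-weight linear model (toy analogy)."""
    # One training example that constrains only w1; w2 is invisible to the training loss.
    x_train, y_train = (1.0, 0.0), 1.0
    # Test example generated by the "true" weights (1, 0); it does depend on w2.
    x_test, y_test = (1.0, 1.0), 1.0
    w = [1.0, 5.0]  # fits the training point exactly, but carries a large stray w2
    history = []
    for step in range(steps + 1):
        train_err = w[0] * x_train[0] + w[1] * x_train[1] - y_train
        test_err = w[0] * x_test[0] + w[1] * x_test[1] - y_test
        history.append((step, train_err ** 2, test_err ** 2))
        # Gradient of the squared training error, plus the weight decay term.
        grad = [2 * train_err * x_train[0], 2 * train_err * x_train[1]]
        w = [wi - lr * (gi + weight_decay * wi) for wi, gi in zip(w, grad)]
    return w, history

w, history = run_toy()
for step, train_loss, test_loss in history[::2000]:
    print(step, train_loss, test_loss)
```

In this toy the test loss falls smoothly rather than abruptly, so it captures only the "decay removes what the training data does not constrain" part of the story, not the sharp transition seen in real grokking runs.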
Interactive Visualization
Core Math (Optional Deep Dive)
If you want intuition first, start with the key equation and the visualization. Come back here for the full walkthrough.
Key Equation
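Grokking is the phenomenon where models suddenly generalize long after memorizing training data:
\text{Test loss}(t) \xrightarrow{\text{grokking}} \text{Train loss}(t) \text{ at } t \gg t_{\text{memorization}}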
Training dynamics show three phases:
- Memorization: Training loss → 0, test loss high
- Plateau: Both losses stable for many steps
- Grokking: Test loss suddenly drops to match training loss
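The three phases can be read off logged loss curves with a simple rule of thumb. The sketch below is one possible heuristic, not a standard definition. The helper names, the thresholds `fit_tol` and `gap_tol`, and the sample curves are all hypothetical. A step is labeled "fitting" while training loss is still falling toward zero, "memorized" while training loss is near zero but test loss stays high (the plateau), and "generalized" once the two match.

```python
def label_phases(train_loss, test_loss, fit_tol=0.01, gap_tol=0.05):
    """Label each logged step as 'fitting', 'memorized', or 'generalized'."""
    labels = []
    for tr, te in zip(train_loss, test_loss):
        if tr > fit_tol:
            labels.append("fitting")       # training loss not yet near zero
        elif te - tr > gap_tol:
            labels.append("memorized")     # train fits, test still far behind
        else:
            labels.append("generalized")   # test loss has caught up with train
    return labels

def grokking_delay(steps, labels):
    """Steps between the first 'memorized' and first 'generalized' entry, or None."""
    if "memorized" not in labels or "generalized" not in labels:
        return None
    t_mem = steps[labels.index("memorized")]
    t_grok = steps[labels.index("generalized")]
    return t_grok - t_mem if t_grok > t_mem else None

# Made-up curves shaped like a grokking run (not real measurements).
steps = [0, 100, 200, 1000, 5000, 10000, 20000]
train_curve = [2.0, 0.5, 0.005, 0.001, 0.001, 0.001, 0.001]
test_curve = [2.0, 1.8, 2.5, 2.6, 2.5, 0.3, 0.002]

labels = label_phases(train_curve, test_curve)
print(labels)
print(grokking_delay(steps, labels))
```

The delay returned here corresponds to the gap between t_memorization and the much later point where test loss meets training loss in the key equation.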
The transition happens at a critical point, and weight decay strength controls when that point is reached: stronger decay generally means earlier grokking, provided the model can still fit the training data.
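One back-of-the-envelope way to see why decay strength sets the timescale is sketched below. It assumes an idealized regime where the training gradient is zero, so each step simply multiplies the weights by (1 - lr * weight_decay). The function `steps_to_shrink` and the learning rate are illustrative assumptions, and real optimizers such as AdamW will deviate from this idealization.

```python
import math

def steps_to_shrink(lr, weight_decay, factor=0.1):
    """Steps for weight decay alone to scale the weights down by `factor`.

    Idealized: assumes zero training gradient, so every step multiplies
    the weights by (1 - lr * weight_decay).
    """
    shrink = 1.0 - lr * weight_decay
    if not 0.0 < shrink < 1.0:
        raise ValueError("need 0 < lr * weight_decay < 1")
    return math.ceil(math.log(factor) / math.log(shrink))

# Sweep a few hypothetical decay strengths at a fixed learning rate.
for wd in (0.01, 0.1, 1.0):
    print(wd, steps_to_shrink(lr=1e-3, weight_decay=wd))
```

The step count scales roughly as 1 / (lr * weight_decay), so a tenfold increase in decay cuts the shrinking time by about a factor of ten in this simplified picture.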