Legacy Concept Lab
Model Collapse & Synthetic Data
Synthetic data is essential AND dangerous
#99 · Collapse · Theory
key equation
p_{t+1} = (1-\alpha)p_* + \alpha p_{\theta_t} \rightarrow \text{collapse}
Phase 13: Cutting-edge 2024-2025 research
Concept 99 of 100
Why It Matters for Modern Models
- Synthetic data is both essential and dangerous: recursive training on it can cause collapse
- The web is increasingly AI-generated, so scraped training data is polluted with model outputs
- Dataset provenance becomes critical for safety
What Tutorials Skip
What is still poorly explained in textbooks and papers:
- Recursive training is like photocopying a photocopy: quality degrades with each generation
- Model collapse: diversity shrinks and the tails of the distribution disappear
- Mitigation: always anchor training with real data
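The photocopy effect can be reproduced in a toy setting. The sketch below is not from the source: it assumes the "model" is a one-dimensional Gaussian fitted by maximum likelihood, and that each generation trains only on samples drawn from the previous generation. The function name `recursive_fit` and the sample size, generation count, and seed are illustrative choices.

```python
import random
import statistics

def recursive_fit(generations=200, sample_size=10, seed=0):
    """Refit a Gaussian to its own samples, generation after generation.

    Returns the fitted standard deviation at every generation,
    starting with the real distribution at index 0.
    """
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0  # generation 0: the "real" distribution (assumed)
    history = [sigma]
    for _ in range(generations):
        # Draw training data only from the current model (no real data).
        samples = [rng.gauss(mu, sigma) for _ in range(sample_size)]
        # "Train" the next model: maximum-likelihood Gaussian fit.
        mu = statistics.fmean(samples)
        sigma = statistics.pstdev(samples)
        history.append(sigma)
    return history

history = recursive_fit()
print(f"std at generation 0:   {history[0]:.3g}")
print(f"std at generation {len(history) - 1}: {history[-1]:.3g}")
```

In this toy setup the fitted standard deviation should shrink by orders of magnitude, which is the "tails disappear" behavior in miniature. The exact numbers depend on the seed and the sample size, and larger samples slow the shrinkage.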
Interactive Visualization
Core Math (Optional Deep Dive)
If you want intuition first, start with the key equation and the visualization. Come back here for the full walkthrough.
Key Equation
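Recursive training on synthetic data causes collapse:
p_{t+1} = (1-\alpha)p_* + \alpha p_{\theta_t} \rightarrow \text{collapse}
Here p_* is the real data distribution, p_{\theta_t} is the model after generation t, and \alpha is the fraction of synthetic data in the next training set.
Repeated application drives the model away from the real distribution (degeneration).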
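One way to see where the drift comes from is to unroll the update p_{t+1} = (1-\alpha)p_* + \alpha p_{\theta_t}. The derivation below is a sketch under an added assumption that is not in the source: the model fitted at generation t equals its training distribution plus an error term, p_{\theta_t} = p_t + \varepsilon_t, where \varepsilon_t is a hypothetical symbol for finite-sample and approximation error.

```latex
\begin{aligned}
p_{t+1} - p_* &= (1-\alpha)p_* + \alpha\,(p_t + \varepsilon_t) - p_* \\
              &= \alpha\,(p_t - p_*) + \alpha\,\varepsilon_t \\[4pt]
\Rightarrow\quad
p_t - p_* &= \alpha^{t}\,(p_0 - p_*) + \sum_{s=0}^{t-1} \alpha^{\,t-s}\,\varepsilon_s
\end{aligned}
```

Under this assumption, with \alpha = 1 (purely synthetic training) every past error enters with weight 1 and the errors accumulate, while with \alpha < 1 older errors are discounted geometrically. That is consistent with the advice to anchor training with real data.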
Mitigation: anchor with real data, filter synthetic outputs, track provenance.
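The anchoring idea can be sketched with the same mixing weight as the key equation. This is a toy illustration, not the source's method: it assumes a one-dimensional Gaussian "model" fitted by maximum likelihood, and draws each training example from the current model with probability alpha and from the real distribution otherwise. The name `final_std` and all parameter values are hypothetical.

```python
import random
import statistics

def final_std(alpha, generations=200, sample_size=10, seed=0):
    """Fitted std after repeated training on a real/synthetic mixture.

    alpha is the synthetic fraction, mirroring
    p_{t+1} = (1 - alpha) p_* + alpha p_theta_t.
    """
    rng = random.Random(seed)
    real_mu, real_sigma = 0.0, 1.0   # p_*: fixed real distribution (assumed)
    mu, sigma = real_mu, real_sigma  # the first model matches the real data
    for _ in range(generations):
        samples = []
        for _ in range(sample_size):
            if rng.random() < alpha:
                samples.append(rng.gauss(mu, sigma))            # synthetic draw
            else:
                samples.append(rng.gauss(real_mu, real_sigma))  # real draw
        # "Train" the next model on the mixed data.
        mu = statistics.fmean(samples)
        sigma = statistics.pstdev(samples)
    return sigma

for alpha in (1.0, 0.5):
    print(f"alpha={alpha}: final std = {final_std(alpha):.3g}")
```

With alpha = 1.0 the fitted spread should decay toward zero, while with alpha = 0.5 the real samples keep pulling it back toward the real distribution. Filtering synthetic outputs and tracking provenance are not modeled here; the sketch covers only the mixing term.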