Legacy Concept Lab

Model Collapse & Synthetic Data

Synthetic data is essential AND dangerous

Phase 13: Cutting-edge 2024-2025 research (Concept 99 of 100, Theory)
Key equation: $p_{t+1} = (1-\alpha)\,p_* + \alpha\,p_{\theta_t} \rightarrow \text{collapse}$

Why It Matters for Modern Models

  • Synthetic data is essential AND dangerous
  • The web is increasingly AI-generated, so scraped training data is increasingly polluted with model outputs
  • Dataset provenance becomes critical for safety

What Tutorials Skip

What is still poorly explained in textbooks and papers:

  • Like photocopying a photocopy: quality degrades
  • Mode collapse: diversity shrinks, tails disappear
  • Solution: always anchor training with real data
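
The "tails disappear" point can be made concrete with a toy experiment. The sketch below is only an illustration under stated assumptions: the 50-category Zipf-like "real" distribution, the sample size, and the helper name `resample_generations` are all hypothetical choices, and "training" is reduced to re-estimating category frequencies from the previous generation's samples. A category that draws zero samples in one generation gets probability zero and can never come back, so the number of surviving categories can only shrink.

```python
import random
from collections import Counter


def resample_generations(probs, n_samples, generations, seed=0):
    """Repeatedly re-estimate a categorical distribution from its own samples.

    Returns the number of categories with nonzero probability at the start
    and after each generation.
    """
    rng = random.Random(seed)
    categories = list(range(len(probs)))
    support_sizes = [sum(p > 0 for p in probs)]
    for _ in range(generations):
        # "Generate" synthetic data from the current model.
        draws = rng.choices(categories, weights=probs, k=n_samples)
        counts = Counter(draws)
        # "Train" the next model: empirical frequencies of the synthetic data.
        probs = [counts[c] / n_samples for c in categories]
        support_sizes.append(sum(p > 0 for p in probs))
    return support_sizes


# Hypothetical "real" distribution: a few common categories and a long tail.
weights = [1.0 / (rank + 1) for rank in range(50)]
total = sum(weights)
real_probs = [w / total for w in weights]

sizes = resample_generations(real_probs, n_samples=200, generations=30)
print("categories alive at start:", sizes[0])
print("categories alive at end:  ", sizes[-1])
```

The rare categories are the first to vanish because they are the most likely to draw zero samples in any given generation, which is the photocopy analogy in miniature.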

Interactive Visualization

Core Math (Optional Deep Dive)

If you want intuition first, start with the key equation and the visualization. Come back here for the full walkthrough.

Key Equation
$$p_{t+1} = (1-\alpha)\,p_* + \alpha\,p_{\theta_t} \rightarrow \text{collapse}$$

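Recursive training on synthetic data causes collapse. Let $p_*$ be the real data distribution, $p_{\theta_t}$ the model after generation $t$, $\alpha \in [0, 1]$ the synthetic fraction of the training mix, and $F$ the training procedure that maps a data distribution to fitted parameters:

$$p_{t+1} = (1 - \alpha)\,p_* + \alpha\,p_{\theta_t}$$
$$\theta_{t+1} = F(p_{t+1})$$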

Repeated application drives $p_{\theta_t}$ away from the real distribution $p_*$ (degeneration).
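
A worked special case shows where the drift comes from. The setup below is an assumption made for illustration, not something the source specifies: the model is a one-dimensional Gaussian, no real data is mixed in ($\alpha = 1$), and $F$ is the maximum-likelihood fit to $n$ samples drawn from the previous model.

```latex
% Assumptions (illustrative, not from the source):
%   p_{\theta_t} = \mathcal{N}(\mu_t, \sigma_t^2), \alpha = 1, and F is the
%   maximum-likelihood fit to n i.i.d. samples x_1, \dots, x_n \sim p_{\theta_t}.
\begin{align*}
\sigma_{t+1}^2 &= \frac{1}{n}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2,
  \qquad \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \\
\mathbb{E}\!\left[\sigma_{t+1}^2 \mid \sigma_t^2\right] &= \frac{n-1}{n}\,\sigma_t^2 \\
\mathbb{E}\!\left[\sigma_t^2\right] &= \left(1 - \frac{1}{n}\right)^{t}\sigma_0^2
  \;\xrightarrow[t \to \infty]{}\; 0
\end{align*}
```

Each generation loses a factor of $(n-1)/n$ of its variance in expectation, so the losses compound: with $n = 100$, the expected variance after 200 generations is $0.99^{200} \approx 0.13$ of its starting value.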

Mitigation: anchor with real data, filter synthetic outputs, track provenance.
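
The effect of anchoring with real data can be sketched with a small simulation. Everything here is a simplifying assumption rather than the method of any particular paper: the real distribution is a standard normal, "training" is a maximum-likelihood Gaussian fit, and the function name `simulate` and its default parameters are hypothetical. Each generation trains on a mix of `alpha` synthetic samples from the previous model and `1 - alpha` fresh real samples.

```python
import random
import statistics


def simulate(alpha, generations=500, n_samples=20, seed=0):
    """Track the fitted standard deviation across generations.

    Each generation's training set has a fraction `alpha` of samples from the
    previous model and a fraction `1 - alpha` from the real distribution N(0, 1).
    """
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0  # generation-0 model matches the real data
    n_synth = round(alpha * n_samples)
    n_real = n_samples - n_synth
    history = [sigma]
    for _ in range(generations):
        synthetic = [rng.gauss(mu, sigma) for _ in range(n_synth)]
        real = [rng.gauss(0.0, 1.0) for _ in range(n_real)]
        data = synthetic + real
        # Maximum-likelihood Gaussian fit (population standard deviation).
        mu = statistics.fmean(data)
        sigma = statistics.pstdev(data)
        history.append(sigma)
    return history


collapsed = simulate(alpha=1.0)  # purely synthetic training data
anchored = simulate(alpha=0.5)   # half real data in every generation
print(f"alpha=1.0: final sigma = {collapsed[-1]:.3g}")
print(f"alpha=0.5: final sigma = {anchored[-1]:.3g}")
```

Under these assumptions the purely synthetic run should shrink toward zero spread, while the anchored run is pulled back toward the real distribution in every generation. Filtering and provenance tracking are not modeled here; they would act on which samples are allowed into `synthetic` in the first place.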

Canonical Papers

AI models collapse when trained on recursively generated data

Shumailov et al., 2024, Nature
