Legacy Concept Lab

Synthetic Data & Self-Improvement

Key to modern training: Phi, LLaMA 3, and many other models rely on synthetic data.

Concept 77 of 100 | Scaling & Alignment | Phase 11: Frontier research & scaling

Why It Matters for Modern Models

What is still poorly explained in textbooks and papers:

Models can teach themselves, provided their outputs are filtered for correct answers.
Synthetic data mostly amplifies capabilities that a model (or its teacher, in distillation) already has.
The "data wall" problem: high-quality internet text is running out, and synthetic data is a proposed way around it.

If you want intuition first, start with the key equation and the visualization. Come back here for the full walkthrough.

Key Equation

$$\theta_{t+1} = \text{Train}(\theta_t, \text{Generate}(\theta_t, \text{Filter}))$$

Synthetic data generation:

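Generate candidates: $y \sim p_\theta(y \mid x)$
Filter for quality: $\{(x_i, y_i) : V(y_i) > \tau\}$, where $V$ is a quality score (for example from a verifier or reward model) and $\tau$ is the acceptance threshold
Train on filtered data: $\theta' = \arg\min_\theta \mathcal{L}(\theta; \mathcal{D}_{\text{synth}})$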
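The generate and filter steps above can be sketched on a toy task. This is a minimal illustration, not a real pipeline: the noisy adder standing in for $p_\theta$, the exact-match `verifier`, and names such as `build_synthetic_dataset` are all assumptions made for the example. The verifier here also takes the prompt $x$, which the notation above leaves implicit.

```python
import random

def generate(problem, error_rate, rng):
    """Toy stand-in for sampling y ~ p_theta(y | x): a noisy adder (assumption)."""
    a, b = problem
    answer = a + b
    if rng.random() < error_rate:
        answer += rng.choice([-2, -1, 1, 2])  # simulate an incorrect sample
    return answer

def verifier(problem, answer):
    """Toy V: returns 1.0 if the answer is the true sum, else 0.0 (assumption)."""
    a, b = problem
    return 1.0 if answer == a + b else 0.0

def build_synthetic_dataset(problems, error_rate, tau, samples_per_problem, seed=0):
    """Sample candidates for every prompt and keep those with V > tau."""
    rng = random.Random(seed)
    kept = []
    for x in problems:
        for _ in range(samples_per_problem):
            y = generate(x, error_rate, rng)
            if verifier(x, y) > tau:
                kept.append((x, y))
    return kept

problems = [(a, b) for a in range(5) for b in range(5)]
dataset = build_synthetic_dataset(
    problems, error_rate=0.4, tau=0.5, samples_per_problem=4
)
print(f"kept {len(dataset)} of {len(problems) * 4} candidates")
```

The kept pairs form $\mathcal{D}_{\text{synth}}$. In a real setting the verifier is rarely this exact, so the threshold $\tau$ trades dataset size against label noise.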

Self-improvement loop:

$$\theta_{t+1} = \text{Train}(\theta_t, \text{Generate}(\theta_t, \text{Filter}))$$
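To make the loop concrete, here is a small sketch in which $\theta$ is just a table of answer weights per prompt. Everything in it is an assumption for illustration: the tabular "model", the oracle filter that compares against the known sum, and the helper names `generate`, `train`, `accuracy`, and `self_improve`. A real system would replace the oracle with unit tests, a reward model, or another imperfect verifier.

```python
import random

def generate(theta, x, rng):
    """Sample y ~ p_theta(y | x) from a table of unnormalized weights."""
    answers = list(theta[x].keys())
    weights = list(theta[x].values())
    return rng.choices(answers, weights=weights, k=1)[0]

def train(theta, dataset):
    """Toy 'Train': add one unit of weight to every kept (x, y) pair."""
    new_theta = {x: dict(w) for x, w in theta.items()}
    for x, y in dataset:
        new_theta[x][y] += 1.0
    return new_theta

def accuracy(theta, truth):
    """Average probability the model assigns to the correct answer."""
    probs = [theta[x][truth[x]] / sum(theta[x].values()) for x in theta]
    return sum(probs) / len(probs)

def self_improve(theta, truth, rounds, samples_per_prompt, seed=0):
    """Run theta_{t+1} = Train(theta_t, filtered samples from theta_t)."""
    rng = random.Random(seed)
    history = [accuracy(theta, truth)]
    for _ in range(rounds):
        candidates = [
            (x, generate(theta, x, rng))
            for x in theta
            for _ in range(samples_per_prompt)
        ]
        # Filter: an oracle check here (assumption); only correct pairs survive.
        kept = [(x, y) for x, y in candidates if y == truth[x]]
        theta = train(theta, kept)
        history.append(accuracy(theta, truth))
    return theta, history

truth = {(a, b): a + b for a in range(4) for b in range(4)}
theta0 = {x: {s: 1.0 for s in range(7)} for x in truth}  # uniform over sums 0..6
theta_final, history = self_improve(theta0, truth, rounds=5, samples_per_prompt=8)
print([round(h, 3) for h in history])
```

Because only verified samples add weight, the probability of the correct answer cannot decrease from round to round in this toy. With a noisy filter that guarantee disappears, which is why filter quality matters so much in practice.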

Phi-1 insight (Gunasekar et al., 2023): a small model trained on high-quality, textbook-style synthetic data can rival much larger models trained on raw web data.

Quality filtering: use reward models, verifiers, or consistency checks.
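Of the three options, a consistency check is the easiest to sketch without an external model. The example below is one possible reading of that idea: sample several answers per prompt, take the majority answer, and keep it only if the agreement rate exceeds a threshold. The function name `consistency_filter`, the threshold value, and the sample data are hypothetical.

```python
from collections import Counter

def consistency_filter(samples_by_prompt, tau):
    """Keep (prompt, majority_answer, agreement) when agreement > tau."""
    kept = []
    for prompt, answers in samples_by_prompt.items():
        if not answers:
            continue  # nothing was sampled for this prompt
        top_answer, votes = Counter(answers).most_common(1)[0]
        agreement = votes / len(answers)
        if agreement > tau:
            kept.append((prompt, top_answer, agreement))
    return kept

# Hypothetical samples: four generations per prompt.
samples = {
    "2 + 2": ["4", "4", "4", "5"],
    "17 * 3": ["51", "41", "53", "51"],
    "capital of France": ["Paris", "Paris", "Paris", "Paris"],
}
kept = consistency_filter(samples, tau=0.6)
for prompt, answer, agreement in kept:
    print(prompt, "->", answer, f"(agreement {agreement:.2f})")
```

Agreement is only a proxy for correctness: a model that is consistently wrong passes this filter, so it is often combined with a verifier or reward model where one is available.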

Gunasekar et al., 2023, arXiv

Wang et al., 2023, ACL
