Legacy Concept Lab
Synthetic Data & Self-Improvement
Central to modern training: Phi, Llama 3, and many other models rely on synthetic data
#77 | Synth Data | Scaling & Alignment
key equation
\theta_{t+1} = \text{Train}(\theta_t, \text{Generate}(\theta_t, \text{Filter}))
Phase 11: Frontier research & scaling
Concept 77 of 100
Why It Matters for Modern Models
- Central to modern training: Phi, Llama 3, and many other models rely on synthetic data
- Enables large-scale training with little or no human annotation
- Data quality > quantity: careful curation beats raw scale
What Tutorials Skip
What is still poorly explained in textbooks and papers:
- Models can teach themselves if we filter for correct answers
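- Synthetic data mostly amplifies capabilities the generating model already has; when a stronger model generates the data, it works as a form of distillation
- The "data wall" problem: high-quality internet text is running out, and synthetic data is one proposed way around it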
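The first point, filtering a model's own outputs for correct answers, can be sketched with a toy example. This is a minimal illustration under stated assumptions, not a real training pipeline: `noisy_solver` is a hypothetical stand-in for a model that is right only some of the time, and `verifier` is an exact checker that exists here only because addition is trivial to verify.

```python
import random

def noisy_solver(a, b, rng, accuracy=0.6):
    """Hypothetical stand-in for a model: adds a and b, but is wrong part of the time."""
    if rng.random() < accuracy:
        return a + b
    return a + b + rng.choice([-2, -1, 1, 2])

def verifier(a, b, answer):
    """Exact checker (an assumption: real tasks rarely have one this cheap)."""
    return answer == a + b

def build_filtered_dataset(n_problems=200, samples_per_problem=4, seed=0):
    rng = random.Random(seed)
    kept, total = [], 0
    for _ in range(n_problems):
        a, b = rng.randint(0, 99), rng.randint(0, 99)
        for _ in range(samples_per_problem):
            answer = noisy_solver(a, b, rng)
            total += 1
            if verifier(a, b, answer):
                kept.append((f"{a} + {b} = ?", answer))
                break  # keep at most one verified answer per problem
    return kept, total

dataset, total = build_filtered_dataset()
print(f"kept {len(dataset)} verified examples from {total} samples")
```

Every example that survives the filter is correct even though the generator is not, which is why the filtered set can be a better training signal than the raw samples.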
Interactive Visualization
Core Math (Optional Deep Dive)
If you want intuition first, start with the key equation and the visualization. Come back here for the full walkthrough.
Key Equation
Synthetic data generation:
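- Generate candidates: D_t = \text{Generate}(\theta_t)
- Filter for quality: \tilde{D}_t = \text{Filter}(D_t)
- Train on filtered data: \theta_{t+1} = \text{Train}(\theta_t, \tilde{D}_t)
Self-improvement loop: \theta_{t+1} = \text{Train}(\theta_t, \text{Generate}(\theta_t, \text{Filter}))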
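The generate, filter, train loop can be simulated with a deliberately tiny model. In this sketch the whole model is one number, `p_correct` (standing in for \theta_t), and the update rule in `train` is an illustrative assumption rather than gradient descent. The verifier is assumed to be imperfect: it keeps every correct sample but also lets a fraction of wrong ones through.

```python
import random

def generate(p_correct, n, rng):
    """Sample n answers; True marks a correct one. p_correct plays the role of theta_t."""
    return [rng.random() < p_correct for _ in range(n)]

def filter_samples(samples, rng, false_positive_rate=0.1):
    """Imperfect verifier: keeps all correct samples and some wrong ones."""
    return [s for s in samples if s or rng.random() < false_positive_rate]

def train(p_correct, kept, lr=0.5):
    """Toy update: move theta toward the fraction of correct answers in the kept set."""
    if not kept:
        return p_correct
    purity = sum(kept) / len(kept)
    return (1 - lr) * p_correct + lr * purity

def self_improve(p0=0.3, rounds=8, n=2000, seed=0):
    rng = random.Random(seed)
    history = [p0]
    for _ in range(rounds):
        kept = filter_samples(generate(history[-1], n, rng), rng)
        history.append(train(history[-1], kept))
    return history

for t, p in enumerate(self_improve()):
    print(f"round {t}: p_correct = {p:.3f}")
```

In this toy setting accuracy rises because the filtered set is purer than the model's raw output. Starting from `p0=0.0` it never improves, since the filter has no correct samples to keep, which matches the point that synthetic data amplifies capabilities the model already has.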
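Phi-1 insight: a small model trained on high-quality ("textbook-quality") filtered and synthetic data can outperform larger models trained on raw web data, as phi-1 showed on code benchmarks.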
Quality filtering: use reward models, verifiers, or consistency checks.
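Of the three filtering options, a consistency check is the easiest to sketch without a trained reward model. The example below is one possible form, majority agreement across several sampled answers per prompt; the function name `consistency_filter`, the `min_agreement` threshold, and the sample data are all illustrative assumptions.

```python
from collections import Counter

def consistency_filter(samples_by_prompt, min_agreement=0.6):
    """Keep (prompt, answer) pairs whose most common answer reaches min_agreement."""
    kept = []
    for prompt, answers in samples_by_prompt.items():
        if not answers:
            continue
        answer, count = Counter(answers).most_common(1)[0]
        if count / len(answers) >= min_agreement:
            kept.append((prompt, answer))
    return kept

# Hypothetical sampled answers: five generations per prompt.
samples = {
    "12 * 12": ["144", "144", "144", "124", "144"],  # 4 of 5 agree: kept
    "17 * 23": ["391", "381", "401", "391", "371"],  # best answer has 2 of 5: dropped
}
print(consistency_filter(samples))  # [('12 * 12', '144')]
```

Agreement is only a proxy for correctness: a model that is consistently wrong passes this filter, so it is usually combined with a verifier or reward model where one is available.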