Legacy Concept Lab

Synthetic Data & Self-Improvement

Key to modern training: Phi, Llama 3, and many other models use synthetic data

Concept 77 of 100 | Scaling & Alignment | Phase 11: Frontier research & scaling

Why It Matters for Modern Models

  • Key to modern training: Phi, Llama 3, and many other models use synthetic data
  • Enables training without human annotation at scale
  • Data quality > quantity: careful curation beats raw scale

What Tutorials Skip

What is still poorly explained in textbooks and papers:

  • Models can teach themselves if we filter their outputs for correct answers
  • Synthetic data amplifies capabilities the generating model already has; when the generator is a stronger teacher model, this is a form of distillation
  • The "data wall" problem: high-quality internet text is running out, and synthetic data is one proposed way around it

Interactive Visualization

Core Math (Optional Deep Dive)

If you want intuition first, start with the key equation and the visualization. Come back here for the full walkthrough.

Key Equation
$$\theta_{t+1} = \text{Train}(\theta_t, \text{Generate}(\theta_t, \text{Filter}))$$

Synthetic data generation:

  1. Generate candidates: $y \sim p_\theta(y \mid x)$
  2. Filter for quality: $\{(x_i, y_i) : V(y_i) > \tau\}$
  3. Train on filtered data: $\theta' = \arg\min_\theta \mathcal{L}(\theta; \mathcal{D}_{\text{synth}})$
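The three steps above can be sketched on a toy problem. This is a minimal illustration under stated assumptions, not how any production pipeline works: the "model" is a hypothetical tabular policy (one answer distribution per addition prompt), `verifier_score` is a made-up exact-match verifier that also sees the prompt, and "training" uses the fact that for a tabular model the negative log-likelihood minimiser is simply the empirical answer frequency. All function and variable names are illustrative.

```python
import random
from collections import Counter, defaultdict


def generate(policy, prompts, n_samples, rng):
    """Step 1: draw candidates y ~ p_theta(y | x) from a tabular policy."""
    pairs = []
    for x in prompts:
        answers, weights = zip(*sorted(policy[x].items()))
        for y in rng.choices(answers, weights=weights, k=n_samples):
            pairs.append((x, y))
    return pairs


def verifier_score(x, y):
    """Hypothetical verifier V: 1.0 if y equals the true sum of x, else 0.0."""
    a, b = x
    return 1.0 if y == a + b else 0.0


def filter_pairs(pairs, tau=0.5):
    """Step 2: keep only the pairs whose verifier score exceeds tau."""
    return [(x, y) for x, y in pairs if verifier_score(x, y) > tau]


def train(policy, data):
    """Step 3: for a tabular model, the negative log-likelihood minimiser
    is the empirical answer frequency per prompt."""
    counts = defaultdict(Counter)
    for x, y in data:
        counts[x][y] += 1
    new_policy = dict(policy)  # prompts with no surviving samples are unchanged
    for x, c in counts.items():
        total = sum(c.values())
        new_policy[x] = {y: n / total for y, n in c.items()}
    return new_policy


# Toy starting policy: each addition prompt has a noisy answer distribution.
prompts = [(2, 3), (4, 1)]
policy = {(a, b): {a + b - 1: 0.3, a + b: 0.3, a + b + 1: 0.4} for a, b in prompts}

rng = random.Random(0)
candidates = generate(policy, prompts, n_samples=50, rng=rng)
d_synth = filter_pairs(candidates, tau=0.5)
new_policy = train(policy, d_synth)
print(new_policy)
```

Because the toy verifier is perfect, every surviving pair is correct and the retrained policy concentrates on the verified answers. Real verifiers are noisy, so real filtered sets still contain some errors.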

Self-improvement loop:

$$\theta_{t+1} = \text{Train}(\theta_t, \text{Generate}(\theta_t, \text{Filter}))$$
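To get a feel for how this recursion behaves over several rounds, here is a deliberately simplified expected-value sketch. It assumes the whole model can be summarised by a single accuracy `p` (the probability that a generated answer is correct), that the filter is a verifier with a true-positive rate `tpr` and false-positive rate `fpr`, and that each training round moves the model a fraction `lr` of the way toward the precision of the filtered data. None of these names or simplifications come from the papers listed below; they are illustrative assumptions only.

```python
def filtered_precision(p, tpr, fpr):
    """Fraction of accepted samples that are correct when accuracy is p."""
    accepted_correct = p * tpr
    accepted_wrong = (1.0 - p) * fpr
    total = accepted_correct + accepted_wrong
    if total == 0.0:
        return p  # nothing passes the filter, so there is nothing to learn from
    return accepted_correct / total


def self_improve(p0, rounds, tpr=0.9, fpr=0.2, lr=0.5):
    """Iterate p_{t+1} = (1 - lr) * p_t + lr * precision(p_t)."""
    history = [p0]
    p = p0
    for _ in range(rounds):
        p = (1.0 - lr) * p + lr * filtered_precision(p, tpr, fpr)
        history.append(p)
    return history


improving = self_improve(0.3, rounds=10)
stalled = self_improve(0.3, rounds=10, tpr=0.5, fpr=0.5)
print([round(p, 3) for p in improving])
print([round(p, 3) for p in stalled])
```

In this toy model, accuracy rises each round as long as the verifier accepts correct answers more often than incorrect ones (`tpr > fpr`). When the verifier is uninformative (`tpr == fpr`), the filtered data is no better than the model's own output and the loop stalls.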

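Phi-1 insight: a small model trained on high-quality data (filtered "textbook quality" web text plus synthetic textbooks and exercises generated by a larger model) can match or beat much larger models trained on raw web data, as shown on code benchmarks such as HumanEval.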

Quality filtering: use reward models, verifiers, or consistency checks.
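Of the three filter types, a consistency check is the easiest to sketch without any extra model. The example below assumes several answers have already been sampled per prompt and keeps a prompt only when the most common answer reaches a hypothetical agreement threshold, `min_agreement`. The function name, threshold, and sample data are illustrative assumptions.

```python
from collections import Counter


def consistency_filter(samples_by_prompt, min_agreement=0.6):
    """Keep the majority answer for each prompt whose samples agree enough."""
    kept = {}
    for prompt, answers in samples_by_prompt.items():
        if not answers:
            continue  # no samples, nothing to vote on
        answer, votes = Counter(answers).most_common(1)[0]
        if votes / len(answers) >= min_agreement:
            kept[prompt] = answer
    return kept


# Made-up samples: the first prompt is answered consistently, the second is not.
samples = {
    "2 + 3": ["5", "5", "5", "6", "5"],
    "7 * 8": ["56", "54", "58", "56", "65"],
}
kept = consistency_filter(samples, min_agreement=0.6)
print(kept)
```

Agreement is only a proxy for correctness: a model that is consistently wrong passes this filter, which is why it is often paired with a verifier or reward model where one is available.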

Canonical Papers

Textbooks Are All You Need

Gunasekar et al., 2023, arXiv

Self-Instruct: Aligning Language Models with Self-Generated Instructions

Wang et al., 2023, ACL
