Legacy Concept Lab
Knowledge Distillation: Learning from Teachers
How small models are made from large ones: DistilBERT is 40% smaller and 60% faster than BERT, with 97% of its performance
#62 Distillation Efficiency
key equation
\mathcal{L} = \alpha \mathcal{L}_{KL}(S(x), T(x)) + (1-\alpha) \mathcal{L}_{CE}(S(x), y)
Phase 6: Modern efficiency & inference
Concept 62 of 100
Why It Matters for Modern Models
- Distillation is how small models are made from large ones: DistilBERT is 40% smaller and 60% faster than BERT while retaining 97% of its language-understanding performance
- Soft targets contain "dark knowledge": the teacher's uncertainty across similar classes
- Modern LLM training uses distillation: smaller models are trained on the outputs of larger models
What Tutorials Skip
What is still poorly explained in textbooks and papers:
- Hard labels say "cat, not dog"; soft labels say "mostly cat, a bit dog, definitely not car"
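- Temperature τ controls how much dark knowledge transfers: τ = 1 is the ordinary softmax, larger τ moves probability onto the non-target classes, and τ→∞ gives a uniform distribution (no information)
- Distillation works even when the student has a different architecture than the teacher, because the loss only compares output distributions; the two models just need to predict over the same set of classes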
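The difference between hard and soft labels is easiest to see with numbers. The following is a minimal sketch in plain Python; the function name `softmax_with_temperature`, the three classes, and the teacher logits are made up for illustration and do not come from any particular model.

```python
import math

def softmax_with_temperature(logits, tau=1.0):
    """Return softmax(logits / tau) as a list of probabilities."""
    scaled = [z / tau for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical teacher logits for the classes (cat, dog, car).
classes = ["cat", "dog", "car"]
teacher_logits = [5.0, 3.0, -2.0]

# Hard label: "cat, not dog". It carries no similarity information.
hard_label = [1.0, 0.0, 0.0]
print("hard ", hard_label)

# Soft labels at increasing temperature.
for tau in (1.0, 4.0, 100.0):
    probs = softmax_with_temperature(teacher_logits, tau)
    print("tau =", tau, [round(p, 3) for p in probs])
```

At τ = 1 almost all of the mass sits on "cat"; at a moderate temperature "dog" receives a visibly larger share than "car", which is the similarity signal the student learns from; at a very large temperature the three probabilities are nearly equal and the ranking is barely distinguishable.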
Interactive Visualization
Core Math (Optional Deep Dive)
If you want intuition first, start with the key equation and the visualization. Come back here for the full walkthrough.
Key Equation
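Train a student S to match a teacher T's soft predictions:
\mathcal{L} = \alpha \mathcal{L}_{KL}(S(x), T(x)) + (1-\alpha) \mathcal{L}_{CE}(S(x), y)
Here y is the ground-truth label, and α weights the soft-target (KL) term against the hard-label (cross-entropy) term.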
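The combined loss can be sketched in a few lines of plain Python for a single example. This is an illustration under stated assumptions, not a reference implementation: the KL term is taken as KL(teacher ‖ student), which is the common convention but is not fixed by the notation above; both distributions are softened with the same temperature; and the names `distillation_loss`, `alpha`, and `tau` and their default values are hypothetical.

```python
import math

def softmax(logits, tau=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / tau for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, label, alpha=0.5, tau=2.0):
    """alpha * KL(teacher || student) + (1 - alpha) * CE(student, label).

    Assumption: the KL term uses distributions softened at temperature tau,
    while the cross-entropy term uses the student's ordinary (tau = 1) softmax.
    """
    p_teacher = softmax(teacher_logits, tau)
    p_student_soft = softmax(student_logits, tau)
    kl = sum(
        pt * (math.log(pt) - math.log(ps))
        for pt, ps in zip(p_teacher, p_student_soft)
        if pt > 0.0  # terms with zero teacher probability contribute nothing
    )
    p_student = softmax(student_logits, 1.0)
    ce = -math.log(p_student[label])  # label is the index of the true class
    return alpha * kl + (1.0 - alpha) * ce

# Usage with made-up logits for three classes; the true class is index 0.
loss = distillation_loss([2.0, 1.0, 0.1], [5.0, 3.0, -2.0], label=0)
print(loss)
```

Some implementations also multiply the KL term by τ² so that its gradient magnitude stays comparable as the temperature changes; that factor is left out here to match the equation as written.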
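Use temperature τ to soften teacher outputs:
p_i = \frac{\exp(z_i / \tau)}{\sum_j \exp(z_j / \tau)}
where z_i are the teacher's logits.
Higher τ → more uniform distribution → more information about relative similarities, up to a point: as τ→∞ the distribution becomes uniform and carries no information.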
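Why the temperature behaves this way follows from rewriting the softened softmax. The short derivation below assumes K classes, logits z_1, …, z_K, and, for the low-temperature limit, a unique largest logit; the notation p_i(τ) is introduced here for illustration.

```latex
p_i(\tau) = \frac{\exp(z_i/\tau)}{\sum_{j=1}^{K} \exp(z_j/\tau)}
          = \frac{1}{\sum_{j=1}^{K} \exp\big((z_j - z_i)/\tau\big)}

% High temperature: every exponent goes to 0, so the sum goes to K.
\tau \to \infty:\quad \frac{z_j - z_i}{\tau} \to 0
  \;\Rightarrow\; p_i(\tau) \to \frac{1}{K}

% Low temperature (unique maximum assumed): only the largest logit survives.
\tau \to 0^{+}:\quad p_i(\tau) \to
  \begin{cases}
    1 & \text{if } i = \arg\max_j z_j \\
    0 & \text{otherwise}
  \end{cases}
```

Between these two limits the differences z_j − z_i still shape the distribution, which is where the information about relative similarities lives.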