Legacy Concept Lab

Knowledge Distillation: Learning from Teachers

How small models are obtained from large ones: DistilBERT is 40% smaller and 60% faster than BERT while retaining 97% of its language-understanding performance.


Why It Matters for Modern Models

  • Distillation is how small models are obtained from large ones: DistilBERT is 40% smaller and 60% faster than BERT while retaining 97% of its performance
  • Soft targets contain "dark knowledge": the teacher's uncertainty about which classes resemble each other
  • Modern LLM training uses distillation: smaller models are trained on the outputs of larger models

What Tutorials Skip

What is still poorly explained in textbooks and papers:

  • Hard labels say "cat, not dog"; soft labels say "mostly cat, a bit dog, definitely not car"
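  • Temperature τ controls how much dark knowledge transfers: τ = 1 is the ordinary softmax, larger τ flattens the distribution and exposes the small probabilities on non-target classes, and τ→∞ gives a uniform distribution (no information)
  • Distillation works even when the student has a different architecture from the teacher, because the loss compares only output distributions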
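
The hard-versus-soft distinction and the effect of τ can be seen on a single example. The sketch below is illustrative only: the class names and the teacher logits are made-up values, and `softmax_with_temperature` and `entropy` are helper names introduced here, not part of any library.

```python
import math

def softmax_with_temperature(logits, tau=1.0):
    """Softmax of logits / tau, stabilised by subtracting the max."""
    scaled = [z / tau for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; higher means closer to uniform."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# Hypothetical teacher logits for three classes (assumed values).
classes = ["cat", "dog", "car"]
teacher_logits = [5.0, 3.0, -2.0]

# The hard label only says "cat"; it carries no similarity information.
hard_label = [1.0, 0.0, 0.0]
print("hard label:", dict(zip(classes, hard_label)))

# Soft labels at increasing temperature.
for tau in (1.0, 4.0, 100.0):
    probs = softmax_with_temperature(teacher_logits, tau)
    row = ", ".join(f"{c}={p:.3f}" for c, p in zip(classes, probs))
    print(f"tau={tau:>5}: {row}  entropy={entropy(probs):.3f}")
```

At a moderate temperature the ordering cat > dog > car is preserved while "dog" receives visibly more mass than "car". At a very large temperature all three probabilities approach 1/3 and the ordering is barely distinguishable.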

Interactive Visualization

Core Math (Optional Deep Dive)

If you want intuition first, start with the key equation and the visualization. Come back here for the full walkthrough.

Key Equation
$$\mathcal{L} = \alpha \mathcal{L}_{KL}(S(x), T(x)) + (1-\alpha) \mathcal{L}_{CE}(S(x), y)$$

Train a student $S$ to match a teacher $T$'s soft predictions while still fitting the ground-truth labels $y$:

$$\mathcal{L} = (1-\alpha) \mathcal{L}_{CE}(S(x), y) + \alpha \mathcal{L}_{KL}(S(x), T(x))$$
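
The two terms are not spelled out on this page. As a sketch, assuming the usual definitions and writing $p^S = S(x)$ and $p^T = T(x)$ for the student and teacher class distributions (the direction of the KL term, teacher relative to student, is an assumption here):

$$\mathcal{L}_{KL}(S(x), T(x)) = \sum_i p_i^T \log \frac{p_i^T}{p_i^S}, \qquad \mathcal{L}_{CE}(S(x), y) = -\log p_y^S$$

Under these definitions, $\alpha = 0$ reduces to ordinary supervised training on hard labels, and $\alpha = 1$ trains the student purely to imitate the teacher.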

Use temperature $\tau$ to soften teacher outputs:

$$p_i^T = \frac{\exp(z_i^T / \tau)}{\sum_j \exp(z_j^T / \tau)}$$

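Raising $\tau$ above 1 flattens the distribution, so the small probabilities that encode relative similarities between classes become visible to the student. More is not always better: as $\tau \to \infty$ the distribution approaches uniform and carries no information, while as $\tau \to 0$ it collapses to a one-hot vector on the teacher's top class.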
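
The combined loss with temperature can be sketched in a few lines of plain Python. This is a minimal illustration for a single example, not a reference implementation: `log_softmax` and `distillation_loss` are names introduced here, the logits are made-up values, the KL direction (teacher relative to student) is an assumption, and the $\tau^2$ factor on the soft term follows the convention suggested by Hinton et al. to keep gradient magnitudes comparable; it does not appear in the key equation above. In practice this would be written with a tensor library and batched.

```python
import math

def log_softmax(logits, tau=1.0):
    """Log of softmax(logits / tau), computed stably."""
    scaled = [z / tau for z in logits]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - log_z for s in scaled]

def distillation_loss(student_logits, teacher_logits, label,
                      alpha=0.5, tau=2.0):
    """alpha * tau^2 * KL(teacher || student) + (1 - alpha) * CE(student, label)."""
    # Soft-target term: both distributions are softened with the same tau.
    log_p_t = log_softmax(teacher_logits, tau)
    log_p_s = log_softmax(student_logits, tau)
    kl = sum(math.exp(lt) * (lt - ls) for lt, ls in zip(log_p_t, log_p_s))
    # Hard-label term: ordinary cross-entropy at tau = 1.
    ce = -log_softmax(student_logits, 1.0)[label]
    # tau**2 scaling is a convention (assumption), not part of the key equation.
    return alpha * (tau ** 2) * kl + (1.0 - alpha) * ce

# Hypothetical logits over (cat, dog, car); the true class is index 0.
teacher = [5.0, 3.0, -2.0]
good_student = [4.0, 2.5, -1.5]   # roughly agrees with the teacher
bad_student = [-1.0, 0.5, 4.0]    # confidently wrong

print("good student loss:", distillation_loss(good_student, teacher, label=0))
print("bad student loss: ", distillation_loss(bad_student, teacher, label=0))
```

A student whose logits equal the teacher's has zero KL term, so its loss reduces to the weighted cross-entropy alone. Setting `alpha=0.0` ignores the teacher entirely.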

Canonical Papers

Distilling the Knowledge in a Neural Network

Hinton, Vinyals, Dean · 2015 · NIPS Workshop

DistilBERT, a distilled version of BERT

Sanh et al. · 2019 · arXiv
