Legacy Concept Lab

Knowledge Distillation: Learning from Teachers

How you get small models from large ones: DistilBERT is 40% smaller and 60% faster than BERT while retaining 97% of its performance

Concept 62 of 100 | Phase 6: Modern efficiency & inference

Why It Matters for Modern Models

Distillation is how you get small models from large ones: DistilBERT is 40% smaller and 60% faster than BERT while retaining 97% of its performance
Soft targets carry "dark knowledge": the teacher's uncertainty across similar classes
Modern LLM training uses distillation: smaller models are trained on the outputs of larger models

What is still poorly explained in textbooks and papers:

Hard labels say "cat, not dog"; soft labels say "mostly cat, a bit dog, definitely not car"
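Temperature τ controls how much dark knowledge transfers: τ→0 collapses the soft labels to a one-hot hard label, τ→∞ flattens them to uniform (no information), and intermediate values expose the teacher's relative class similarities
Distillation works even when the student has a different architecture from the teacher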
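
The effect of temperature on soft labels can be sketched in a few lines of NumPy. This is only an illustration: the logits for (cat, dog, car) are made-up values, and `softmax_with_temperature` is a hypothetical helper name, not part of any library.

```python
import numpy as np

def softmax_with_temperature(logits, tau=1.0):
    """Return softmax(logits / tau), computed in a numerically stable way."""
    z = np.asarray(logits, dtype=float) / tau
    z = z - z.max()          # subtracting the max does not change the result
    e = np.exp(z)
    return e / e.sum()

# Hypothetical teacher logits for the classes (cat, dog, car).
teacher_logits = [5.0, 3.0, -2.0]

for tau in (0.1, 1.0, 4.0, 100.0):
    probs = softmax_with_temperature(teacher_logits, tau)
    print(f"tau={tau:>5}: {np.round(probs, 3)}")
```

With these values, a very small τ puts almost all mass on "cat" (effectively a hard label), τ = 1 gives "mostly cat, a bit dog, almost no car", and a very large τ approaches the uniform distribution, where the ranking of classes is barely visible.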

If you want intuition first, start with the key equation and the visualization. Come back here for the full walkthrough.

Key Equation

\mathcal{L} = \alpha \mathcal{L}_{KL}(S(x), T(x)) + (1-\alpha) \mathcal{L}_{CE}(S(x), y)

Train a student $S$ to match a teacher $T$'s soft predictions while still fitting the true labels $y$:

\mathcal{L} = (1-\alpha) \mathcal{L}_{CE}(S(x), y) + \alpha \mathcal{L}_{KL}(S(x), T(x))

Use temperature $\tau$ to soften the teacher's outputs, where $z_i^T$ is the teacher's logit for class $i$:

p_i^T = \frac{\exp(z_i^T / \tau)}{\sum_j \exp(z_j^T / \tau)}

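A higher $\tau$ makes the distribution more uniform, which raises the probabilities of non-target classes and exposes their relative similarities. In the limit $\tau \to \infty$ the distribution is exactly uniform and carries no information, so useful values of $\tau$ are moderate.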
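
The combined loss can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the KL term is taken as KL(teacher || student) with both distributions softened at the same $\tau$, and it is multiplied by $\tau^2$ following the convention in Hinton et al. Neither detail is spelled out in the equation above, and the names `log_softmax` and `distillation_loss` are hypothetical.

```python
import numpy as np

def log_softmax(logits, tau=1.0):
    """Row-wise log-softmax of logits / tau, computed stably."""
    z = np.asarray(logits, dtype=float) / tau
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def distillation_loss(student_logits, teacher_logits, labels, alpha=0.5, tau=2.0):
    """alpha * KL(teacher || student) at temperature tau + (1 - alpha) * CE on hard labels."""
    labels = np.asarray(labels)
    log_p_s = log_softmax(student_logits, tau)   # softened student
    log_p_t = log_softmax(teacher_logits, tau)   # softened teacher
    p_t = np.exp(log_p_t)
    kl = (p_t * (log_p_t - log_p_s)).sum(axis=-1).mean()
    kl = kl * tau ** 2   # assumption: tau^2 scaling, as in Hinton et al.
    # Cross-entropy against the true labels uses the unsoftened student (tau = 1).
    log_p_hard = log_softmax(student_logits, 1.0)
    ce = -log_p_hard[np.arange(len(labels)), labels].mean()
    return alpha * kl + (1 - alpha) * ce

# Made-up batch of two examples over three classes.
teacher = np.array([[5.0, 3.0, -2.0], [0.0, 1.0, 4.0]])
student = np.array([[2.0, 1.0, 0.0], [0.5, 0.5, 2.0]])
y = np.array([0, 2])
print(distillation_loss(student, teacher, y, alpha=0.5, tau=2.0))
```

Setting `alpha=0` recovers ordinary supervised training on hard labels, while `alpha=1` trains the student purely to imitate the teacher.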

Hinton, Vinyals, Dean (2015), NIPS Workshop

Sanh et al. (2019), arXiv
