Legacy Concept Lab

Label Smoothing & Soft Targets

Used in most vision models and LLMs—simple trick with consistent improvements

Concept 60 of 100 | Phase 3: Optimization & generalization

Why It Matters for Modern Models

Widely used in vision models and LLMs; it is a small change to the targets that gives consistent improvements
Discourages overconfidence, which improves calibration and sometimes generalization
Knowledge distillation uses the same idea: train on soft targets, produced there by a teacher model instead of a uniform mix
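To illustrate the distillation bullet, here is a minimal sketch of training against a teacher's soft targets. The function names, the example logits, and the use of a softmax temperature are assumptions made for illustration; the text above only says that the targets come from a teacher model.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax with a temperature; a higher temperature gives softer targets."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def soft_target_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy of the student against the teacher's softened distribution."""
    targets = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(targets, student))


# Made-up logits for a 4-class problem (hypothetical values).
teacher_logits = [1.0, 0.5, 4.0, 0.2]
student_logits = [0.2, 0.1, 1.5, 0.0]

targets = softmax(teacher_logits, temperature=2.0)
loss = soft_target_loss(student_logits, teacher_logits)
print(targets)  # every class gets nonzero mass; class 2 gets the most
print(loss)
```

Unlike label smoothing, the mass given to the wrong classes here is not uniform: it reflects which classes the teacher considers similar.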

What is still poorly explained in textbooks and papers:

Hard targets say "100% sure it's class 3", but that is almost never true in real data
Label smoothing implicitly regularizes: the loss is minimized at a finite logit gap, so the model has no incentive to drive logits to ±∞
Connects to calibration: smoothed models tend to give more honest uncertainty estimates
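The second point can be made concrete with a short derivation sketch. It assumes a softmax output over $K$ classes and smoothing strength $\alpha$ as in the key equation; the symbols $z$ (logits) and $p$ (softmax probabilities) are introduced here for illustration. Cross-entropy against a fixed target distribution is minimized when the prediction equals that target, so at the optimum

$$p_{\text{true}} = 1 - \alpha + \frac{\alpha}{K}, \qquad p_{\text{other}} = \frac{\alpha}{K}$$

and, since softmax probabilities are proportional to exponentiated logits, the gap between the true-class logit and any other logit is

$$z_{\text{true}} - z_{\text{other}} = \log \frac{p_{\text{true}}}{p_{\text{other}}} = \log\left( \frac{(1 - \alpha) K}{\alpha} + 1 \right)$$

For $\alpha = 0.1$ and $K = 4$ this gives $\log 37 \approx 3.61$. As $\alpha \to 0$ (hard targets) the optimal gap grows without bound, which is the overconfidence that smoothing removes.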

Key Equation

Instead of hard one-hot targets such as $y = [0, 0, 1, 0]$, use soft targets:

$$y_{smooth} = (1 - \alpha) y + \frac{\alpha}{K}$$

where $\alpha \in [0, 1]$ is the smoothing strength and $K$ is the number of classes.

For $\alpha = 0.1$ and $K = 4$ classes, the true class gets $0.9 + 0.025 = 0.925$ and every other class gets $0.025$: $[0.025, 0.025, 0.925, 0.025]$
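The same computation can be sketched in a few lines of plain Python. The function name `smooth_labels` is a hypothetical helper chosen for this example, not part of any particular library.

```python
def smooth_labels(one_hot, alpha):
    """Mix a one-hot target with the uniform distribution over K classes."""
    k = len(one_hot)
    return [(1.0 - alpha) * y + alpha / k for y in one_hot]


hard = [0.0, 0.0, 1.0, 0.0]
soft = smooth_labels(hard, alpha=0.1)
print([round(v, 3) for v in soft])  # [0.025, 0.025, 0.925, 0.025]
print(sum(soft))                    # still a valid distribution (sums to 1 up to rounding)
```

Setting `alpha=0.0` recovers the hard targets, and `alpha=1.0` gives the uniform distribution.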

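Effect on cross-entropy: because cross-entropy is linear in the target, the smoothed loss splits into two terms:

$$\mathcal{L}_{smooth} = (1-\alpha) \mathcal{L}_{CE}(p, y) + \alpha \mathcal{L}_{CE}(p, u)$$

where $p$ is the model's predicted distribution and $u$ is the uniform distribution over the $K$ classes. The second term penalizes overconfidence: it grows without bound as any predicted probability approaches zero, so the loss is minimized at finite logits rather than at infinity.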
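Both claims can be checked numerically. The sketch below uses only the standard library; the helper names, the example logits, and the grid of logit gaps are assumptions made for illustration. It first confirms the two-term decomposition for one arbitrary prediction, then sweeps the gap between the true-class logit and the others to see where each loss is smallest.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax for a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_z for z in logits]


def cross_entropy(logits, target):
    """Cross-entropy between a target distribution and softmax(logits)."""
    return -sum(t * lp for t, lp in zip(target, log_softmax(logits)))


K, alpha, true_class = 4, 0.1, 2
hard = [1.0 if k == true_class else 0.0 for k in range(K)]
uniform = [1.0 / K] * K
soft = [(1 - alpha) * y + alpha / K for y in hard]

# 1) Decomposition: CE against the smoothed target equals the weighted sum.
logits = [0.3, -1.2, 2.0, 0.5]  # arbitrary illustrative logits
lhs = cross_entropy(logits, soft)
rhs = (1 - alpha) * cross_entropy(logits, hard) + alpha * cross_entropy(logits, uniform)
print(abs(lhs - rhs) < 1e-12)


# 2) Sweep the gap between the true-class logit and all other logits.
def loss_at_gap(gap, target):
    z = [gap if k == true_class else 0.0 for k in range(K)]
    return cross_entropy(z, target)


gaps = [g / 10 for g in range(0, 201)]  # 0.0, 0.1, ..., 20.0
best_hard = min(gaps, key=lambda g: loss_at_gap(g, hard))
best_soft = min(gaps, key=lambda g: loss_at_gap(g, soft))
print(best_hard)  # the hard-target loss keeps falling, so the largest gap wins
print(best_soft)  # the smoothed loss bottoms out at a finite gap
```

With hard targets the best gap is always the largest one on the grid, however far the grid extends. With smoothing the minimum sits at a fixed finite gap, close to $\log 37 \approx 3.61$ for these settings.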

Szegedy et al., 2016, CVPR

Müller, Kornblith, Hinton, 2019, NeurIPS
