Legacy Concept Lab

Calibration & Temperature Scaling

Modern deep networks are often overconfident—90% confidence doesn't mean 90% accuracy

Concept 66 of 100 · Theory · Phase 8: Scaling, theory & multimodal

Why It Matters for Modern Models

Modern deep networks are often overconfident: among predictions made with 90% confidence, fewer than 90% may actually be correct.
Downstream decisions depend on it: medical diagnosis and autonomous driving need honest uncertainty estimates.
LLM "hallucination confidence" is a calibration failure: the model is certain about wrong things.

What is still poorly explained in textbooks and papers:

Neural nets are trained to maximize log-likelihood on the training set, not calibration on held-out data; the two objectives can diverge.
Temperature scaling is surprisingly effective: a single scalar often removes most of the miscalibration.
Bigger models are often less calibrated; scale alone does not solve everything.

Key Equation

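A model is calibrated if confidence matches accuracy. Here $\hat{Y}$ is the predicted label and $\hat{P}$ is the confidence the model assigns to it; among all predictions made with confidence $p$, a fraction $p$ should be correct:

P(Y = \hat{Y} | \hat{P} = p) = p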

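Expected Calibration Error (ECE) sorts the $n$ predictions by confidence into $M$ bins $B_1, \dots, B_M$ and averages the gap between accuracy and mean confidence per bin, weighted by bin size:

ECE = \sum_{m=1}^M \frac{|B_m|}{n} |\text{acc}(B_m) - \text{conf}(B_m)|

Here $\text{acc}(B_m)$ is the fraction of correct predictions in bin $B_m$ and $\text{conf}(B_m)$ is the mean confidence in that bin. A perfectly calibrated model has ECE = 0.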
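The ECE sum can be sketched in a few lines of plain Python. This is a minimal illustration, assuming equal-width confidence bins and that each prediction comes with a top-class confidence and a 0/1 correctness flag; the function name `expected_calibration_error`, the default `n_bins=10`, and the toy data are illustrative choices, not something the text above specifies.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width bins over [0, 1] (binning scheme is an assumption)."""
    n = len(confidences)
    counts = [0] * n_bins        # |B_m|
    hits = [0] * n_bins          # number of correct predictions in B_m
    conf_sums = [0.0] * n_bins   # sum of confidences in B_m

    for conf, ok in zip(confidences, correct):
        # Bin m covers [m / n_bins, (m + 1) / n_bins); conf == 1.0 goes in the last bin.
        m = min(int(conf * n_bins), n_bins - 1)
        counts[m] += 1
        hits[m] += int(ok)
        conf_sums[m] += conf

    ece = 0.0
    for m in range(n_bins):
        if counts[m] == 0:
            continue  # empty bins contribute nothing to the sum
        acc = hits[m] / counts[m]
        avg_conf = conf_sums[m] / counts[m]
        ece += counts[m] / n * abs(acc - avg_conf)
    return ece


# Toy data: five predictions at 0.95 confidence (4 correct) and
# five at 0.65 confidence (3 correct).
confidences = [0.95] * 5 + [0.65] * 5
correct = [1, 1, 1, 1, 0, 1, 1, 1, 0, 0]
ece = expected_calibration_error(confidences, correct)
print(round(ece, 3))  # 0.5 * |0.8 - 0.95| + 0.5 * |0.6 - 0.65| = 0.1
```

The result depends on the number of bins and on how they are drawn, so ECE values are only comparable when the binning is held fixed.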

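Temperature scaling: learn a single scalar $T > 0$ on a held-out validation set by minimizing negative log-likelihood, then divide the logits $z_i$ by $T$ before the softmax to get the calibrated probabilities $q_i$: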

q_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}

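$T > 1$ softens predictions (reduces overconfidence), $T < 1$ sharpens them, and $T = 1$ recovers the original softmax. Because dividing all logits by the same positive scalar does not change their ordering, the predicted class, and therefore accuracy, stays the same.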
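As a rough sketch of how $T$ might be fit, the code below minimizes validation negative log-likelihood with a golden-section search over $\log T$. The helper names (`softmax`, `nll`, `fit_temperature`), the search range, and the toy validation set are assumptions made for illustration; in practice a gradient-based optimizer on real validation logits is a common alternative.

```python
import math


def softmax(logits, T=1.0):
    """Temperature-scaled softmax: q_i = exp(z_i / T) / sum_j exp(z_j / T)."""
    scaled = [z / T for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def nll(logit_rows, labels, T):
    """Mean negative log-likelihood of the true labels at temperature T."""
    total = 0.0
    for logits, y in zip(logit_rows, labels):
        total -= math.log(softmax(logits, T)[y])
    return total / len(labels)


def fit_temperature(logit_rows, labels, lo=0.05, hi=20.0, iters=60):
    """Golden-section search over log T (range and iteration count are assumptions).

    The NLL is convex in 1/T, hence unimodal in log T, so this search applies.
    """
    a, b = math.log(lo), math.log(hi)
    inv_phi = (math.sqrt(5) - 1) / 2
    for _ in range(iters):
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if nll(logit_rows, labels, math.exp(c)) < nll(logit_rows, labels, math.exp(d)):
            b = d  # minimum lies in [a, d]
        else:
            a = c  # minimum lies in [c, b]
    return math.exp((a + b) / 2)


# Toy validation set: the model always outputs logits (4, 0), i.e. about 98%
# confidence in class 0, but class 0 is correct only 3 times out of 4.
val_logits = [[4.0, 0.0]] * 4
val_labels = [0, 0, 0, 1]

T = fit_temperature(val_logits, val_labels)
before = softmax(val_logits[0])[0]
after = softmax(val_logits[0], T)[0]
print(T, before, after)
```

On this toy set the optimum can be worked out by hand: the NLL is minimized when the scaled confidence equals the empirical accuracy of 0.75, which gives $T = 4 / \ln 3 \approx 3.64$, so the fitted $T > 1$ softens the overconfident prediction without changing the predicted class.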

Guo et al., 2017, ICML

Kumar et al., 2019, NeurIPS
