Machine Learning

Classification Metrics, Thresholds, and Calibration

Classification metrics turn scores into decisions, expose threshold tradeoffs, and check whether probabilities mean what they claim.

status: published · importance: critical · difficulty 3/5 · math: undergraduate · read: 19m · live demo

Intuition

Build the mental picture first so the rest of the page has something to attach to.

You are here because a classifier rarely gives only "yes" or "no." It usually gives a score or probability first, then someone chooses a threshold that turns that score into an action.

Before this, know logistic regression and why train/dev/test separation matters. By the end, you should be able to explain why precision and recall trade off, why ROC and precision-recall curves sweep thresholds, and why a confident probability can still be badly calibrated.

Start with a spam filter, medical screen, fraud detector, or safety classifier. The model says:

I think this case has score $s=0.72$.

That score is not yet the decision. If the threshold is $t=0.5$, the case is positive. If the threshold is $t=0.9$, the same case is negative. Changing $t$ does not retrain the model; it changes what kinds of mistakes you are willing to make.
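A minimal sketch of that idea in code. The helper name `decide` is hypothetical; it only applies the rule "positive if the score reaches the threshold" to the single score from the example.

```python
def decide(score: float, threshold: float) -> int:
    """Return 1 (positive) if the score reaches the threshold, else 0."""
    return int(score >= threshold)

s = 0.72  # the model's score for one case; the model itself never changes

# Same score, two thresholds, two different decisions.
decisions = {t: decide(s, t) for t in (0.5, 0.9)}
print(decisions)  # {0.5: 1, 0.9: 0}
```

Only the threshold moved between the two decisions; the score stayed fixed.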

Two mistakes matter:

A false positive says "positive" when the true label is negative.
A false negative says "negative" when the true label is positive.

Precision asks: among the cases we called positive, how many were truly positive? Recall asks: among the truly positive cases, how many did we catch?

Raising the threshold often increases precision because the model only accepts stronger positive scores. But it usually lowers recall because more true positives fall below the threshold. Lowering the threshold usually does the opposite.
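The tradeoff can be sketched by sweeping a few thresholds over the toy labels and scores used in the Code section of this page. The helper `precision_recall` and the choice to report 0.0 when a denominator is empty are assumptions made for illustration.

```python
y = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
s = [.93, .82, .74, .58, .41, .88, .63, .52, .37, .22, .18, .07]

def precision_recall(y, s, t):
    """Precision and recall after thresholding scores s at t."""
    tp = sum(1 for yi, si in zip(y, s) if si >= t and yi == 1)
    fp = sum(1 for yi, si in zip(y, s) if si >= t and yi == 0)
    fn = sum(1 for yi, si in zip(y, s) if si < t and yi == 1)
    # Assumption: report 0.0 when nothing is predicted (or nothing is) positive.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

for t in (0.3, 0.5, 0.7, 0.9):
    p, r = precision_recall(y, s, t)
    print(f"t={t:.1f}  precision={p:.3f}  recall={r:.3f}")
```

On this small sample, precision rises and recall falls as the threshold goes up. Recall can never rise when the threshold rises, but precision is not guaranteed to move monotonically on other data, which is why the text says "often".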

Calibration asks a different question. If the model says $0.8$ on many examples, are about $80\%$ of those examples actually positive? A classifier can rank examples well and still be overconfident. A threshold can create a useful decision rule and still sit on top of probabilities that should not be trusted as frequencies.
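One way to see ranking and calibration come apart is to push every score toward 0 or 1 with a strictly increasing map. The sketch below is an illustration, not a method from this page: `sharpen` (which multiplies the log-odds by an arbitrary factor `k=4`), `pairwise_auc`, and `brier` are hypothetical helpers, and the data is the toy sample from the Code section.

```python
import math

y = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
s = [.93, .82, .74, .58, .41, .88, .63, .52, .37, .22, .18, .07]

def sharpen(p, k=4.0):
    """Multiply the log-odds by k: strictly increasing, so the order is kept."""
    logit = math.log(p / (1.0 - p))
    return 1.0 / (1.0 + math.exp(-k * logit))

def pairwise_auc(y, s):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [si for yi, si in zip(y, s) if yi == 1]
    neg = [si for yi, si in zip(y, s) if yi == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def brier(y, s):
    """Mean squared gap between score and label."""
    return sum((si - yi) ** 2 for yi, si in zip(y, s)) / len(y)

s_sharp = [sharpen(p) for p in s]  # overconfident version of the same ranking

print("AUC  :", pairwise_auc(y, s), pairwise_auc(y, s_sharp))
print("Brier:", round(brier(y, s), 3), round(brier(y, s_sharp), 3))
```

The ranking quality is identical for both score sets because the order of examples never changed, while the sharpened scores get a worse (higher) Brier score on this sample: the confident mistakes now cost much more.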

Three caveats keep this honest:

Choose thresholds on dev data, not test data. Test is for the final estimate after the threshold rule is fixed.
No metric is universal. Precision, recall, F1, ROC AUC, PR AUC, and calibration each hide a different value judgment.
Calibration is about probabilities, not only accuracy. A model can be accurate and still have probabilities that are too sharp or too timid.

Math

Translate the story into symbols, assumptions, and a derivation you can inspect.

For binary labels $y_i\in\{0,1\}$ and model scores $s_i\in[0,1]$, a threshold $t$ creates hard predictions

$$
\hat y_i(t)=\mathbb 1[s_i\ge t].
$$

From those predictions, define the confusion counts:

$$
\begin{aligned}
\mathrm{TP}(t)&=\sum_i \mathbb 1[\hat y_i(t)=1,\ y_i=1],\\
\mathrm{FP}(t)&=\sum_i \mathbb 1[\hat y_i(t)=1,\ y_i=0],\\
\mathrm{FN}(t)&=\sum_i \mathbb 1[\hat y_i(t)=0,\ y_i=1],\\
\mathrm{TN}(t)&=\sum_i \mathbb 1[\hat y_i(t)=0,\ y_i=0].
\end{aligned}
$$

Precision and recall are

$$
\mathrm{precision}(t)=\frac{\mathrm{TP}(t)}{\mathrm{TP}(t)+\mathrm{FP}(t)},
\qquad
\mathrm{recall}(t)=\frac{\mathrm{TP}(t)}{\mathrm{TP}(t)+\mathrm{FN}(t)}.
$$

F1 is their harmonic mean:

$$
F_1(t)=\frac{2\,\mathrm{precision}(t)\,\mathrm{recall}(t)}{\mathrm{precision}(t)+\mathrm{recall}(t)}.
$$

The harmonic mean is harsh when either side is small. That is useful when you want both "few false alarms" and "few misses," but it is not a replacement for a real utility or safety cost.
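A small numeric check of that harshness, comparing F1 with the plain arithmetic mean. The precision and recall pairs are made-up values chosen for illustration, and `f1` is a hypothetical helper that returns 0.0 when both inputs are zero.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0

# Illustrative pairs: precision stays high while recall collapses.
for p, r in [(0.9, 0.9), (0.9, 0.5), (0.9, 0.1)]:
    arithmetic = (p + r) / 2
    print(f"precision={p}, recall={r}: arithmetic mean={arithmetic:.2f}, F1={f1(p, r):.2f}")
```

With precision 0.9 and recall 0.1, the arithmetic mean is still 0.50 while F1 drops to 0.18, so the weaker side dominates the harmonic mean.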

ROC curves sweep $t$ and plot true positive rate against false positive rate:

$$
\mathrm{TPR}(t)=\mathrm{recall}(t),
\qquad
\mathrm{FPR}(t)=\frac{\mathrm{FP}(t)}{\mathrm{FP}(t)+\mathrm{TN}(t)}.
$$
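The sweep can be sketched directly from these definitions. The code below is an illustrative sketch under stated assumptions: it uses every distinct score in the toy sample from the Code section as a threshold, and the helper names `roc_points` and `trapezoid_area` are hypothetical.

```python
y = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
s = [.93, .82, .74, .58, .41, .88, .63, .52, .37, .22, .18, .07]

def roc_points(y, s):
    """(FPR, TPR) pairs, sweeping the threshold from strictest to loosest."""
    n_pos = sum(y)
    n_neg = len(y) - n_pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing is positive
    for t in sorted(set(s), reverse=True):
        tp = sum(1 for yi, si in zip(y, s) if si >= t and yi == 1)
        fp = sum(1 for yi, si in zip(y, s) if si >= t and yi == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points

def trapezoid_area(points):
    """Area under a piecewise-linear curve given as (x, y) points."""
    return sum((x1 - x0) * (h0 + h1) / 2
               for (x0, h0), (x1, h1) in zip(points, points[1:]))

points = roc_points(y, s)
auc = trapezoid_area(points)
print("first and last points:", points[0], points[-1])
print("ROC AUC:", round(auc, 3))
```

The curve starts at (0, 0), where nothing is called positive, and ends at (1, 1), where everything is. The area under it summarizes ranking quality across all thresholds rather than the quality of any single decision rule.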

Precision-recall curves also sweep $t$, but plot precision against recall. When positives are rare, precision-recall views often make the positive-class tradeoff easier to see than ROC views.
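A quick arithmetic check of why rare positives change the picture. All counts below are made up for illustration: 10 positives, 990 negatives, and one hypothetical threshold that yields 8 true positives and 50 false positives.

```python
# Hypothetical counts for an imbalanced problem (not taken from real data).
n_pos, n_neg = 10, 990
tp, fp = 8, 50
fn, tn = n_pos - tp, n_neg - fp

tpr = tp / (tp + fn)        # recall, the ROC y-axis
fpr = fp / (fp + tn)        # the ROC x-axis
precision = tp / (tp + fp)  # the PR y-axis

print(f"TPR={tpr:.3f}  FPR={fpr:.3f}  precision={precision:.3f}")
```

The ROC point looks strong because 50 false alarms are a small share of 990 negatives, yet fewer than one in seven flagged cases is truly positive. Precision exposes that cost because its denominator counts only flagged cases.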

Calibration ignores the hard threshold and asks whether scores behave like probabilities. For a score bin $B_m=\{i:s_i\in(a_m,b_m]\}$, define

$$
\mathrm{conf}(B_m)=\frac{1}{|B_m|}\sum_{i\in B_m}s_i,
\qquad
\mathrm{acc}(B_m)=\frac{1}{|B_m|}\sum_{i\in B_m}y_i.
$$

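A perfectly calibrated model would satisfy $\mathrm{acc}(B_m)\approx \mathrm{conf}(B_m)$ in every bin that holds enough examples to estimate a frequency. A simple reliability-gap summary is the expected calibration error (ECE), which weights each bin's gap by its share of the $n$ examples:

$$
\mathrm{ECE}=\sum_m\frac{|B_m|}{n}\left|\mathrm{acc}(B_m)-\mathrm{conf}(B_m)\right|.
$$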

The Brier score keeps the probability scale directly:

$$
\mathrm{Brier}=\frac1n\sum_i(s_i-y_i)^2.
$$
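To get a feel for the scale, the sketch below evaluates the formula on three tiny hand-made cases. The helper `brier` and the cases are illustrative assumptions, not data from this page.

```python
def brier(y, s):
    """Mean squared gap between score and label."""
    return sum((si - yi) ** 2 for yi, si in zip(y, s)) / len(y)

cases = {
    "confident and right (s=0.9, y=1)": brier([1], [0.9]),
    "confident and wrong (s=0.9, y=0)": brier([0], [0.9]),
    "always 0.5 on one positive, one negative": brier([1, 0], [0.5, 0.5]),
}
for name, value in cases.items():
    print(f"{name}: {value:.2f}")
```

A confident correct score costs about 0.01, the same confidence on the wrong label costs about 0.81, and a score of 0.5 everywhere gives 0.25. Lower is better.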

The threshold $t$ changes decisions. Calibration changes what the scores mean. In a clean workflow, you learn the model on train, choose thresholds or calibration parameters on dev or cross-validation folds, and report final metrics once on test.
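A minimal sketch of that workflow for the threshold step, assuming F1 is the selection metric. The dev sample reuses the toy data from the Code section, the test sample is made up for illustration, and `f1_at` and `choose_threshold` are hypothetical helper names.

```python
def f1_at(y, s, t):
    """F1 at threshold t, written as 2TP / (2TP + FP + FN)."""
    tp = sum(1 for yi, si in zip(y, s) if si >= t and yi == 1)
    fp = sum(1 for yi, si in zip(y, s) if si >= t and yi == 0)
    fn = sum(1 for yi, si in zip(y, s) if si < t and yi == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def choose_threshold(y_dev, s_dev, candidates):
    """Pick the candidate with the best dev F1 (ties go to the earliest)."""
    return max(candidates, key=lambda t: f1_at(y_dev, s_dev, t))

# Dev split: used to choose the threshold.
y_dev = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
s_dev = [.93, .82, .74, .58, .41, .88, .63, .52, .37, .22, .18, .07]

# Test split: hypothetical values, touched once after the rule is fixed.
y_test = [1, 1, 1, 0, 0, 0, 0, 1]
s_test = [.91, .66, .35, .72, .44, .28, .12, .55]

candidates = [i / 10 for i in range(1, 10)]
t_star = choose_threshold(y_dev, s_dev, candidates)

print("threshold chosen on dev:", t_star)
print("dev F1 :", round(f1_at(y_dev, s_dev, t_star), 3))
print("test F1:", round(f1_at(y_test, s_test, t_star), 3))
```

The test labels never enter `choose_threshold`, so the test F1 is an estimate for a rule that was already fixed. The dev F1 is the value the search maximized, so it tends to be optimistic.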

Code

Keep the implementation aligned with the notation so the algorithm is legible.

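```python
import numpy as np

# Toy data: 5 positives followed by 7 negatives, with model scores in [0, 1].
y = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0])
s = np.array([.93, .82, .74, .58, .41, .88, .63, .52, .37, .22, .18, .07])

# Threshold step: yhat_i(t) = 1[s_i >= t].
t = 0.70
yhat = (s >= t).astype(int)

# Confusion counts at threshold t.
tp = int(np.sum((yhat == 1) & (y == 1)))
fp = int(np.sum((yhat == 1) & (y == 0)))
fn = int(np.sum((yhat == 0) & (y == 1)))
tn = int(np.sum((yhat == 0) & (y == 0)))

# The max(...) guards avoid division by zero when a denominator is empty.
precision = tp / max(tp + fp, 1)
recall = tp / max(tp + fn, 1)
f1 = 2 * precision * recall / max(precision + recall, 1e-12)

# Brier score: mean squared gap between score and label; no threshold involved.
brier = float(np.mean((s - y) ** 2))

# ECE over four equal-width bins (lo, hi], each weighted by its share of examples.
bins = [(0.0, 0.25), (0.25, 0.50), (0.50, 0.75), (0.75, 1.0)]
ece = 0.0
for lo, hi in bins:
    mask = (s > lo) & (s <= hi)  # note: a score of exactly 0.0 falls in no bin
    if np.any(mask):
        conf = np.mean(s[mask])  # average confidence in the bin
        acc = np.mean(y[mask])   # empirical positive rate in the bin
        ece += mask.mean() * abs(acc - conf)

print("confusion:", {"tp": tp, "fp": fp, "fn": fn, "tn": tn})
print("precision, recall, f1:", np.round([precision, recall, f1], 3))
print("brier, ece:", round(brier, 3), round(ece, 3))
```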

The code mirrors the math. yhat = (s >= t) is the threshold step, the confusion counts create precision and recall, and the calibration bins compare average confidence to empirical accuracy.

Interactive Demo

Use direct manipulation to connect the explanation to a moving system.

Move the threshold and choose a score shape. Before revealing the metrics, predict the dominant issue: false alarms, misses, or probability calibration.

The bars show the confusion counts after reveal. The metric panel reports precision, recall, F1, Brier score, and a reliability gap. The point is not to worship one number. The point is to ask which decision cost, class balance, and probability meaning your metric is silently choosing for you.

Source Grounding

Canonical references for the mechanism on this page.

documentation · 2026 · scikit-learn User Guide: Classification metrics · scikit-learn developers

Source for confusion-matrix-derived metrics such as precision, recall, F1, ROC AUC, and precision-recall evaluation.

documentation · 2026 · scikit-learn User Guide: Tuning the decision threshold for class prediction · scikit-learn developers

Source for the separation between probability estimation and decision thresholding.

documentation · 2026 · scikit-learn User Guide: Probability calibration · scikit-learn developers

Source for calibration curves, reliability diagrams, and the meaning of calibrated probabilities.

paper · 2017 · On Calibration of Modern Neural Networks · Guo, Pleiss, Sun, and Weinberger

Source for modern neural networks often being miscalibrated and for temperature scaling as a post-hoc calibration method.

Claim Review

Status: 1 substantive review recorded

Claims without a substantive review badge still need exact source-support review.

Sources: 4 references

sklearn-classification-metrics, sklearn-threshold-tuning, sklearn-probability-calibration, guo-modern-calibration

Witnesses: 4 local objects

Use equation, code, and demo objects to check whether the source support is operational.

Substantively reviewed: Precision, recall, F1, ROC/PR behavior, and calibration answer different evaluation questions: thresholds change hard decisions, while calibration asks whether predicted probabilities match empirical frequencies.

Claim metadata: source checked

scikit-learn supports confusion-matrix metrics, ROC/PR evaluation, threshold tuning as a separate decision step, and calibration curves/reliability diagrams; Guo et al. support the warning that high accuracy or cross-entropy training does not guarantee calibrated probabilities.

Sources: scikit-learn User Guide: Classification metrics; scikit-learn User Guide: Tuning the decision threshold for class prediction; scikit-learn User Guide: Probability calibration; On Calibration of Modern Neural Networks

This claim covers binary classification evaluation and calibration intuition; it does not certify a universal best metric, a deployment-specific utility function, multiclass averaging choices, or medical/legal safety thresholds.

A bounded review summary is present; still check caveats and exact source scope.

Substantively reviewed after GPT Pro found one blocker: pre-reveal demo-state leakage of hidden metrics into the companion path. Fixed emitDemoState so metrics appear only after reveal, aligned reliability-bin wording, and kept source support tied to scikit-learn metrics/threshold/calibration docs plus Guo et al. Caveats: binary classification teaching setup only; no universal metric or deployment threshold guarantee.

Reviewer: gpt-pro; reviewed 2026-06-28

Domain

Machine Learning

machine-learning, evaluation, classification, calibration, metrics

Prerequisites

Logistic Regression
Train/Dev/Test Splits, Cross-Validation, and Leakage

Leads To

Model Selection and Hyperparameter Search

Multinomial Logistic Regression
Cross-Entropy
Label Smoothing & Soft Targets