12. Representations

Probing, Linear Classifier Probes & Activation Analysis

Canonical Papers

Understanding Intermediate Layers using Linear Classifier Probes

Alain & Bengio, 2016, ICLR Workshop

BERT Rediscovers the Classical NLP Pipeline

Tenney et al., 2019, ACL

Core Mathematics

Given a layer representation $h_\ell(x)$, train a linear probe on the frozen features:

\hat y = W h_\ell(x) + b

(or a softmax over $W h_\ell(x)$ for classification) on a supervised task (POS tags, parse trees, etc.). The probe's accuracy estimates how linearly separable that information is at layer $\ell$.

Tenney et al. found that BERT's layers roughly follow the classical NLP pipeline (POS → syntax → semantics → coreference), with lower-level linguistic information resolved in earlier layers.
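
A minimal sketch of the recipe above, using scikit-learn; the activations and labels are random placeholders standing in for features you would cache from a real model:

```python
# A minimal probing run: fit y_hat = softmax(W h_l(x) + b) on activations
# from a frozen model. The arrays below are random placeholders standing
# in for features cached by running the model once over a labeled dataset.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
activations = rng.normal(size=(1000, 768))  # h_l(x): one row per example
labels = rng.integers(0, 17, size=1000)     # e.g. 17 POS-tag classes

X_train, X_test, y_train, y_test = train_test_split(
    activations, labels, test_size=0.2, random_state=0
)

# Only the probe's W and b are trained; the model never sees a gradient.
probe = LogisticRegression(max_iter=1000)
probe.fit(X_train, y_train)
print(f"probe accuracy at layer l: {probe.score(X_test, y_test):.3f}")
```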

Key Equation
\hat y = W h_\ell(x) + b


Why It Matters for Modern Models

  • Probing is one of the main tools for understanding what GPT-like models know and where that knowledge lives; in practice this means reading out hidden states layer by layer (see the sketch after this list)
  • Used heavily for safety (probing for dangerous capabilities), robustness, and fairness analyses
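
Layer-wise probing starts by caching hidden states from every layer. A sketch assuming the Hugging Face transformers API; the model name and mean-pooling are illustrative choices, not part of the method:

```python
# Caching per-layer hidden states so each layer can get its own probe.
import torch
from transformers import AutoModel, AutoTokenizer

name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name).eval()

inputs = tokenizer("The quick brown fox jumps over the lazy dog",
                   return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# out.hidden_states: the embedding layer plus one tensor per layer,
# each of shape (batch, seq_len, hidden_dim).
for layer, h in enumerate(out.hidden_states):
    pooled = h.mean(dim=1)  # one vector per example; feed these to a probe
    print(f"layer {layer:2d}: pooled features {tuple(pooled.shape)}")
```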

Missing Intuition

What is still poorly explained in textbooks and papers:

  • A clear mental model of what probes actually measure: information content versus ease of extraction (the sketch after this list illustrates the gap)
  • Visual, layer-by-layer maps of information flow in large LMs
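
One way to see the information-vs-extraction gap: plant a label in the features nonlinearly, so the information is fully present but not linearly decodable. In this synthetic scikit-learn sketch (all data fabricated), the linear probe hovers near chance while a small MLP probe recovers the label:

```python
# Sketch of the "information content vs ease of extraction" gap.
# The label is the XOR of two coordinate signs: every bit of information
# is present in the features, but no linear boundary can read it out.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
H = rng.normal(size=(4000, 32))            # stand-in for layer activations
y = (H[:, 0] > 0) ^ (H[:, 1] > 0)          # XOR: present, but nonlinear

H_train, y_train = H[:3000], y[:3000]
H_test, y_test = H[3000:], y[3000:]

linear_probe = LogisticRegression(max_iter=1000).fit(H_train, y_train)
mlp_probe = MLPClassifier(hidden_layer_sizes=(64,), max_iter=1000,
                          random_state=0).fit(H_train, y_train)

# The linear probe stays near chance; the MLP probe decodes the label,
# even though both see exactly the same representation.
print(f"linear probe accuracy: {linear_probe.score(H_test, y_test):.2f}")
print(f"MLP probe accuracy:    {mlp_probe.score(H_test, y_test):.2f}")
```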

