12. Representations

Probing, Linear Classifier Probes & Activation Analysis

Canonical Papers

Understanding Intermediate Layers using Linear Classifier Probes

Alain & Bengio, 2016, ICLR Workshop

BERT Rediscovers the Classical NLP Pipeline

Tenney et al., 2019, ACL

Core Mathematics

Given a layer representation $h_\ell(x)$, train a linear probe on the frozen features:

\hat y = W h_\ell(x) + b

(or a softmax over $W h_\ell(x)$ for classification) on a supervised task (POS tags, parse trees, etc.). The probe's accuracy estimates how linearly separable that information is at layer $\ell$.

Tenney et al. found that BERT's layers roughly follow the classical NLP pipeline (POS → syntax → semantics → coreference), with lower-level linguistic information resolved in earlier layers.
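
A minimal sketch of the recipe above, using scikit-learn; the activations and labels are random placeholders standing in for features you would cache from a real model:

```python
# A minimal probing run: fit y_hat = softmax(W h_l(x) + b) on activations
# from a frozen model. The arrays below are random placeholders standing
# in for features cached by running the model once over a labeled dataset.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
activations = rng.normal(size=(1000, 768))  # h_l(x): one row per example
labels = rng.integers(0, 17, size=1000)     # e.g. 17 POS-tag classes

X_train, X_test, y_train, y_test = train_test_split(
    activations, labels, test_size=0.2, random_state=0
)

# Only the probe's W and b are trained; the model never sees a gradient.
probe = LogisticRegression(max_iter=1000)
probe.fit(X_train, y_train)
print(f"probe accuracy at layer l: {probe.score(X_test, y_test):.3f}")
```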

Key Equation
\hat y = W h_\ell(x) + b


Why It Matters for Modern Models

  • Probing is one of the main tools for understanding what GPT-like models know and where that knowledge lives; in practice this means reading out hidden states layer by layer (see the sketch after this list)
  • Used heavily for safety (probing for dangerous capabilities), robustness, and fairness analyses
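
Layer-wise probing starts by caching hidden states from every layer. A sketch assuming the Hugging Face transformers API; the model name and mean-pooling are illustrative choices, not part of the method:

```python
# Caching per-layer hidden states so each layer can get its own probe.
import torch
from transformers import AutoModel, AutoTokenizer

name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name).eval()

inputs = tokenizer("The quick brown fox jumps over the lazy dog",
                   return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# out.hidden_states: the embedding layer plus one tensor per layer,
# each of shape (batch, seq_len, hidden_dim).
for layer, h in enumerate(out.hidden_states):
    pooled = h.mean(dim=1)  # one vector per example; feed these to a probe
    print(f"layer {layer:2d}: pooled features {tuple(pooled.shape)}")
```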

Missing Intuition

What is still poorly explained in textbooks and papers:

  • A clear mental model of what probes actually measure: information content versus ease of extraction (the sketch after this list illustrates the gap)
  • Visual, layer-by-layer maps of information flow in large LMs
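
One way to see the information-vs-extraction gap: plant a label in the features nonlinearly, so the information is fully present but not linearly decodable. In this synthetic scikit-learn sketch (all data fabricated), the linear probe hovers near chance while a small MLP probe recovers the label:

```python
# Sketch of the "information content vs ease of extraction" gap.
# The label is the XOR of two coordinate signs: every bit of information
# is present in the features, but no linear boundary can read it out.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
H = rng.normal(size=(4000, 32))            # stand-in for layer activations
y = (H[:, 0] > 0) ^ (H[:, 1] > 0)          # XOR: present, but nonlinear

H_train, y_train = H[:3000], y[:3000]
H_test, y_test = H[3000:], y[3000:]

linear_probe = LogisticRegression(max_iter=1000).fit(H_train, y_train)
mlp_probe = MLPClassifier(hidden_layer_sizes=(64,), max_iter=1000,
                          random_state=0).fit(H_train, y_train)

# The linear probe stays near chance; the MLP probe decodes the label,
# even though both see exactly the same representation.
print(f"linear probe accuracy: {linear_probe.score(H_test, y_test):.2f}")
print(f"MLP probe accuracy:    {mlp_probe.score(H_test, y_test):.2f}")
```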

