27. Representations

🔍 Sparse Autoencoders at Scale: Feature Dictionaries for Mechanistic Interpretability

Canonical Papers

Scaling and evaluating sparse autoencoders

Gao et al. (OpenAI), 2024, arXiv

Improving Dictionary Learning with Gated Sparse Autoencoders

Rajamanoharan et al., 2024, arXiv

Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models

Marks et al., 2024, arXiv

Core Mathematics

Sparse Autoencoders (SAEs) learn a feature dictionary that recovers interpretable directions from dense model activations, addressing the superposition problem, in which networks pack more features into an activation space than it has dimensions.

Let $x \in \mathbb{R}^d$ be a model activation vector (often the residual stream at a given layer).

SAE encode → sparse latents, decode → reconstruction:

$$z = \text{ReLU}\big(W_{\text{enc}}(x - b_{\text{pre}}) + b_{\text{enc}}\big), \qquad \hat{x} = W_{\text{dec}}\, z + b_{\text{pre}}$$

This makes the key idea concrete: a sparse code $z$ explains the dense activation $x$ via a learned dictionary $W_{\text{dec}}$.
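The encode/decode maps above fit in a few lines of code. Below is a minimal sketch in PyTorch; the class name `SparseAutoencoder` and the dimension names `d_model` (activation size) and `d_dict` (dictionary size) are illustrative assumptions, not names taken from the papers.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE: z = ReLU(W_enc (x - b_pre) + b_enc), x_hat = W_dec z + b_pre."""

    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.b_pre = nn.Parameter(torch.zeros(d_model))   # subtracted before encoding, added back after decoding
        self.W_enc = nn.Parameter(0.01 * torch.randn(d_dict, d_model))
        self.b_enc = nn.Parameter(torch.zeros(d_dict))
        self.W_dec = nn.Parameter(0.01 * torch.randn(d_model, d_dict))  # each column is a candidate feature direction

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        # x: (..., d_model) -> z: (..., d_dict); sparsity comes from the training objective
        return torch.relu((x - self.b_pre) @ self.W_enc.T + self.b_enc)

    def decode(self, z: torch.Tensor) -> torch.Tensor:
        # z: (..., d_dict) -> x_hat: (..., d_model)
        return z @ self.W_dec.T + self.b_pre

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.decode(self.encode(x))
```

With `d_dict` larger than `d_model`, the dictionary is overcomplete, which is what lets a sparse code disentangle superposed features.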

Classic SAE objective (reconstruction + sparsity penalty):

$$\mathcal{L} = \lVert x - \hat{x} \rVert_2^2 + \lambda \lVert z \rVert_1$$

Sparsity is what pushes toward monosemantic, human-nameable latents instead of dense uninterpretable factors.
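Continuing the sketch above, the classic objective becomes a short loss function. The name `sae_l1_loss` and the default coefficient `lambda_l1` are hypothetical; real training recipes tune the penalty (and typically constrain decoder column norms) more carefully.

```python
def sae_l1_loss(sae: SparseAutoencoder, x: torch.Tensor, lambda_l1: float = 1e-3) -> torch.Tensor:
    """Reconstruction error plus an L1 penalty on the latents (classic SAE objective)."""
    z = sae.encode(x)
    x_hat = sae.decode(z)
    recon = (x - x_hat).pow(2).sum(dim=-1)   # ||x - x_hat||_2^2, per example
    sparsity = z.abs().sum(dim=-1)           # ||z||_1, per example
    return (recon + lambda_l1 * sparsity).mean()
```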

Modern "Top-K / k-sparse" variant (directly control sparsity):

$$z = \text{TopK}\big(W_{\text{enc}}(x - b_{\text{pre}}),\, k\big), \qquad \mathcal{L} = \lVert x - \hat{x} \rVert_2^2$$

This replaces "tune λ\lambda" with "set exactly k active features per token," improving reconstruction–sparsity frontier.

Key Equation
$$z = \text{TopK}\big(W_{\text{enc}}(x - b_{\text{pre}}),\, k\big)$$


Why It Matters for Modern Models

  • OpenAI trained a 16M-latent SAE on GPT-4 activations, and Anthropic extracted interpretable features from Claude 3: this is how frontier labs actually do interpretability at scale
  • Individual neurons are polysemantic (they respond to multiple unrelated concepts); SAEs provide a scalable substitute unit: feature latents that are monosemantic
  • After teaching superposition (#11), this shows how to actually recover features from real models, turning theory into a practical method
  • Enables feature-level circuit analysis instead of head/neuron circuits: scalable causal graphs built from interpretable units
  • Bridges mechanistic interpretability (#11-13) to alignment (#24-26): once you can name internal features, you can measure, audit, and intervene

Missing Intuition

What is still poorly explained in textbooks and papers:

  • SAE = "learn a parts dictionary for the residual stream"—each column of $W_{\text{dec}}$ is candidate feature direction, sparse $z$ says which parts are present
  • Sparsity is the interpretability prior—dense code can represent anything but names nothing, sparse codes force reuse of directions across similar contexts
  • Real knob is "explanatory budget per token"—in k-sparse SAEs, $k$ literally sets how many features can explain an activation
  • SAE pathologies are not side notes—shrinkage bias from L1, dead latents at scale—these are the whole game, modern recipes explicitly address them
  • Interpretability becomes actionable when features become causal handles—ablation, amplification, steering turn "interpretability" into "debugging & control"
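As a rough illustration of that last point, the hypothetical helper below (reusing the `SparseAutoencoder` sketch from earlier, not code from any of the papers) rescales a single latent and decodes the edited activation; in a real steering experiment the result would be patched back into the model's forward pass.

```python
def steer_feature(sae: SparseAutoencoder, x: torch.Tensor, feature_idx: int, scale: float) -> torch.Tensor:
    """Rescale one SAE latent and decode: scale=0.0 ablates the feature, scale>1.0 amplifies it."""
    z = sae.encode(x).clone()
    z[..., feature_idx] *= scale
    return sae.decode(z)   # edited activation, to be patched into the forward pass

# Example: ablate (hypothetical) feature 123 in a batch of residual-stream activations
# x_steered = steer_feature(sae, x, feature_idx=123, scale=0.0)
```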
