27. Representations

🔍 Sparse Autoencoders at Scale: Feature Dictionaries for Mechanistic Interpretability

Canonical Papers

Scaling and evaluating sparse autoencoders

Gao et al. (OpenAI), 2024, arXiv

Improving Dictionary Learning with Gated Sparse Autoencoders

Rajamanoharan et al., 2024, arXiv

Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models

Marks et al., 2024, arXiv

Core Mathematics

Sparse Autoencoders (SAEs) learn a feature dictionary that recovers interpretable directions from dense model activations, addressing the superposition problem, in which networks pack more features into an activation space than it has dimensions.

Let $x \in \mathbb{R}^d$ be a model activation vector (often the residual stream at a given layer).

SAE encode → sparse latents, decode → reconstruction:

$$z = \text{ReLU}\big(W_{\text{enc}}(x - b_{\text{pre}}) + b_{\text{enc}}\big), \qquad \hat{x} = W_{\text{dec}}\, z + b_{\text{pre}}$$

This makes the key idea concrete: a sparse code $z$ explains the dense activation $x$ via a learned dictionary $W_{\text{dec}}$.
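The encode/decode maps above fit in a few lines of code. Below is a minimal sketch in PyTorch; the class name `SparseAutoencoder` and the dimension names `d_model` (activation size) and `d_dict` (dictionary size) are illustrative assumptions, not names taken from the papers.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE: z = ReLU(W_enc (x - b_pre) + b_enc), x_hat = W_dec z + b_pre."""

    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.b_pre = nn.Parameter(torch.zeros(d_model))   # subtracted before encoding, added back after decoding
        self.W_enc = nn.Parameter(0.01 * torch.randn(d_dict, d_model))
        self.b_enc = nn.Parameter(torch.zeros(d_dict))
        self.W_dec = nn.Parameter(0.01 * torch.randn(d_model, d_dict))  # each column is a candidate feature direction

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        # x: (..., d_model) -> z: (..., d_dict); sparsity comes from the training objective
        return torch.relu((x - self.b_pre) @ self.W_enc.T + self.b_enc)

    def decode(self, z: torch.Tensor) -> torch.Tensor:
        # z: (..., d_dict) -> x_hat: (..., d_model)
        return z @ self.W_dec.T + self.b_pre

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.decode(self.encode(x))
```

With `d_dict` larger than `d_model`, the dictionary is overcomplete, which is what lets a sparse code disentangle superposed features.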

Classic SAE objective (reconstruction + sparsity penalty):

$$\mathcal{L} = \lVert x - \hat{x} \rVert_2^2 + \lambda \lVert z \rVert_1$$

Sparsity is what pushes toward monosemantic, human-nameable latents instead of dense uninterpretable factors.
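Continuing the sketch above, the classic objective becomes a short loss function. The name `sae_l1_loss` and the default coefficient `lambda_l1` are hypothetical; real training recipes tune the penalty (and typically constrain decoder column norms) more carefully.

```python
def sae_l1_loss(sae: SparseAutoencoder, x: torch.Tensor, lambda_l1: float = 1e-3) -> torch.Tensor:
    """Reconstruction error plus an L1 penalty on the latents (classic SAE objective)."""
    z = sae.encode(x)
    x_hat = sae.decode(z)
    recon = (x - x_hat).pow(2).sum(dim=-1)   # ||x - x_hat||_2^2, per example
    sparsity = z.abs().sum(dim=-1)           # ||z||_1, per example
    return (recon + lambda_l1 * sparsity).mean()
```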

Modern "Top-K / k-sparse" variant (directly control sparsity):

$$z = \text{TopK}\big(W_{\text{enc}}(x - b_{\text{pre}}),\, k\big), \qquad \mathcal{L} = \lVert x - \hat{x} \rVert_2^2$$

This replaces "tune λ\lambda" with "set exactly k active features per token," improving reconstruction–sparsity frontier.

Key Equation
$$z = \text{TopK}\big(W_{\text{enc}}(x - b_{\text{pre}}),\, k\big)$$


Why It Matters for Modern Models

  • OpenAI trained a 16M-latent SAE on GPT-4 activations, and Anthropic extracted interpretable features from Claude 3: this is how frontier labs actually do interpretability at scale
  • Individual neurons are polysemantic (they respond to multiple unrelated concepts); SAEs provide a scalable substitute unit: feature latents that are monosemantic
  • After teaching superposition (#11), this shows how to actually recover features from real models, turning theory into a practical method
  • Enables feature-level circuit analysis instead of head/neuron circuits: scalable causal graphs built from interpretable units
  • Bridges mechanistic interpretability (#11-13) to alignment (#24-26): once you can name internal features, you can measure, audit, and intervene

Missing Intuition

What is still poorly explained in textbooks and papers:

  • SAE = "learn a parts dictionary for the residual stream"—each column of $W_{\text{dec}}$ is candidate feature direction, sparse $z$ says which parts are present
  • Sparsity is the interpretability prior—dense code can represent anything but names nothing, sparse codes force reuse of directions across similar contexts
  • Real knob is "explanatory budget per token"—in k-sparse SAEs, $k$ literally sets how many features can explain an activation
  • SAE pathologies are not side notes—shrinkage bias from L1, dead latents at scale—these are the whole game, modern recipes explicitly address them
  • Interpretability becomes actionable when features become causal handles—ablation, amplification, steering turn "interpretability" into "debugging & control"
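As a rough illustration of that last point, the hypothetical helper below (reusing the `SparseAutoencoder` sketch from earlier, not code from any of the papers) rescales a single latent and decodes the edited activation; in a real steering experiment the result would be patched back into the model's forward pass.

```python
def steer_feature(sae: SparseAutoencoder, x: torch.Tensor, feature_idx: int, scale: float) -> torch.Tensor:
    """Rescale one SAE latent and decode: scale=0.0 ablates the feature, scale>1.0 amplifies it."""
    z = sae.encode(x).clone()
    z[..., feature_idx] *= scale
    return sae.decode(z)   # edited activation, to be patched into the forward pass

# Example: ablate (hypothetical) feature 123 in a batch of residual-stream activations
# x_steered = steer_feature(sae, x, feature_idx=123, scale=0.0)
```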
