Sparse Autoencoders at Scale: Feature Dictionaries for Mechanistic Interpretability
Canonical Papers
- Scaling and evaluating sparse autoencoders
- Improving Dictionary Learning with Gated Sparse Autoencoders
- Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models
Core Mathematics
Sparse Autoencoders (SAEs) learn a feature dictionary that recovers interpretable directions from dense model activations, addressing the superposition problem, in which networks pack more features than they have dimensions.
Let $x \in \mathbb{R}^d$ be a model activation vector (often the residual stream at a layer).
The SAE encodes $x$ into sparse latents $z$ and decodes them back into a reconstruction $\hat{x}$:

$$z = \mathrm{ReLU}(W_{\text{enc}} x + b_{\text{enc}}), \qquad \hat{x} = W_{\text{dec}} z + b_{\text{dec}}$$

where $z \in \mathbb{R}^m$ with $m \gg d$, so the dictionary is overcomplete.
This makes the key idea concrete: a sparse code $z$ explains the dense activation $x$ via the learned dictionary $W_{\text{dec}}$.
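A minimal sketch of this encode/decode pass in PyTorch; the class name `SparseAutoencoder`, the use of `nn.Linear` to hold $W_{\text{enc}}, b_{\text{enc}}$ and $W_{\text{dec}}, b_{\text{dec}}$, and the default initialization are illustrative choices, not a specific paper's recipe.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE: overcomplete dictionary (d_dict >> d_model), ReLU latents."""
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.enc = nn.Linear(d_model, d_dict)  # W_enc, b_enc
        self.dec = nn.Linear(d_dict, d_model)  # W_dec, b_dec; columns of dec.weight are feature directions

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        # z = ReLU(W_enc x + b_enc): non-negative code, intended to be sparse
        return torch.relu(self.enc(x))

    def decode(self, z: torch.Tensor) -> torch.Tensor:
        # x_hat = W_dec z + b_dec: reconstruction as a weighted sum of dictionary columns
        return self.dec(z)

    def forward(self, x: torch.Tensor):
        z = self.encode(x)
        return self.decode(z), z
```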
Classic SAE objective (reconstruction + sparsity penalty):

$$\mathcal{L}(x) = \|x - \hat{x}\|_2^2 + \lambda \|z\|_1$$
The sparsity penalty is what pushes the latents toward monosemantic, human-nameable features instead of dense, uninterpretable factors.
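A hedged sketch of this objective, reusing the `SparseAutoencoder` above; `l1_coeff` stands in for $\lambda$ and its default value is an arbitrary placeholder.

```python
def sae_loss(sae: SparseAutoencoder, x: torch.Tensor, l1_coeff: float = 1e-3) -> torch.Tensor:
    x_hat, z = sae(x)
    recon = (x - x_hat).pow(2).sum(dim=-1).mean()  # ||x - x_hat||_2^2, averaged over the batch
    sparsity = z.abs().sum(dim=-1).mean()          # ||z||_1, the penalty that induces sparsity
    return recon + l1_coeff * sparsity
```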
Modern "Top-K / k-sparse" variant (directly controls sparsity):

$$z = \mathrm{TopK}(W_{\text{enc}} x + b_{\text{enc}})$$

where $\mathrm{TopK}$ keeps only the $k$ largest pre-activations per token and zeroes the rest.
This replaces "tune $\lambda$" with "set exactly $k$ active features per token," improving the reconstruction–sparsity frontier.
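A sketch of the Top-K encoder under the same assumptions; the scatter-based reassembly of the sparse vector is just one of several equivalent ways to implement it.

```python
def topk_encode(sae: SparseAutoencoder, x: torch.Tensor, k: int) -> torch.Tensor:
    pre_acts = sae.enc(x)                         # W_enc x + b_enc, shape (batch, d_dict)
    vals, idx = torch.topk(pre_acts, k, dim=-1)   # k largest pre-activations per token
    z = torch.zeros_like(pre_acts)
    return z.scatter(-1, idx, torch.relu(vals))   # at most k nonzero latents per token
```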
Why It Matters for Modern Models
- OpenAI trained 16M latent SAE on GPT-4, Anthropic extracted interpretable features from Claude 3—this is how frontier labs actually do interpretability at scale
- Neurons are polysemantic (respond to multiple unrelated concepts)—SAEs give scalable substitute unit: feature latents that are monosemantic
- After teaching superposition (#11), this shows how to actually recover features from real models—turning theory into practical method
- Enables feature-level circuit analysis instead of head/neuron circuits—scalable causal graphs built from interpretable units
- Bridges mechanistic interpretability (#11-13) to alignment (#24-26)—once you can name internal features, you can measure, audit, and intervene
Missing Intuition
What is still poorly explained in textbooks and papers:
- SAE = "learn a parts dictionary for the residual stream"—each column of $W_{\text{dec}}$ is candidate feature direction, sparse $z$ says which parts are present
- Sparsity is the interpretability prior—dense code can represent anything but names nothing, sparse codes force reuse of directions across similar contexts
- Real knob is "explanatory budget per token"—in k-sparse SAEs, $k$ literally sets how many features can explain an activation
- SAE pathologies are not side notes: shrinkage bias from the L1 penalty and dead latents at scale are the whole game, and modern recipes explicitly address them
- Interpretability becomes actionable when features become causal handles: ablation, amplification, and steering turn "interpretability" into "debugging & control" (see the sketch after this list)
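A hedged sketch of treating one latent as a causal handle, assuming the `SparseAutoencoder` above; adding the SAE's reconstruction error back before re-injecting the edited activation is a common detail, but the function name and interface here are hypothetical.

```python
def edit_feature(sae: SparseAutoencoder, x: torch.Tensor, feature_idx: int, value: float) -> torch.Tensor:
    """Return an edited activation with one SAE feature ablated (value=0) or amplified."""
    z = sae.encode(x)
    error = x - sae.decode(z)        # the part of x the SAE does not explain
    z = z.clone()
    z[..., feature_idx] = value      # clamp the chosen latent
    return sae.decode(z) + error     # substitute this back into the model's forward pass
```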