Bring the mental model from Representation Learning & Embedding Geometry; this page will reuse it instead of restarting from zero.
Representation Learning
Sparse Autoencoders: Feature Dictionaries for Mechanistic Interpretability
Sparse autoencoders learn a reusable dictionary of feature directions so dense model activations can be explained by a small set of interpretable latent factors.

01
Intuition
Build the mental picture first so the rest of the page has something to attach to.
Large models do not usually store one clean concept per neuron.
Instead, many concepts are packed into the same activation vector. A direction might partly mean "Python code", partly mean "HTML tag", and partly mean "list formatting". This is the superposition problem: the model is using the same coordinates for several overlapping features.
A sparse autoencoder (SAE) tries to learn a better coordinate system.
- The encoder looks at a dense activation and asks which hidden features are present.
- The decoder turns those hidden features back into a reconstruction of the original activation.
- The sparsity constraint says only a small number of features are allowed to fire for each token.
That is intended to make the latent code behave like a parts list for the residual stream. Instead of saying "this activation is a mysterious 4096-dimensional vector", we say "this activation seems to use a few reusable feature directions". The practical teaching knob is the explanatory budget per activation: how many features are you allowed to use before interpretability starts to melt into dense mush again?
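The superposition picture above can be sketched with a deliberately tiny example. The setup is hypothetical: three unit feature directions are packed into a two-dimensional activation space, so they cannot all be orthogonal. The names `features`, `code`, and `readout` are illustrative and do not come from any model.

```python
import numpy as np

# Hypothetical toy setup: 3 feature directions packed into a 2-D activation space.
angles = np.deg2rad([0.0, 120.0, 240.0])
features = np.stack([np.cos(angles), np.sin(angles)], axis=1)  # shape (3, 2), unit rows

# With 3 directions in 2 dimensions, every pair overlaps: cos(120 deg) = -0.5.
overlap = features @ features.T

# A sparse "parts list": only feature 0 is present, with strength 1.0.
code = np.array([1.0, 0.0, 0.0])
activation = code @ features          # the dense 2-D vector the model would store

# Reading each feature off with a dot product shows interference on features 1 and 2.
readout = features @ activation
# A ReLU discards the negative interference and recovers the sparse code in this case.
recovered = np.maximum(readout, 0.0)

print(np.round(overlap, 2))
print(np.round(readout, 2), np.round(recovered, 2))
```

In this toy case the nonlinearity is enough to undo the interference. Real activations are far messier, which is why the encoder and dictionary are learned rather than written down by hand.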
02
Math
Translate the story into symbols, assumptions, and a derivation you can inspect.
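Let $x \in \mathbb{R}^d$ be an activation vector from some model layer, often the residual stream.
Encode to sparse latents, decode back to the activation
An SAE maps $x$ into a sparse latent code $z \in \mathbb{R}^m$ and reconstructs it as $\hat{x}$:
$$z = \mathrm{ReLU}(W_{\text{enc}}\, x), \qquad \hat{x} = W_{\text{dec}}\, z$$
The columns of $W_{\text{dec}} \in \mathbb{R}^{d \times m}$ act like a learned feature dictionary. If $z_j$ is active, the $j$th feature direction contributes to the reconstruction.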
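A minimal sketch of the encode and decode maps, assuming a bias-free ReLU encoder and a linear decoder with unit-norm columns. The weights are random and untrained, and the names `W_enc`, `W_dec`, `encode`, and `decode` are illustrative. The only point is to show the shapes and that the reconstruction is a sum of active dictionary columns.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 8, 32                                  # activation width, dictionary size (m > d)

W_enc = 0.1 * rng.standard_normal((m, d))     # encoder: one row per latent feature
W_dec = rng.standard_normal((d, m))           # decoder: one column per feature direction
W_dec /= np.linalg.norm(W_dec, axis=0, keepdims=True)  # unit-norm dictionary columns

def encode(x):
    """ReLU encoder: nonnegative latent scores, shape (m,)."""
    return np.maximum(W_enc @ x, 0.0)

def decode(z):
    """Linear decoder: weighted sum of dictionary columns, shape (d,)."""
    return W_dec @ z

x = rng.standard_normal(d)
z = encode(x)
x_hat = decode(z)

# The same reconstruction, written as an explicit parts list over active latents.
active = np.flatnonzero(z > 0)
x_hat_from_parts = sum((z[j] * W_dec[:, j] for j in active), np.zeros(d))

print("active latents:", active.size, "of", m)
```

An untrained encoder leaves roughly half the latents active, so the code is not yet sparse. Sparsity comes from the training objective, not from the architecture alone.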
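Reconstruction plus sparsity
The classic objective balances faithfulness and simplicity. With $\hat{x}$ the reconstruction of $x$ and $z$ the latent code:
$$\mathcal{L}(x) = \lVert x - \hat{x} \rVert_2^2 + \lambda \lVert z \rVert_1$$
- $\lVert x - \hat{x} \rVert_2^2$ asks the dictionary to explain the real activation.
- $\lambda \lVert z \rVert_1$ punishes too many active features.
If $\lambda$ is too small, the code becomes dense and hard to interpret. If it is too large, reconstruction worsens and useful structure may be missed or pushed into inactive latents.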
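The balance can be made concrete with a small numeric sketch, assuming the usual form of the objective: squared reconstruction error plus a weight `lam` times the L1 norm of the code. The two candidate codes below are made up for illustration. One reconstructs perfectly with four active latents, and the other accepts a small error with a single active latent.

```python
import numpy as np

def sae_loss(x, x_hat, z, lam):
    """Squared reconstruction error plus lam * L1 norm of the latent code."""
    recon = np.sum((x - x_hat) ** 2)
    penalty = lam * np.sum(np.abs(z))
    return recon + penalty

x = np.array([1.0, 2.0])

# Hypothetical candidate explanations of the same activation.
z_dense, x_hat_dense = np.array([1.0, 1.0, 1.0, 1.0]), np.array([1.0, 2.0])    # exact, 4 active
z_sparse, x_hat_sparse = np.array([2.0, 0.0, 0.0, 0.0]), np.array([0.9, 1.9])  # small error, 1 active

for lam in (0.001, 0.1):
    dense = sae_loss(x, x_hat_dense, z_dense, lam)
    sparse = sae_loss(x, x_hat_sparse, z_sparse, lam)
    winner = "dense" if dense < sparse else "sparse"
    print(f"lam={lam}: dense={dense:.3f}, sparse={sparse:.3f}, preferred={winner}")
```

With the small weight the dense code has the lower loss, and with the larger weight the sparse code does. Note that the L1 term penalizes total latent magnitude, which is only a proxy for the number of active features.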
Top-k sparse coding
Some recent SAE work replaces the soft $L_1$ penalty with a hard "only keep the best $k$ features" rule:
$$z = \mathrm{TopK}_k\big(\mathrm{ReLU}(W_{\text{enc}}\, x)\big), \qquad \hat{x} = W_{\text{dec}}\, z$$
Here $\mathrm{TopK}_k$ keeps the $k$ largest nonnegative activation scores and sets the rest to zero.
Now $k$ is explicit: each activation gets a fixed budget of $k$ kept latent slots. This is easier to reason about pedagogically and makes the reconstruction versus interpretability tradeoff visible in one number.
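The hard budget can be sketched as a batched NumPy activation function. The name `topk_relu` is hypothetical, and clamping the kept scores at zero is one reading of "largest nonnegative activation scores". Published TopK SAEs differ in details such as biases.

```python
import numpy as np

def topk_relu(pre, k):
    """Keep the k largest pre-activations in each row, clamp them at zero, zero the rest."""
    pre = np.atleast_2d(pre)
    # argpartition places the indices of the k largest entries in the last k slots.
    idx = np.argpartition(pre, -k, axis=1)[:, -k:]
    rows = np.arange(pre.shape[0])[:, None]
    z = np.zeros_like(pre)
    z[rows, idx] = np.maximum(pre[rows, idx], 0.0)
    return z

pre = np.array([[0.2, -1.0, 3.0, 0.7, 1.5],
                [-0.3, -0.1, -2.0, 0.4, -0.5]])
z = topk_relu(pre, k=2)
print(z)
print("active per row:", np.count_nonzero(z, axis=1))
```

The second row shows that the budget is an upper bound: when fewer than `k` scores are positive, fewer than `k` latents fire.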
03
Code
Keep the implementation aligned with the notation so the algorithm is legible.
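import numpy as np

rs = np.random.RandomState(0)
n, d, m, k = 256, 10, 24, 3  # samples, activation dim, dictionary size, true active features

# Ground-truth dictionary with unit-norm columns.
D_true = rs.randn(d, m)
D_true /= np.linalg.norm(D_true, axis=0, keepdims=True)

# Each synthetic activation uses exactly k features with positive strengths, plus small noise.
Z_true = np.zeros((n, m))
for row in Z_true:
    row[rs.choice(m, k, replace=False)] = rs.uniform(0.5, 1.5, k)
X = Z_true @ D_true.T + 0.02 * rs.randn(n, d)

def train_sae(lam, steps=500, lr=0.05):
    """Full-batch gradient descent on 0.5 * MSE + lam * L1; returns (mse, avg active latents)."""
    W_enc, W_dec = 0.1 * rs.randn(m, d), 0.1 * rs.randn(d, m)
    for _ in range(steps):
        pre = X @ W_enc.T                 # encoder pre-activations, shape (n, m)
        Z = np.maximum(pre, 0.0)          # ReLU latents
        X_hat = Z @ W_dec.T               # reconstruction, shape (n, d)
        err = (X_hat - X) / n
        grad_dec = err.T @ Z
        # Reconstruction gradient plus the L1 subgradient on active latents.
        grad_z = err @ W_dec + lam * (Z > 0) / n
        W_enc -= lr * ((grad_z * (pre > 0)).T @ X)
        W_dec -= lr * grad_dec
        # Renormalize dictionary columns so the penalty cannot be dodged by rescaling.
        W_dec /= np.linalg.norm(W_dec, axis=0, keepdims=True) + 1e-9
    Z = np.maximum(X @ W_enc.T, 0.0)
    X_hat = Z @ W_dec.T
    return np.mean((X - X_hat) ** 2), np.mean(np.count_nonzero(Z > 1e-3, axis=1))

faithful = train_sae(lam=0.0005)
sparse = train_sae(lam=0.30)
print("low lambda: mse %.4f, avg active latents %.1f" % faithful)
print("high lambda: mse %.4f, avg active latents %.1f" % sparse)
# The high-lambda run should be sparser but less faithful.
assert sparse[1] < faithful[1] and sparse[0] > faithful[0]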
This toy SAE trains an encoder and a decoder dictionary on synthetic activations. Raising the sparsity weight $\lambda$ (`lam` in the code) makes the latent code sparser on average but increases reconstruction error, which is exactly the tradeoff the objective describes.
04
Interactive Demo
Use direct manipulation to connect the explanation to a moving system.
Use the demo to explore the main SAE design tradeoff:
- how reconstruction error falls as more features are allowed to fire,
- how this toy frontier illustrates an $L_1$ shrinkage failure mode and contrasts it with TopK and gated-style mechanisms,
- and how "better reconstruction" is not the same thing as "cleaner, more interpretable features".
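The first two points can be previewed offline with a sketch that makes one strong simplifying assumption: the dictionary `Q` is orthonormal, so the best code is just `Q.T @ x`. Under that assumption the L1-optimal code is a soft threshold and a hard budget keeps the largest coefficients unchanged. The names `topk_code` and `soft_threshold` are illustrative, and the top-k here ranks by absolute value rather than applying a ReLU.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
# Assumption for illustration only: an orthonormal dictionary.
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))

coef = np.zeros(d)
coef[:4] = [3.0, 2.0, 1.5, 1.0]                    # four true features
x = Q @ coef + 0.01 * rng.standard_normal(d)       # activation with small noise
c = Q.T @ x                                        # exact coefficients in the dictionary

def topk_code(c, k):
    """Keep the k largest-magnitude coefficients unchanged, zero the rest."""
    z = np.zeros_like(c)
    keep = np.argsort(-np.abs(c))[:k]
    z[keep] = c[keep]
    return z

def soft_threshold(c, lam):
    """Minimizer of 0.5 * ||x - Q z||^2 + lam * ||z||_1 when Q is orthonormal."""
    return np.sign(c) * np.maximum(np.abs(c) - lam, 0.0)

# Frontier: reconstruction error as the feature budget k grows.
errors = [np.sum((x - Q @ topk_code(c, k)) ** 2) for k in range(d + 1)]

# Same four features selected, but the L1 code shrinks every kept magnitude by lam.
z_top = topk_code(c, 4)
z_l1 = soft_threshold(c, 0.5)
err_top = np.sum((x - Q @ z_top) ** 2)
err_l1 = np.sum((x - Q @ z_l1) ** 2)
print("errors for k=0..4:", np.round(errors[:5], 3))
print("top-4 error %.4f vs L1 error %.4f" % (err_top, err_l1))
```

Both codes pick the same four features, yet the L1 code reconstructs worse because it underestimates every active strength. That systematic underestimate is the shrinkage effect a hard budget avoids in this toy setting.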
Source Grounding
Canonical references for the mechanism on this page.
- Bricken et al. (2023), Towards Monosemanticity: Decomposing Language Models With Dictionary Learning (article; source id bricken-2023-monosemanticity). Grounds sparse autoencoders as dictionary-learning tools for decomposing activations into more interpretable features.
- Gao et al. (2024), Scaling and evaluating sparse autoencoders (paper; source id gao-2024-scaling-sae). Grounds SAE scaling and evaluation tradeoffs for larger language-model activations.
Claim Review
Claim: Sparse autoencoders learn a reusable dictionary of feature directions so dense model activations can be explained by a small set of interpretable latent factors.
Support: Bricken et al. support SAE dictionary learning over transformer activations with a ReLU encoder, decoder/dictionary reconstruction, MSE plus L1 sparsity, and hidden activations as learned features. Gao et al. support language-model activation reconstruction from sparse bottlenecks, L0/MSE evaluation, and the reconstruction-sparsity tradeoff plus TopK direct sparsity control. Local math and code witness the bounded mechanism; the synthetic demo is intentionally outside claim refs.
Scope: The review certifies only SAE reconstruction and sparsity: decoder dictionaries reconstruct LM activations from sparse latents under sparsity objectives. It does not certify universal monosemanticity, causal completeness, interpretability, steering/circuit usefulness, or TopK/gated dominance.
Reviewer: codex+oracle+codex-5.3; reviewed 2026-05-08