Representation Learning

Sparse Autoencoders: Feature Dictionaries for Mechanistic Interpretability

Sparse autoencoders learn a reusable dictionary of feature directions so dense model activations can be explained by a small set of interpretable latent factors.

status: published · importance: important · difficulty: 4/5 · math: undergraduate · read: 16 min · live demo


Editorial interpretability illustration of dense activations routed through sparse feature dictionary atoms and reconstructed outputs.



Intuition

Build the mental picture first so the rest of the page has something to attach to.


Large models do not usually store one clean concept per neuron.

Instead, many concepts are packed into the same activation vector. A single neuron might respond partly to "Python code", partly to "HTML tags", and partly to "list formatting". This is the superposition problem: the model represents more features than it has dimensions, so features share coordinates as overlapping, non-orthogonal directions.
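A minimal NumPy sketch of the superposition picture, under assumed toy sizes (8 features packed into 4 dimensions). The names `features`, `n_features`, and `readout` are illustrative and not part of the original page. With more feature directions than coordinates, the directions cannot all be orthogonal, so a naive linear readout of one feature picks up interference from the others.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, d = 8, 4  # assumed toy sizes: more features than dimensions

# Each feature gets a random unit-norm direction in activation space.
features = rng.standard_normal((n_features, d))
features /= np.linalg.norm(features, axis=1, keepdims=True)

# Pairwise cosine similarities; off-diagonal entries measure interference.
gram = features @ features.T
off_diag = gram[~np.eye(n_features, dtype=bool)]
max_interference = np.abs(off_diag).max()

# An activation that genuinely contains only features 0 and 3.
x = 1.0 * features[0] + 0.7 * features[3]

# Naive linear readout: project x onto every feature direction.
readout = features @ x

print("largest |cosine| between distinct features:", round(float(max_interference), 3))
print("readout per feature:", np.round(readout, 2))
```

Features other than 0 and 3 generally show nonzero readout values even though they are absent from `x`. An SAE is one attempt to undo this mixing.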

A sparse autoencoder (SAE) tries to learn a better coordinate system.

The encoder looks at a dense activation and asks which hidden features are present.
The decoder turns those hidden features back into a reconstruction of the original activation.
The sparsity pressure, whether a soft penalty or a hard cap, lets only a small number of features fire for each token.

That is intended to make the latent code behave like a parts list for the residual stream. Instead of saying "this activation is a mysterious 4096-dimensional vector", we say "this activation seems to use a few reusable feature directions". The practical teaching knob is the explanatory budget per activation: how many features are you allowed to use before interpretability starts to melt into dense mush again?

Math

Translate the story into symbols, assumptions, and a derivation you can inspect.


Let $x \in \mathbb{R}^d$ be an activation vector from some model layer, often the residual stream.

Encode to sparse latents, decode back to the activation

An SAE maps $x$ into a sparse latent code $z \in \mathbb{R}^m$ and reconstructs it as $\hat x$:

$$z = \mathrm{ReLU}(W_{\text{enc}}(x - b_{\text{pre}}) + b_{\text{enc}}), \qquad \hat x = W_{\text{dec}} z + b_{\text{pre}}.$$

The columns of $W_{\text{dec}} \in \mathbb{R}^{d \times m}$ act like a learned feature dictionary. If $z_j$ is active (nonzero), the $j$-th column, scaled by $z_j$, is added to the reconstruction.
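Equation 1 can be mirrored as a shape-checked forward pass. This is a minimal sketch with randomly initialised, untrained parameters and assumed toy sizes ($d = 6$, $m = 16$). The function name `sae_forward` is illustrative.

```python
import numpy as np

def sae_forward(x, W_enc, b_enc, W_dec, b_pre):
    """Encode x to nonnegative latents z, then decode to x_hat (Equation 1)."""
    z = np.maximum(W_enc @ (x - b_pre) + b_enc, 0.0)  # ReLU encoder, shape (m,)
    x_hat = W_dec @ z + b_pre                          # dictionary decode, shape (d,)
    return z, x_hat

rng = np.random.default_rng(0)
d, m = 6, 16  # assumed toy sizes; m > d gives an overcomplete dictionary

W_enc = 0.1 * rng.standard_normal((m, d))
W_dec = rng.standard_normal((d, m))
W_dec /= np.linalg.norm(W_dec, axis=0, keepdims=True)  # unit-norm dictionary columns
b_enc = np.zeros(m)
b_pre = np.zeros(d)

x = rng.standard_normal(d)
z, x_hat = sae_forward(x, W_enc, b_enc, W_dec, b_pre)

# The reconstruction is b_pre plus a weighted sum of the active dictionary columns.
active = np.flatnonzero(z)
x_hat_from_atoms = b_pre + sum(z[j] * W_dec[:, j] for j in active)

print("active latents:", active)
print("matches column-sum view:", np.allclose(x_hat, x_hat_from_atoms))
```

Only the columns indexed by `active` contribute to `x_hat`, which is the sense in which the decoder behaves like a parts list.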

Reconstruction plus sparsity

The classic objective balances faithfulness and simplicity:

$$\mathcal{L} = \lVert x - \hat x \rVert_2^2 + \lambda \lVert z \rVert_1.$$

$\lVert x - \hat x \rVert_2^2$ asks the dictionary to explain the real activation.
$\lambda \lVert z \rVert_1$ punishes too many active features.

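If $\lambda$ is too small, the code becomes dense and hard to interpret. If it is too large, reconstruction worsens: active latents are shrunk toward zero and some features stop firing altogether, so useful structure is missed. Because the penalty acts on $z$ alone, the decoder columns are usually kept at unit norm; otherwise the model could shrink $z$ and scale up $W_{\text{dec}}$ to lower the penalty without becoming any sparser.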
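To make the objective concrete, here is a small sketch of Equation 2 as a function, evaluated on a hand-built example. The dictionary, the activation, and the name `sae_loss` are hypothetical. Two different codes reconstruct the same activation exactly, and the $L_1$ term prefers the one that uses a single atom.

```python
import numpy as np

def sae_loss(x, x_hat, z, lam):
    """Squared reconstruction error plus an L1 penalty on the latents (Equation 2)."""
    recon = np.sum((x - x_hat) ** 2)
    sparsity = lam * np.sum(np.abs(z))
    return recon + sparsity

# Hypothetical 2-D dictionary with three unit-norm atoms (columns).
W_dec = np.array([[1.0, 0.0, 1.0 / np.sqrt(2.0)],
                  [0.0, 1.0, 1.0 / np.sqrt(2.0)]])
x = np.array([1.0, 1.0])

z_two_atoms = np.array([1.0, 1.0, 0.0])          # uses atoms 0 and 1
z_one_atom = np.array([0.0, 0.0, np.sqrt(2.0)])  # uses only the diagonal atom

lam = 0.1
loss_two = sae_loss(x, W_dec @ z_two_atoms, z_two_atoms, lam)
loss_one = sae_loss(x, W_dec @ z_one_atom, z_one_atom, lam)
print("two-atom code loss:", loss_two)
print("one-atom code loss:", loss_one)
```

Both codes have essentially zero reconstruction error, so the difference in loss comes entirely from the penalty: $0.1 \cdot 2$ for the two-atom code versus $0.1 \cdot \sqrt{2}$ for the one-atom code.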

Top-k sparse coding

Some recent SAE work replaces the soft $L_1$ penalty with a hard "only keep the best $k$ features" rule:

$$a = \mathrm{ReLU}(W_{\text{enc}}(x - b_{\text{pre}}) + b_{\text{enc}}), \qquad z = \mathrm{TopK}(a, k), \qquad \hat x = W_{\text{dec}}z + b_{\text{pre}}, \qquad \mathcal{L} = \lVert x - \hat x \rVert_2^2.$$

Here $\mathrm{TopK}(a, k)$ keeps the $k$ largest nonnegative activation scores and sets the rest to zero.

Now $k$ is explicit: each activation may use at most $k$ active latents (fewer if fewer than $k$ scores survive the ReLU). This makes the reconstruction versus interpretability tradeoff visible in one number and easier to reason about.
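The $\mathrm{TopK}$ step can be sketched in NumPy as follows. The helper name `top_k` and the score values are made up for illustration, and this is one possible implementation rather than the one used by any particular SAE library.

```python
import numpy as np

def top_k(a, k):
    """Keep the k largest entries of each row of a and zero out the rest."""
    a = np.asarray(a, dtype=float)
    # argpartition places the indices of the k largest entries in the last k slots.
    idx = np.argpartition(a, -k, axis=-1)[..., -k:]
    z = np.zeros_like(a)
    np.put_along_axis(z, idx, np.take_along_axis(a, idx, axis=-1), axis=-1)
    return z

# Post-ReLU scores for two tokens (made-up numbers).
a = np.array([[0.0, 0.9, 0.1, 0.4, 0.0, 0.7],
              [0.3, 0.0, 0.0, 0.0, 0.0, 0.0]])
z = top_k(a, k=3)
print(z)
print("active latents per token:", np.count_nonzero(z, axis=1))
```

The first token keeps its three largest scores. The second token has only one positive score after the ReLU, so it ends up with one active latent even though $k = 3$.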

Code

Keep the implementation aligned with the notation so the algorithm is legible.


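import numpy as np

# Synthetic "activations": each row of X mixes k of m unit-norm ground-truth atoms, plus noise.
rs = np.random.RandomState(0)
n, d, m, k = 256, 10, 24, 3
D_true = rs.randn(d, m)
D_true /= np.linalg.norm(D_true, axis=0, keepdims=True)
Z_true = np.zeros((n, m))
for row in Z_true:
    row[rs.choice(m, k, replace=False)] = rs.uniform(0.5, 1.5, k)
X = Z_true @ D_true.T + 0.02 * rs.randn(n, d)


def train_sae(lam, steps=500, lr=0.05):
    """Train a bias-free ReLU SAE by full-batch gradient descent.

    Returns (mean squared error, average number of active latents per row).
    """
    W_enc, W_dec = 0.1 * rs.randn(m, d), 0.1 * rs.randn(d, m)
    for _ in range(steps):
        pre = X @ W_enc.T         # encoder pre-activations, shape (n, m)
        Z = np.maximum(pre, 0.0)  # ReLU latents
        X_hat = Z @ W_dec.T       # reconstruction, shape (n, d)
        # Gradient of the batch-averaged 0.5 * squared error with respect to X_hat.
        err = (X_hat - X) / n
        grad_dec = err.T @ Z
        # Reconstruction gradient plus the L1 subgradient on active latents.
        grad_z = err @ W_dec + lam * (Z > 0) / n
        # Backpropagate through the ReLU mask into the encoder.
        W_enc -= lr * ((grad_z * (pre > 0)).T @ X)
        W_dec -= lr * grad_dec
        # Renormalise dictionary columns to unit norm after every step.
        W_dec /= np.linalg.norm(W_dec, axis=0, keepdims=True) + 1e-9
    Z = np.maximum(X @ W_enc.T, 0.0)
    X_hat = Z @ W_dec.T
    return np.mean((X - X_hat) ** 2), np.mean(np.count_nonzero(Z > 1e-3, axis=1))


faithful = train_sae(lam=0.0005)
sparse = train_sae(lam=0.30)
print("low lambda:  mse %.4f, avg active latents %.1f" % faithful)
print("high lambda: mse %.4f, avg active latents %.1f" % sparse)
# The tradeoff: higher lambda should give fewer active latents and a larger error.
assert sparse[1] < faithful[1] and sparse[0] > faithful[0]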

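This toy SAE trains an encoder and a unit-norm decoder dictionary on synthetic activations. For brevity it omits the bias terms $b_{\text{pre}}$ and $b_{\text{enc}}$ from Equation 1. Raising $\lambda$ makes the latent code sparser on average but raises the reconstruction error, which is the reconstruction-sparsity tradeoff that the page's claim describes.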

Interactive Demo

Use direct manipulation to connect the explanation to a moving system.


Use the demo to explore the main SAE design tradeoff:

how reconstruction error falls as more features are allowed to fire,
how this toy frontier illustrates an $L_1$ shrinkage failure mode and contrasts it with TopK and gated-style mechanisms,
and how "better reconstruction" is not the same thing as "cleaner, more interpretable features".
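The $L_1$ shrinkage failure mode can be seen in a one-atom toy calculation. This is a hedged sketch, not the demo's actual implementation, and it does not model gated mechanisms. Assume the activation is $x = c\,d$ for a single unit-norm atom $d$ with true strength $c$. Minimising $(c - z)^2 + \lambda z$ over $z \ge 0$ gives $z^* = \max(c - \lambda/2,\, 0)$, so the penalised latent systematically underestimates $c$. A hard TopK rule with no penalty has no such bias. The names `l1_optimal_latent`, `c`, and `lam` are illustrative.

```python
import numpy as np

def l1_optimal_latent(c, lam):
    """Minimiser of (c - z)^2 + lam * z over z >= 0, for a single unit-norm atom."""
    return max(c - lam / 2.0, 0.0)

c, lam = 1.0, 0.4  # assumed true feature strength and penalty weight

# Brute-force check of the closed form on a fine grid.
grid = np.linspace(0.0, 2.0, 20001)
objective = (c - grid) ** 2 + lam * grid
z_grid = grid[np.argmin(objective)]

z_l1 = l1_optimal_latent(c, lam)  # shrunk below the true strength
z_topk = c                        # with no penalty, the error-minimising latent is c itself

print("L1-optimal latent:", z_l1, "| grid search:", round(float(z_grid), 4))
print("reconstruction error, L1:  ", (c - z_l1) ** 2)
print("reconstruction error, TopK:", (c - z_topk) ** 2)
```

The gap between `z_l1` and `c` grows linearly with $\lambda$, which is why lowering the number of active features through an $L_1$ penalty also costs reconstruction accuracy on the features that stay active.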


Source Grounding

Canonical references for the mechanism on this page.

Bricken et al. (2023). Towards Monosemanticity: Decomposing Language Models With Dictionary Learning. Article.

Grounds sparse autoencoders as dictionary-learning tools for decomposing activations into more interpretable features.

Gao et al. (2024). Scaling and evaluating sparse autoencoders. Paper.

Grounds SAE scaling and evaluation tradeoffs for larger language-model activations.

Claim Review

Sparse autoencoders learn a reusable dictionary of feature directions so dense model activations can be explained by a small set of interpretable latent factors.

Status: 1 substantive review recorded

Claims without a substantive review badge still need exact source-support review.

Sources: 2 references

bricken-2023-monosemanticity, gao-2024-scaling-sae

Witnesses: 4 local objects

Use equation, code, and demo objects to check whether the source support is operational.

Substantively reviewed: Sparse autoencoders learn decoder dictionaries that reconstruct language-model activations from sparse latent codes, trading reconstruction error against sparsity so each activation uses a small set of active latents.

Claim metadata: source checked

Bricken et al. ground SAEs as dictionary-learning tools for decomposing model activations into learned features. Gao et al. describe SAEs as reconstructing language-model activations from a sparse bottleneck and frame training around the reconstruction-sparsity tradeoff. Local math/code witness the bounded mechanism.

Sources: Towards Monosemanticity: Decomposing Language Models With Dictionary Learning; Scaling and evaluating sparse autoencoders.

Certifies only SAE reconstruction/sparsity: decoder dictionaries reconstruct LM activations from sparse latents under sparsity objectives. It does not certify universal monosemanticity, causal completeness, interpretability, steering/circuit usefulness, or TopK/gated dominance.

A bounded review summary is present; still check caveats and exact source scope.

Bricken et al. support SAE dictionary learning over transformer activations with a ReLU encoder, decoder/dictionary reconstruction, MSE plus L1 sparsity, and hidden activations as learned features. Gao et al. support language-model activation reconstruction from sparse bottlenecks, L0/MSE evaluation, and reconstruction-sparsity plus TopK direct sparsity control. Local math and code witness the bounded mechanism; the synthetic demo is intentionally outside claim refs.

Reviewer: codex+oracle+codex-5.3; reviewed 2026-05-08

source-span-bricken-2023-monosemanticity source-span-gao-2024-scaling-sae math-object-1 math-object-2 code-witness-1



Domain

Representation Learning

representation-learning, sparse-autoencoders, interpretability, mechanistic-interpretability, autoencoders, feature-steering

Prerequisites

Representation Learning & Embedding Geometry, superposition, probing, induction-heads

Leads To

circuit-discovery, activation-steering
