Representation Learning

Sparse Autoencoders: Feature Dictionaries for Mechanistic Interpretability

Sparse autoencoders learn a reusable dictionary of feature directions so dense model activations can be explained by a small set of interpretable latent factors.

status: published · importance: important · difficulty: 4/5 · math: undergraduate · read: 16m · live demo
Editorial interpretability illustration of dense activations routed through sparse feature dictionary atoms and reconstructed outputs.



01 Intuition

Build the mental picture first so the rest of the page has something to attach to.

Large models do not usually store one clean concept per neuron.

Instead, many concepts are packed into the same activation vector. A single neuron might respond partly to "Python code", partly to "HTML tag", and partly to "list formatting". This is the superposition problem: the model represents more features than it has coordinates, so several overlapping features end up sharing the same coordinates.

A sparse autoencoder (SAE) tries to learn a better coordinate system.

  • The encoder looks at a dense activation and asks which hidden features are present.
  • The decoder turns those hidden features back into a reconstruction of the original activation.
  • The sparsity constraint says only a small number of features are allowed to fire for each token.

That is intended to make the latent code behave like a parts list for the residual stream. Instead of saying "this activation is a mysterious 4096-dimensional vector", we say "this activation seems to use a few reusable feature directions". The practical teaching knob is the explanatory budget per activation: how many features are you allowed to use before interpretability starts to melt into dense mush again?
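A minimal sketch of superposition, assuming a hypothetical 2-dimensional activation space that has to hold three features. The directions and the choice of 120-degree spacing are illustrative assumptions, not taken from any real model.

```python
import numpy as np

# Hypothetical setup: three feature directions squeezed into a 2-D activation
# space, spaced 120 degrees apart so they overlap as little as possible.
angles = np.deg2rad([90.0, 210.0, 330.0])
directions = np.stack([np.cos(angles), np.sin(angles)], axis=1)  # shape (3, 2), unit rows

# An activation in which only feature 0 is present, with strength 1.
activation = 1.0 * directions[0]

# Reading each feature off with a dot product shows interference:
# feature 0 reads about 1.0, but features 1 and 2 read about -0.5
# even though they are absent.
readout = directions @ activation
print(np.round(readout, 3))

# A ReLU clips the negative interference, leaving only feature 0 active.
print(np.round(np.maximum(readout, 0.0), 3))
```

With more features than dimensions the directions cannot all be orthogonal, so every readout carries some interference from the others. Sparsity helps because few features are active at once, which keeps that interference small.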

02 Math

Translate the story into symbols, assumptions, and a derivation you can inspect.

Let $x \in \mathbb{R}^d$ be an activation vector from some model layer, often the residual stream.

Encode to sparse latents, decode back to the activation

An SAE maps $x$ into a sparse latent code $z \in \mathbb{R}^m$ and reconstructs it as $\hat x$:

$$z = \mathrm{ReLU}(W_{\text{enc}}(x - b_{\text{pre}}) + b_{\text{enc}}), \qquad \hat x = W_{\text{dec}} z + b_{\text{pre}}.$$

The columns of $W_{\text{dec}}$ act like a learned feature dictionary. If $z_j$ is active, the $j$th feature direction contributes to the reconstruction.
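The encode and decode equations can be sketched directly in NumPy. The function name `sae_forward`, the shapes, and the random weights below are assumptions chosen only to check the plumbing; nothing here is trained.

```python
import numpy as np

def sae_forward(x, W_enc, b_enc, W_dec, b_pre):
    """Encode activations x (n, d) into latents z (n, m) and reconstruct x_hat (n, d)."""
    z = np.maximum((x - b_pre) @ W_enc.T + b_enc, 0.0)  # ReLU(W_enc (x - b_pre) + b_enc)
    x_hat = z @ W_dec.T + b_pre                          # W_dec z + b_pre
    return z, x_hat

# Hypothetical shapes and untrained random weights.
rng = np.random.default_rng(0)
n, d, m = 4, 8, 16
x = rng.normal(size=(n, d))
W_enc = 0.1 * rng.normal(size=(m, d))
W_dec = 0.1 * rng.normal(size=(d, m))
b_enc = np.zeros(m)
b_pre = x.mean(axis=0)

z, x_hat = sae_forward(x, W_enc, b_enc, W_dec, b_pre)
print(z.shape, x_hat.shape)

# Each column of W_dec is one dictionary atom: the reconstruction of the first
# sample is b_pre plus the z-weighted sum of those columns.
manual = sum(z[0, j] * W_dec[:, j] for j in range(m)) + b_pre
print(np.allclose(manual, x_hat[0]))
```

Writing the reconstruction as an explicit sum over columns makes the "parts list" reading concrete: a latent that is zero contributes nothing.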

Reconstruction plus sparsity

The classic objective balances faithfulness and simplicity:

$$\mathcal{L} = \lVert x - \hat x \rVert_2^2 + \lambda \lVert z \rVert_1.$$

  • $\lVert x - \hat x \rVert_2^2$ asks the dictionary to explain the real activation.
  • $\lambda \lVert z \rVert_1$ penalizes the summed magnitude of the latents, which discourages many features from being active at once.

If $\lambda$ is too small, the code becomes dense and hard to interpret. If it is too large, reconstruction worsens and useful structure may be missed or pushed into inactive latents.
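A worked one-feature case shows how the $L_1$ term biases active latents toward zero. It assumes a single unit-norm dictionary direction $u$, an activation $x = a\,u$ with $a > 0$, zero biases, and that the code $z \ge 0$ is optimized directly and not produced by an encoder. These are simplifications introduced here for illustration.

```latex
\begin{aligned}
\mathcal{L}(z) &= \lVert a u - z u \rVert_2^2 + \lambda \lvert z \rvert = (a - z)^2 + \lambda z, \qquad z \ge 0, \\
\frac{d\mathcal{L}}{dz} &= -2(a - z) + \lambda = 0 \quad\Longrightarrow\quad z^\star = \max\!\left(a - \tfrac{\lambda}{2},\, 0\right).
\end{aligned}
```

Under these assumptions the optimal code is the true strength $a$ shrunk by $\lambda/2$, and the feature switches off entirely once $a \le \lambda/2$. This is the shrinkage effect: the penalty reduces the magnitude of features that are genuinely present, not only the number of active features.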

Top-k sparse coding

Some recent SAE work replaces the soft $L_1$ penalty with a hard "only keep the best $k$ features" rule:

$$a = \mathrm{ReLU}(W_{\text{enc}}(x - b_{\text{pre}}) + b_{\text{enc}}), \qquad z = \mathrm{TopK}(a, k), \qquad \hat x = W_{\text{dec}} z + b_{\text{pre}}, \qquad \mathcal{L} = \lVert x - \hat x \rVert_2^2.$$

Here $\mathrm{TopK}(a, k)$ keeps the $k$ largest nonnegative activation scores and sets the rest to zero.

Now $k$ is explicit: each activation gets a fixed budget of kept latent slots. This is easier to reason about pedagogically and makes the reconstruction versus interpretability tradeoff visible in one number.
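A minimal sketch of the TopK step in NumPy, assuming the scores `a` have already passed through the ReLU. The helper name `top_k` and the example scores are hypothetical.

```python
import numpy as np

def top_k(a, k):
    """Keep the k largest entries in each row of a (n, m) and zero out the rest."""
    # argpartition places the indices of the k largest entries in the last k slots.
    idx = np.argpartition(a, -k, axis=1)[:, -k:]
    z = np.zeros_like(a)
    np.put_along_axis(z, idx, np.take_along_axis(a, idx, axis=1), axis=1)
    return z

# Hypothetical nonnegative scores for two activations and five latents.
a = np.array([[0.0, 2.0, 0.5, 1.5, 0.1],
              [3.0, 0.0, 0.0, 0.2, 0.0]])
z = top_k(a, k=2)
print(z)
print(np.count_nonzero(z, axis=1))  # at most k nonzero latents per row
```

Unlike the $L_1$ penalty, this rule leaves the kept values untouched, so the sparsity level is set by $k$ alone and not by a penalty weight that also shrinks magnitudes.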

03 Code

Keep the implementation aligned with the notation so the algorithm is legible.

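import numpy as np

# Synthetic "activations": each row of X mixes k ground-truth dictionary atoms
# with random strengths, plus a little Gaussian noise.
rs = np.random.RandomState(0)
n, d, m, k = 256, 10, 24, 3  # samples, activation dim, latent dim, atoms per sample
D_true = rs.randn(d, m)
D_true /= np.linalg.norm(D_true, axis=0, keepdims=True)  # unit-norm columns
Z_true = np.zeros((n, m))
for row in Z_true:
    row[rs.choice(m, k, replace=False)] = rs.uniform(0.5, 1.5, k)
X = Z_true @ D_true.T + 0.02 * rs.randn(n, d)


def train_sae(lam, steps=500, lr=0.05):
    """Full-batch gradient descent on a bias-free SAE.

    Minimizes the batch mean of 0.5 * ||x - x_hat||^2 + lam * ||z||_1 and
    returns (mse, average number of active latents per sample).
    """
    W_enc, W_dec = 0.1 * rs.randn(m, d), 0.1 * rs.randn(d, m)
    for _ in range(steps):
        pre = X @ W_enc.T         # encoder pre-activations, shape (n, m)
        Z = np.maximum(pre, 0.0)  # ReLU latents
        X_hat = Z @ W_dec.T       # reconstruction, shape (n, d)
        err = (X_hat - X) / n     # gradient of the reconstruction term w.r.t. X_hat
        grad_dec = err.T @ Z
        # Backprop into Z: reconstruction term plus the L1 subgradient on active latents.
        grad_z = err @ W_dec + lam * (Z > 0) / n
        W_enc -= lr * ((grad_z * (pre > 0)).T @ X)
        W_dec -= lr * grad_dec
        # Renormalize decoder columns so the L1 penalty cannot be dodged by scaling W_dec up.
        W_dec /= np.linalg.norm(W_dec, axis=0, keepdims=True) + 1e-9
    Z = np.maximum(X @ W_enc.T, 0.0)
    X_hat = Z @ W_dec.T
    return np.mean((X - X_hat) ** 2), np.mean(np.count_nonzero(Z > 1e-3, axis=1))


faithful = train_sae(lam=0.0005)
sparse = train_sae(lam=0.30)
print("low lambda:  mse %.4f, avg active latents %.1f" % faithful)
print("high lambda: mse %.4f, avg active latents %.1f" % sparse)
# The tradeoff: the high-lambda run should be sparser but reconstruct worse.
assert sparse[1] < faithful[1] and sparse[0] > faithful[0]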

This toy SAE trains an encoder and decoder dictionary on synthetic activations. Raising $\lambda$ makes the latent code sparser on average, but the reconstruction error rises: exactly the local tradeoff the claim is about.
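MSE and active-latent counts do not say whether a learned dictionary matches the ground-truth atoms used to build the synthetic data. One possible check, sketched here, scores each true atom by its best cosine similarity to any learned atom. The helper `max_cosine_match` is hypothetical, and applying it to the toy above assumes a variant of `train_sae` that also returns `W_dec`, which the version shown does not.

```python
import numpy as np

def max_cosine_match(D_learned, D_true):
    """For each true atom (column of D_true), the best cosine similarity to any learned atom."""
    A = D_learned / (np.linalg.norm(D_learned, axis=0, keepdims=True) + 1e-9)
    B = D_true / (np.linalg.norm(D_true, axis=0, keepdims=True) + 1e-9)
    cos = B.T @ A  # shape (num_true_atoms, num_learned_atoms)
    return cos.max(axis=1)

# Sanity check on hypothetical data: a permuted, rescaled copy of the true
# dictionary should match every true atom with cosine similarity close to 1.
rng = np.random.default_rng(0)
D_true = rng.normal(size=(10, 24))
D_shuffled = 3.0 * D_true[:, rng.permutation(24)]
scores = max_cosine_match(D_shuffled, D_true)
print(scores.min())
```

The score ignores the order and scale of atoms, which is appropriate because a dictionary is only identifiable up to permutation and rescaling of its columns.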

04 Interactive Demo

Use direct manipulation to connect the explanation to a moving system.

Use the demo to explore the main SAE design tradeoff:

  • how reconstruction error falls as more features are allowed to fire,
  • how this toy frontier illustrates an $L_1$ shrinkage failure mode and contrasts it with TopK and gated-style mechanisms,
  • and how "better reconstruction" is not the same thing as "cleaner, more interpretable features".
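The first and last points can be previewed offline with a small sweep over $k$. This is not the page's demo: it is a sketch under a strong simplifying assumption, an orthonormal dictionary with dense nonnegative codes, chosen so that encoding with the transposed dictionary is exact. The function name `topk_mse` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
# Assumption: an orthonormal dictionary obtained from a QR decomposition.
D, _ = np.linalg.qr(rng.normal(size=(d, d)))
codes = np.abs(rng.normal(size=(200, d)))  # dense nonnegative ground-truth codes
X = codes @ D.T

def topk_mse(X, D, k):
    """Encode with D.T, keep the k largest latents per row, and return the reconstruction MSE."""
    a = np.maximum(X @ D, 0.0)
    thresh = np.sort(a, axis=1)[:, -k][:, None]  # k-th largest score in each row
    z = np.where(a >= thresh, a, 0.0)
    return np.mean((X - z @ D.T) ** 2)

ks = (1, 2, 4, 8, 16)
errors = [topk_mse(X, D, k) for k in ks]
for k, e in zip(ks, errors):
    print("k=%2d  mse=%.4f" % (k, e))
```

In this setup the error can only fall as $k$ grows and reaches roughly zero at $k = d$, but at that point every latent is active and the code is as dense as the original activation. Lower error alone therefore says nothing about whether the features are easier to interpret.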



Source Grounding

Canonical references for the mechanism on this page.

article · 2023 · Towards Monosemanticity: Decomposing Language Models With Dictionary Learning · Bricken et al.

Grounds sparse autoencoders as dictionary-learning tools for decomposing activations into more interpretable features.

paper · 2024 · Scaling and evaluating sparse autoencoders · Gao et al.

Grounds SAE scaling and evaluation tradeoffs for larger language-model activations.

Claim Review

Sparse autoencoders learn a reusable dictionary of feature directions so dense model activations can be explained by a small set of interpretable latent factors.

Status: 1 substantive review recorded

Claims without a substantive review badge still need exact source-support review.

Sources: 2 references

bricken-2023-monosemanticity, gao-2024-scaling-sae

Witnesses: 4 local objects

Use equation, code, and demo objects to check whether the source support is operational.

Substantively reviewed: Sparse autoencoders learn decoder dictionaries that reconstruct language-model activations from sparse latent codes, trading reconstruction error against sparsity so each activation uses a small set of active latents.

Claim metadata: source checked

Bricken et al. ground SAEs as dictionary-learning tools for decomposing model activations into learned features. Gao et al. describe SAEs as reconstructing language-model activations from a sparse bottleneck and frame training around the reconstruction-sparsity tradeoff. Local math/code witness the bounded mechanism.

Sources: Towards Monosemanticity: Decomposing Language Models With Dictionary Learning; Scaling and evaluating sparse autoencoders

Certifies only SAE reconstruction/sparsity: decoder dictionaries reconstruct LM activations from sparse latents under sparsity objectives. It does not certify universal monosemanticity, causal completeness, interpretability, steering/circuit usefulness, or TopK/gated dominance.

A bounded review summary is present; still check caveats and exact source scope.

Bricken et al. support SAE dictionary learning over transformer activations with a ReLU encoder, decoder/dictionary reconstruction, MSE plus L1 sparsity, and hidden activations as learned features. Gao et al. support language-model activation reconstruction from sparse bottlenecks, L0/MSE evaluation, and reconstruction-sparsity plus TopK direct sparsity control. Local math and code witness the bounded mechanism; the synthetic demo is intentionally outside claim refs.

Reviewer: codex+oracle+codex-5.3; reviewed 2026-05-08
