Mechanistic Interpretability

Reverse-engineering neural computation

What Do Networks Compute?

Neural networks learn rich internal representations, but understanding what they compute remains challenging. Mechanistic interpretability aims to reverse-engineer these computations into human-understandable algorithms.

The goal: not just that a network works, but how it works, neuron by neuron, layer by layer.

Reading Activations

The first step is understanding what patterns of activation mean. Some neurons respond to interpretable features — edges, colors, concepts. Others encode more abstract properties.

The visualization shows activations across layers and time. Highlighted neurons mark potentially interpretable patterns.
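
In practice, reading activations starts with recording them. A minimal PyTorch sketch using forward hooks on a small stand-in network (the architecture, sizes, and inputs here are illustrative, not a specific model):

import torch
import torch.nn as nn

# Hypothetical stand-in network; in practice this is the model under study.
model = nn.Sequential(
    nn.Linear(32, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 10),
)

activations = {}

def make_hook(name):
    # Store a detached copy of the layer's output on every forward pass.
    def hook(module, inputs, output):
        activations[name] = output.detach()
    return hook

# Register hooks on the layers whose activations we want to read.
for idx, layer in enumerate(model):
    if isinstance(layer, nn.ReLU):
        layer.register_forward_hook(make_hook(f"relu_{idx}"))

x = torch.randn(8, 32)  # a batch of illustrative inputs
model(x)

for name, act in activations.items():
    # Fraction of zeros is a rough per-layer sparsity measure.
    sparsity = (act == 0).float().mean().item()
    print(f"{name}: shape={tuple(act.shape)}, sparsity={sparsity:.2f}")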

Superposition

A key challenge: networks encode more features than they have neurons. Features are represented in superposition — overlapping, distributed patterns that interfere with each other.

Superposition Hypothesis

If features are sparse (rarely active together), a network can encode n features in m dimensions where n > m by tolerating some interference.

This is efficient but makes interpretation harder — we can't just look at individual neurons.
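
A toy sketch of the hypothesis, assuming random unit directions as feature vectors and only a few features active at once (all numbers here are illustrative):

import numpy as np

rng = np.random.default_rng(0)
n_features, n_dims = 50, 20  # more features than dimensions (n > m)

# Each feature gets a random unit direction; with n > m they cannot all be orthogonal.
directions = rng.normal(size=(n_features, n_dims))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

# A sparse input: only a few features active at the same time.
active = rng.choice(n_features, size=3, replace=False)
x = directions[active].sum(axis=0)

# Read each feature back by projecting onto its direction.
readout = directions @ x
for i in np.argsort(-np.abs(readout))[:5]:
    label = "active" if i in active else "interference"
    print(f"feature {i:2d}: readout {readout[i]:+.2f} ({label})")

Active features read out strongly, while inactive ones pick up small but nonzero interference terms.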

Sparse Autoencoders

Sparse autoencoders (SAEs) attempt to disentangle superposed representations. They learn a dictionary of features that reconstructs activations:

SAE Objective

min ||x - W f(Wᵀx)||² + λ||f(Wᵀx)||₁

The sparsity penalty encourages each input to be explained by a small number of features, ideally corresponding to interpretable concepts.
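
A minimal sketch of this objective in PyTorch, assuming a ReLU for f and tied encoder/decoder weights Wᵀ and W as written above (untied weights and bias terms are also common in practice):

import torch
import torch.nn as nn
import torch.nn.functional as F

class TiedSparseAutoencoder(nn.Module):
    def __init__(self, d_model, d_features):
        super().__init__()
        # Columns of W form the dictionary of feature directions.
        self.W = nn.Parameter(torch.randn(d_model, d_features) * 0.01)

    def forward(self, x):
        f = F.relu(x @ self.W)   # feature activations f(Wᵀx)
        x_hat = f @ self.W.T     # reconstruction W f(Wᵀx)
        return x_hat, f

def sae_loss(x, x_hat, f, lam=1e-3):
    # Reconstruction error plus L1 sparsity penalty on feature activations.
    return F.mse_loss(x_hat, x) + lam * f.abs().mean()

# One training step on a stand-in batch of model activations
# (in practice, activations collected with hooks as above).
sae = TiedSparseAutoencoder(d_model=64, d_features=512)
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)

acts = torch.randn(256, 64)
x_hat, f = sae(acts)
loss = sae_loss(acts, x_hat, f)
opt.zero_grad()
loss.backward()
opt.step()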

Feature Interpretation

Once we have features, we can study them:

  • Find examples that maximally activate each feature
  • Ablate features to see their causal effect (see the sketch below)
  • Track how features compose across layers

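A sketch of the first two steps, reusing the tied SAE above (the function and variable names here are illustrative, and run_rest_of_model is a hypothetical stand-in for the layers downstream of the hook point):

import torch

@torch.no_grad()
def top_activating_examples(sae, acts, feature_idx, k=10):
    # Rank dataset examples by how strongly they activate one SAE feature.
    _, f = sae(acts)
    return torch.topk(f[:, feature_idx], k=k).indices

@torch.no_grad()
def ablate_feature(sae, acts, feature_idx):
    # Rebuild the activations with a single feature zeroed out; the edited
    # activations can then be fed back into the rest of the model.
    _, f = sae(acts)
    f = f.clone()
    f[:, feature_idx] = 0.0
    return f @ sae.W.T

# Usage (run_rest_of_model is hypothetical):
# edited = ablate_feature(sae, acts, feature_idx=42)
# baseline_logits = run_rest_of_model(acts)
# ablated_logits = run_rest_of_model(edited)
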
The dream: a complete catalog of features and circuits that explains everything a model knows and computes.

Circuits

Features don't act in isolation — they connect into circuits. A circuit is a subgraph of the computation that implements a specific algorithm.

Example circuits discovered in language models:

  • Induction heads (in-context learning)
  • Indirect object identification
  • Greater-than comparison

Understanding circuits gives us mechanistic insight into model behavior and potential failure modes.
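
Circuits are typically located with causal interventions. One common technique is activation patching: run the model on a corrupted input, splice in a component's activation from a clean input, and measure how much of the original behavior returns. A minimal sketch with PyTorch hooks, assuming a hypothetical model, a clean/corrupted input pair of matching shape, and a scalar behavioral metric:

import torch

def activation_patching(model, layer, clean_input, corrupted_input, metric):
    # 1. Cache the layer's output on the clean input.
    cache = {}
    def save_hook(module, inputs, output):
        cache["clean"] = output.detach()
    handle = layer.register_forward_hook(save_hook)
    with torch.no_grad():
        clean_out = model(clean_input)
    handle.remove()

    # 2. Run the corrupted input, overwriting this layer's output
    #    with the cached clean activation.
    def patch_hook(module, inputs, output):
        return cache["clean"]
    handle = layer.register_forward_hook(patch_hook)
    with torch.no_grad():
        patched_out = model(corrupted_input)
    handle.remove()

    with torch.no_grad():
        corrupted_out = model(corrupted_input)

    # 3. Fraction of the clean behavior restored by the patch.
    return (metric(patched_out) - metric(corrupted_out)) / (
        metric(clean_out) - metric(corrupted_out)
    )

A score near 1 suggests the patched component carries much of the information the behavior depends on; a score near 0 suggests it does not.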