Mechanistic Interpretability
Reverse-engineering neural computation
What Do Networks Compute?
Neural networks learn rich internal representations, but understanding what they compute remains challenging. Mechanistic interpretability aims to reverse-engineer these computations into human-understandable algorithms.
The goal: to understand not just that a network works, but how it works, neuron by neuron, layer by layer.
Reading Activations
The first step is understanding what patterns of activation mean. Some neurons respond to interpretable features — edges, colors, concepts. Others encode more abstract properties.
The visualization shows activations across layers and time; highlighted neurons mark potentially interpretable patterns.
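In practice, reading activations starts with recording them. Here is a minimal sketch using PyTorch forward hooks; the model architecture and layer names are hypothetical placeholders for whatever network you are studying.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for a trained network; swap in your own model.
model = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 10),
)

activations = {}

def record(name):
    # Forward hook that stores a detached copy of the layer's output.
    def hook(module, inputs, output):
        activations[name] = output.detach()
    return hook

# Attach hooks to the layers we want to inspect.
handles = [
    model[1].register_forward_hook(record("layer1_relu")),
    model[3].register_forward_hook(record("layer2_relu")),
]

with torch.no_grad():
    model(torch.randn(32, 784))   # one forward pass on a batch of inputs

for name, act in activations.items():
    print(name, tuple(act.shape), act.mean().item())

for h in handles:                 # clean up the hooks afterwards
    h.remove()
```

Once the activations are captured, you can inspect which inputs drive a given neuron and whether its response pattern corresponds to anything interpretable.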
Superposition
A key challenge: networks encode more features than they have neurons. Features are represented in superposition — overlapping, distributed patterns that interfere with each other.
If features are sparse (rarely active together), a network can encode n features in m dimensions where n > m by tolerating some interference.
This is efficient but makes interpretation harder — we can't just look at individual neurons.
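A tiny numerical sketch of the sparsity argument (all numbers here are illustrative): store n random feature directions in m < n dimensions and check how much a sparse activation pattern interferes with itself when read back out.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, k = 200, 50, 3            # n features, m dimensions, k active at a time

# Random unit-norm feature directions: nearly orthogonal when m is large enough.
W = rng.standard_normal((m, n))
W /= np.linalg.norm(W, axis=0, keepdims=True)

# Sparse feature vector: only k of the n features are active.
f = np.zeros(n)
active = rng.choice(n, size=k, replace=False)
f[active] = 1.0

x = W @ f                        # superposed representation in m dimensions
readout = W.T @ x                # naive linear readout of every feature

print("active features  :", np.round(readout[active], 2))   # close to 1.0 each
mask = np.ones(n, dtype=bool)
mask[active] = False
print("max interference :", np.round(np.abs(readout[mask]).max(), 2))  # noticeably below 1.0
```

The active features read out strongly while the interference on inactive features stays smaller, which is exactly the trade the network is making: more features than dimensions, at the cost of cross-talk.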
Sparse Autoencoders
Sparse autoencoders (SAEs) attempt to disentangle superposed representations. They learn a dictionary of features that reconstructs activations:
min_W ||x − W f(Wᵀx)||² + λ||f(Wᵀx)||₁

where f is an elementwise nonlinearity (typically ReLU), the columns of W are the learned feature directions, and λ sets the strength of the sparsity penalty.
The sparsity penalty encourages each input to be explained by a small number of features, ideally corresponding to interpretable concepts.
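A minimal PyTorch sketch of this objective, with tied encoder/decoder weights and ReLU as the nonlinearity f. Real SAEs typically add bias terms, decoder-norm constraints, and other refinements not shown here; the dimensions and hyperparameters below are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TiedSparseAutoencoder(nn.Module):
    """Dictionary W of n_features directions; the code f(Wᵀx) is encouraged to be sparse."""
    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        # Columns of W are the feature directions (the learned dictionary).
        self.W = nn.Parameter(torch.randn(d_model, n_features) * 0.01)

    def forward(self, x):
        codes = F.relu(x @ self.W)      # f(Wᵀx): sparse feature activations
        recon = codes @ self.W.T        # W f(Wᵀx): reconstruction
        return recon, codes

def sae_loss(x, recon, codes, lam=1e-3):
    # ||x − W f(Wᵀx)||² + λ ||f(Wᵀx)||₁, averaged over the batch
    return F.mse_loss(recon, x) + lam * codes.abs().sum(dim=-1).mean()

# Toy training loop on random vectors; in practice x would be real model activations.
sae = TiedSparseAutoencoder(d_model=64, n_features=512)
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
for step in range(100):
    x = torch.randn(256, 64)            # stand-in for a batch of activations
    recon, codes = sae(x)
    loss = sae_loss(x, recon, codes)
    opt.zero_grad()
    loss.backward()
    opt.step()
```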
Feature Interpretation
Once we have features, we can study them:
- Find examples that maximally activate each feature
- Ablate features to see their causal effect (sketched below)
- Track how features compose across layers
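As a rough sketch of the ablation idea, reusing the SAE from the previous example: zero out one feature's activation, decode back, and compare. The feature index here is arbitrary, and in a real experiment the edited activations would be spliced back into the model's forward pass.

```python
import torch

# Assumes `sae` from the previous sketch and a batch of activations `x`.
x = torch.randn(8, 64)
feature_id = 42                               # hypothetical feature to ablate

with torch.no_grad():
    recon, codes = sae(x)
    ablated_codes = codes.clone()
    ablated_codes[:, feature_id] = 0.0        # switch the feature off
    ablated_recon = ablated_codes @ sae.W.T   # decode without that feature

# The difference is the feature's direct contribution to the activations;
# running the model on `ablated_recon` (e.g. via a forward hook that swaps
# it in) reveals the feature's causal effect on downstream behavior.
delta = (recon - ablated_recon).norm(dim=-1)
print(delta)
```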
The dream: a complete catalog of features and circuits that explains everything a model knows and computes.
Circuits
Features don't act in isolation — they connect into circuits. A circuit is a subgraph of the computation that implements a specific algorithm.
Example circuits discovered in language models:
- Induction heads (in-context learning)
- Indirect object identification
- Greater-than comparison
Understanding circuits gives us mechanistic insight into model behavior and potential failure modes.
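As one concrete illustration, induction heads attend from a token back to the position just after that token's previous occurrence. Given a sequence made of a random block repeated twice, a head's attention pattern can be scored for this behavior directly. The attention tensor below is a hypothetical input, and `get_attention_pattern` is a placeholder; how you extract attention weights depends on your model and tooling.

```python
import torch

def induction_score(attn: torch.Tensor, seq_len: int) -> float:
    """
    Score how strongly one head's attention pattern matches the induction
    pattern on a sequence consisting of a random block repeated twice.

    attn: (2*seq_len, 2*seq_len) attention weights for a single head
          (rows = query positions, columns = key positions).
    """
    total = 0.0
    # For each position in the second copy, the induction target is the
    # position one step after the same token's occurrence in the first copy.
    for q in range(seq_len, 2 * seq_len):
        target = q - seq_len + 1
        total += attn[q, target].item()
    return total / seq_len        # average attention mass on the induction target

# Usage sketch: `attn` would come from running the model on torch.cat([block, block]).
# attn = get_attention_pattern(model, tokens, layer=5, head=1)   # hypothetical helper
# print(induction_score(attn, seq_len=block.shape[-1]))
```

Heads that score highly under this kind of diagnostic are candidate members of the induction circuit; confirming their role still requires causal tests such as ablation or activation patching.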