Superposition, Sparse Features & Monosemanticity
Canonical Papers
- Toy Models of Superposition
- Towards Monosemanticity: Decomposing Language Models with Dictionary Learning
Core Mathematics
Features are represented not by one neuron each, but as sparse directions in activation space. This is formalized via dictionary learning:

$$\mathbf{x} \approx D\,\mathbf{f}, \qquad \mathbf{f} \text{ sparse}$$

where $\mathbf{x} \in \mathbb{R}^{d}$ are activations, the columns of the overcomplete dictionary $D \in \mathbb{R}^{d \times m}$ (with $m \gg d$) are *features*, and $\mathbf{f} \in \mathbb{R}^{m}$ are sparse codes.
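As a toy illustration of this decomposition (the sizes and the random dictionary below are made up for the example, not taken from either paper):

```python
import numpy as np

rng = np.random.default_rng(0)

d, m = 64, 512                      # activation dim, dictionary size (m >> d: overcomplete)
D = rng.standard_normal((d, m))
D /= np.linalg.norm(D, axis=0)      # unit-norm feature directions as columns

# A sparse code: only a handful of the m features are active at once.
f = np.zeros(m)
active = rng.choice(m, size=5, replace=False)
f[active] = rng.random(5)

x = D @ f                           # activation = sparse combination of feature directions
print(f"{np.count_nonzero(f)} active features out of {m} reconstruct a {d}-dim activation")
```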
Sparse autoencoders applied to transformer MLP activations recover relatively interpretable, "monosemantic" features.
Key Equation
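A standard form of the sparse-autoencoder objective implied by the setup above (up to notation and normalization details; the decoder $W_d$ plays the role of the dictionary $D$) pairs a reconstruction term with an $\ell_1$ sparsity penalty on the codes:

$$\mathbf{f} = \mathrm{ReLU}(W_e \mathbf{x} + \mathbf{b}_e), \qquad \hat{\mathbf{x}} = W_d \mathbf{f} + \mathbf{b}_d, \qquad \mathcal{L} = \lVert \mathbf{x} - \hat{\mathbf{x}} \rVert_2^2 + \lambda \lVert \mathbf{f} \rVert_1$$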
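A minimal PyTorch sketch of this objective; the layer sizes, $\lambda$, and the single training step are illustrative assumptions, not the papers' exact configuration:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Overcomplete SAE: encode activations into sparse codes, decode back."""

    def __init__(self, d_act: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_act, d_dict)  # rows act as feature detectors
        self.decoder = nn.Linear(d_dict, d_act)  # columns act as feature directions

    def forward(self, x: torch.Tensor):
        f = torch.relu(self.encoder(x))          # non-negative sparse codes
        x_hat = self.decoder(f)
        return x_hat, f

def sae_loss(x, x_hat, f, lam=1e-3):
    """Reconstruction error plus L1 sparsity penalty on the codes."""
    recon = (x - x_hat).pow(2).sum(-1).mean()
    sparsity = f.abs().sum(-1).mean()
    return recon + lam * sparsity

# Toy usage on random stand-ins for transformer MLP activations.
sae = SparseAutoencoder(d_act=64, d_dict=512)
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
x = torch.randn(256, 64)

opt.zero_grad()
x_hat, f = sae(x)
loss = sae_loss(x, x_hat, f)
loss.backward()
opt.step()
print(f"loss={loss.item():.3f}, mean active codes={(f > 0).float().sum(-1).mean().item():.1f}")
```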
Why It Matters for Modern Models
- Frontier LMs rely heavily on superposition: individual neurons participate in many overlapping features
- Monosemantic dictionaries are being applied to Claude-class models for interpretability and safety
Missing Intuition
What is still poorly explained in textbooks and papers:
- A simple geometric story for why superposition is useful (the capacity vs. interference trade-off; see the sketch after this list)
- Interactive views of how sparse autoencoders carve up activation space into overlapping feature directions
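A small numerical sketch of that capacity-vs-interference point, using random unit feature directions as a stand-in for a trained toy model: many more features than dimensions fit into the space, and cross-talk stays small as long as only a few features are active at once:

```python
import numpy as np

rng = np.random.default_rng(1)

d, m = 64, 256                       # m features squeezed into d dimensions
D = rng.standard_normal((d, m))
D /= np.linalg.norm(D, axis=0)       # unit-norm feature directions

# Interference = off-diagonal overlaps between feature directions.
G = D.T @ D
off_diag = G[~np.eye(m, dtype=bool)]
print(f"mean |interference| between features: {np.abs(off_diag).mean():.3f}")

# Reading features back out: with sparse activity, each active feature's
# own direction dominates the cross-talk from the few other active ones.
active = rng.choice(m, size=3, replace=False)
f = np.zeros(m)
f[active] = 1.0
x = D @ f
readout = D.T @ x                    # per-feature readout scores
mask = np.zeros(m, dtype=bool)
mask[active] = True
print(f"mean readout, active features:   {readout[mask].mean():.2f}")
print(f"mean readout, inactive features: {readout[~mask].mean():.2f}")
```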