11. Representations

Superposition, Sparse Features & Monosemanticity

Canonical Papers

Toy Models of Superposition

Elhage et al., 2022 (Anthropic)

Towards Monosemanticity: Decomposing Language Models with Dictionary Learning

Bricken et al., 2023 (Anthropic)

Core Mathematics

Features are represented not by one neuron each but as sparse directions in activation space. This is formalized via dictionary learning:

$$\min_{A,\{s_i\}} \sum_i \|h_i - A s_i\|_2^2 + \lambda \|s_i\|_1$$

where $h_i$ are activations, the columns of $A$ are the learned *features* (the dictionary), and $s_i$ are sparse codes.
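A minimal NumPy sketch of how the two terms of this objective are evaluated for a batch of activations; the dimensions, sparsity level, and noise scale are illustrative assumptions, not values from the papers above.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_features, n_samples = 64, 512, 1000   # assumed sizes for illustration

A = rng.normal(size=(d_model, n_features))        # dictionary: columns are feature directions
A /= np.linalg.norm(A, axis=0, keepdims=True)     # unit-norm columns (a common convention)
S = rng.normal(size=(n_features, n_samples))      # sparse codes s_i, one column per sample
S *= rng.random(S.shape) < 0.02                   # keep each code ~2% dense
H = A @ S + 0.01 * rng.normal(size=(d_model, n_samples))  # activations h_i to be explained

lam = 1e-3
recon_err = np.sum((H - A @ S) ** 2)              # sum_i ||h_i - A s_i||_2^2
sparsity = np.sum(np.abs(S))                      # sum_i ||s_i||_1
objective = recon_err + lam * sparsity
print(f"objective = {objective:.3f} (recon {recon_err:.3f}, L1 {sparsity:.1f})")
```

This only evaluates the objective for given $A$ and $s_i$; in practice both are optimized jointly, most commonly by training a sparse autoencoder as described next.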

Sparse autoencoders applied to transformer MLP activations recover relatively interpretable, "monosemantic" features.
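A minimal sparse-autoencoder sketch in PyTorch, in the spirit of Bricken et al. (2023). The layer sizes, expansion factor, and L1 coefficient here are illustrative assumptions, not the paper's exact settings.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)  # activation -> feature pre-activations
        self.decoder = nn.Linear(n_features, d_model)  # sparse code -> reconstruction

    def forward(self, h: torch.Tensor):
        s = torch.relu(self.encoder(h))                # non-negative sparse code s
        h_hat = self.decoder(s)                        # reconstruction (roughly A s + bias)
        return h_hat, s

d_model, n_features, l1_coeff = 512, 4096, 1e-3        # assumed sizes / coefficient
sae = SparseAutoencoder(d_model, n_features)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)

# Stand-in for a batch of MLP activations collected from a transformer.
h = torch.randn(256, d_model)

h_hat, s = sae(h)
loss = ((h - h_hat) ** 2).sum(dim=-1).mean() + l1_coeff * s.abs().sum(dim=-1).mean()
opt.zero_grad()
loss.backward()
opt.step()
print(f"loss = {loss.item():.4f}, mean active features = {(s > 0).float().sum(-1).mean():.1f}")
```

In this sketch the columns of the decoder weight matrix play the role of the dictionary $A$: each column is a candidate feature direction, and the tokens on which it activates are what get inspected when judging whether the feature is monosemantic.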



Why It Matters for Modern Models

  • Frontier LMs rely heavily on superposition: individual neurons implement many overlapping features
  • Monosemantic dictionaries are being applied to Claude-class models for interpretability & safety

Missing Intuition

What is still poorly explained in textbooks and papers:

  • A simple geometric story for why superposition is useful (capacity vs. interference trade-offs); a small numeric sketch follows this list
  • Interactive views of how sparse autoencoders carve up activation space into overlapping feature directions
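On the first point, a toy numeric sketch of the capacity-vs-interference trade-off: pack many random unit directions into a $d$-dimensional space and measure how much they interfere (the largest absolute cosine similarity between distinct directions). The dimensions and feature counts below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def max_interference(d: int, n_features: int) -> float:
    V = rng.normal(size=(n_features, d))
    V /= np.linalg.norm(V, axis=1, keepdims=True)   # unit-norm feature directions
    G = np.abs(V @ V.T)                             # pairwise |cosine| similarities
    np.fill_diagonal(G, 0.0)                        # ignore self-similarity
    return G.max()

for d in (16, 64, 256):
    print(f"d={d:4d}: {4 * d} features -> max |cos| = {max_interference(d, 4 * d):.3f}")

# Interference shrinks as d grows: many more than d sparsely-active features can share
# d dimensions with only small cross-talk, which is the basic geometric story of superposition.
```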


Next Moves

Explore this concept from different angles — like a mathematician would.