11. Representations

Superposition, Sparse Features & Monosemanticity

Canonical Papers

Toy Models of Superposition

Elhage et al., 2022 (Anthropic)

Towards Monosemanticity: Decomposing Language Models with Dictionary Learning

Bricken et al., 2023 (Anthropic)

Core Mathematics

Features are represented not by one neuron each but as sparse directions in activation space. This is formalized via dictionary learning:

$$\min_{A,\{s_i\}} \sum_i \|h_i - A s_i\|_2^2 + \lambda \|s_i\|_1$$

where $h_i$ are activations, the columns of $A$ are the learned *features* (the dictionary), and $s_i$ are sparse codes.
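A minimal NumPy sketch of how the two terms of this objective are evaluated for a batch of activations; the dimensions, sparsity level, and noise scale are illustrative assumptions, not values from the papers above.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_features, n_samples = 64, 512, 1000   # assumed sizes for illustration

A = rng.normal(size=(d_model, n_features))        # dictionary: columns are feature directions
A /= np.linalg.norm(A, axis=0, keepdims=True)     # unit-norm columns (a common convention)
S = rng.normal(size=(n_features, n_samples))      # sparse codes s_i, one column per sample
S *= rng.random(S.shape) < 0.02                   # keep each code ~2% dense
H = A @ S + 0.01 * rng.normal(size=(d_model, n_samples))  # activations h_i to be explained

lam = 1e-3
recon_err = np.sum((H - A @ S) ** 2)              # sum_i ||h_i - A s_i||_2^2
sparsity = np.sum(np.abs(S))                      # sum_i ||s_i||_1
objective = recon_err + lam * sparsity
print(f"objective = {objective:.3f} (recon {recon_err:.3f}, L1 {sparsity:.1f})")
```

This only evaluates the objective for given $A$ and $s_i$; in practice both are optimized jointly, most commonly by training a sparse autoencoder as described next.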

Sparse autoencoders applied to transformer MLP activations recover relatively interpretable, "monosemantic" features.
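A minimal sparse-autoencoder sketch in PyTorch, in the spirit of Bricken et al. (2023). The layer sizes, expansion factor, and L1 coefficient here are illustrative assumptions, not the paper's exact settings.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)  # activation -> feature pre-activations
        self.decoder = nn.Linear(n_features, d_model)  # sparse code -> reconstruction

    def forward(self, h: torch.Tensor):
        s = torch.relu(self.encoder(h))                # non-negative sparse code s
        h_hat = self.decoder(s)                        # reconstruction (roughly A s + bias)
        return h_hat, s

d_model, n_features, l1_coeff = 512, 4096, 1e-3        # assumed sizes / coefficient
sae = SparseAutoencoder(d_model, n_features)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)

# Stand-in for a batch of MLP activations collected from a transformer.
h = torch.randn(256, d_model)

h_hat, s = sae(h)
loss = ((h - h_hat) ** 2).sum(dim=-1).mean() + l1_coeff * s.abs().sum(dim=-1).mean()
opt.zero_grad()
loss.backward()
opt.step()
print(f"loss = {loss.item():.4f}, mean active features = {(s > 0).float().sum(-1).mean():.1f}")
```

In this sketch the columns of the decoder weight matrix play the role of the dictionary $A$: each column is a candidate feature direction, and the tokens on which it activates are what get inspected when judging whether the feature is monosemantic.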



Why It Matters for Modern Models

  • Frontier LMs rely heavily on superposition: individual neurons implement many overlapping features
  • Monosemantic dictionaries are being applied to Claude-class models for interpretability & safety

Missing Intuition

What is still poorly explained in textbooks and papers:

  • A simple geometric story for why superposition is useful (capacity vs. interference trade-offs); a small numeric sketch follows this list
  • Interactive views of how sparse autoencoders carve up activation space into overlapping feature directions
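On the first point, a toy numeric sketch of the capacity-vs-interference trade-off: pack many random unit directions into a $d$-dimensional space and measure how much they interfere (the largest absolute cosine similarity between distinct directions). The dimensions and feature counts below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def max_interference(d: int, n_features: int) -> float:
    V = rng.normal(size=(n_features, d))
    V /= np.linalg.norm(V, axis=1, keepdims=True)   # unit-norm feature directions
    G = np.abs(V @ V.T)                             # pairwise |cosine| similarities
    np.fill_diagonal(G, 0.0)                        # ignore self-similarity
    return G.max()

for d in (16, 64, 256):
    print(f"d={d:4d}: {4 * d} features -> max |cos| = {max_interference(d, 4 * d):.3f}")

# Interference shrinks as d grows: many more than d sparsely-active features can share
# d dimensions with only small cross-talk, which is the basic geometric story of superposition.
```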


Next Moves

Explore this concept from different angles — like a mathematician would.