13. Representations

Transformer Circuits, Induction Heads & Mechanistic Interpretability

Canonical Papers

A Mathematical Framework for Transformer Circuits

Elhage et al., 2021, Anthropic

In-Context Learning and Induction Heads

Olsson et al., 2022, Anthropic

Core Mathematics

Decompose transformer computations into linear components on the residual stream:

$r_{l+1} = r_l + W^{\text{attn}}_l r_l + W^{\text{mlp}}_l r_l$
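The equation above treats the residual stream as a running sum: each layer's attention and MLP components read from the stream and add their output back in. A minimal numerical sketch, using hypothetical random weight matrices in place of trained ones:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 16  # residual stream width (illustrative size)

# Hypothetical linearized per-layer maps: each component reads the
# residual stream and writes its contribution back into it.
W_attn = rng.normal(scale=0.1, size=(d_model, d_model))
W_mlp = rng.normal(scale=0.1, size=(d_model, d_model))

r_l = rng.normal(size=d_model)

# r_{l+1} = r_l + W_attn r_l + W_mlp r_l: the stream accumulates
# the original embedding plus every component's output.
r_next = r_l + W_attn @ r_l + W_mlp @ r_l

# Equivalently, one combined linear map acting on the stream,
# which is what makes the decomposition analytically tractable.
combined = np.eye(d_model) + W_attn + W_mlp
assert np.allclose(r_next, combined @ r_l)
```

The identity term is the skip connection; because everything is additive, each component's contribution to the final logits can be attributed separately.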

Induction heads are specific attention heads that implement a simple copying algorithm:

$[A][B]\dots[A] \rightarrow [B]$

by attending from the final [A] token to previous [A] tokens and copying the subsequent token's representation.
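The behavior described above can be written as a few lines of pure Python. This is a sketch of the algorithm the head implements, not of the learned QK/OV weights:

```python
def induction_predict(tokens):
    """Toy induction algorithm: from the current token, attend back to
    earlier occurrences of the same token and copy the token that
    followed it."""
    current = tokens[-1]
    # Scan backwards for a previous occurrence of the current token.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]  # copy the subsequent token
    return None  # no match: the induction mechanism has nothing to copy

# [A][B] ... [A] -> [B]
assert induction_predict(["A", "B", "C", "A"]) == "B"
```

In the actual circuit, the "scan backwards for a match" step is carried out by the QK matrix (via a previous-token head composing with the induction head), and the "copy the subsequent token" step by the OV matrix.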

The sudden appearance of these heads during training coincides with a phase transition in the model's in-context learning ability.

Key Equation
$r_{l+1} = r_l + W^{\text{attn}}_l r_l + W^{\text{mlp}}_l r_l$


Why It Matters for Modern Models

  • These frameworks, developed on small GPT-style models, extend to modern LLMs such as Llama-3 and Claude-3 by identifying concrete circuits inside them
  • Inform safety research (locating deception-related circuits) and architecture design

Missing Intuition

What is still poorly explained in textbooks and papers:

  • Interactive visualizations of how QK and OV matrices implement algorithms like induction
  • Broader taxonomies of circuit motifs beyond a few toy examples


Next Moves

Explore this concept from different angles, as a mathematician would.