13. Representations

Transformer Circuits, Induction Heads & Mechanistic Interpretability

Canonical Papers

A Mathematical Framework for Transformer Circuits

Elhage et al., 2021, Anthropic

In-Context Learning and Induction Heads

Olsson et al., 2022, Anthropic

Core Mathematics

Decompose transformer computations into linear components on the residual stream:

$r_{l+1} = r_l + W^{\text{attn}}_l r_l + W^{\text{mlp}}_l r_l$
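The equation above treats the residual stream as a running sum: each layer's attention and MLP components read from the stream and add their output back in. A minimal numerical sketch, using hypothetical random weight matrices in place of trained ones:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 16  # residual stream width (illustrative size)

# Hypothetical linearized per-layer maps: each component reads the
# residual stream and writes its contribution back into it.
W_attn = rng.normal(scale=0.1, size=(d_model, d_model))
W_mlp = rng.normal(scale=0.1, size=(d_model, d_model))

r_l = rng.normal(size=d_model)

# r_{l+1} = r_l + W_attn r_l + W_mlp r_l: the stream accumulates
# the original embedding plus every component's output.
r_next = r_l + W_attn @ r_l + W_mlp @ r_l

# Equivalently, one combined linear map acting on the stream,
# which is what makes the decomposition analytically tractable.
combined = np.eye(d_model) + W_attn + W_mlp
assert np.allclose(r_next, combined @ r_l)
```

The identity term is the skip connection; because everything is additive, each component's contribution to the final logits can be attributed separately.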

Induction heads are specific attention heads that implement a simple copying algorithm:

$[A][B]\dots[A] \rightarrow [B]$

by attending from the final [A] token to previous [A] tokens and copying the subsequent token's representation.
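The behavior described above can be written as a few lines of pure Python. This is a sketch of the algorithm the head implements, not of the learned QK/OV weights:

```python
def induction_predict(tokens):
    """Toy induction algorithm: from the current token, attend back to
    earlier occurrences of the same token and copy the token that
    followed it."""
    current = tokens[-1]
    # Scan backwards for a previous occurrence of the current token.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]  # copy the subsequent token
    return None  # no match: the induction mechanism has nothing to copy

# [A][B] ... [A] -> [B]
assert induction_predict(["A", "B", "C", "A"]) == "B"
```

In the actual circuit, the "scan backwards for a match" step is carried out by the QK matrix (via a previous-token head composing with the induction head), and the "copy the subsequent token" step by the OV matrix.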

The sudden appearance of these heads during training coincides with a phase transition in the model's in-context learning ability.

Key Equation
$r_{l+1} = r_l + W^{\text{attn}}_l r_l + W^{\text{mlp}}_l r_l$


Why It Matters for Modern Models

  • These frameworks, developed on small GPT-style models, extend to modern LLMs such as Llama-3 and Claude-3 by identifying concrete circuits inside them
  • Inform safety research (locating deception-related circuits) and architecture design

Missing Intuition

What is still poorly explained in textbooks and papers:

  • Interactive visualizations of how QK and OV matrices implement algorithms like induction
  • Broader taxonomies of circuit motifs beyond a few toy examples


Next Moves

Explore this concept from different angles, as a mathematician would.