Transformer Circuits, Induction Heads & Mechanistic Interpretability
Canonical Papers
A Mathematical Framework for Transformer Circuits
In-Context Learning and Induction Heads
Core Mathematics
Decompose transformer computations into a sum of linear components acting on the residual stream: each attention head then factors into a QK circuit, which determines where it attends, and an OV circuit, which determines what it writes.
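As a concrete illustration, here is a minimal NumPy sketch of that factorization; all dimensions and weights below are made up for illustration, not taken from any trained model:

```python
# Minimal sketch of the QK / OV factorization. Dimensions and
# random weights are illustrative, not from any trained model.
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head, d_vocab = 64, 16, 100

W_Q = rng.normal(size=(d_head, d_model))   # query projection
W_K = rng.normal(size=(d_head, d_model))   # key projection
W_V = rng.normal(size=(d_head, d_model))   # value projection
W_O = rng.normal(size=(d_model, d_head))   # output projection
W_E = rng.normal(size=(d_model, d_vocab))  # token embedding
W_U = rng.normal(size=(d_vocab, d_model))  # unembedding

# A head only ever uses two low-rank products (rank <= d_head):
W_QK = W_Q.T @ W_K  # (d_model, d_model): where to attend
W_OV = W_O @ W_V    # (d_model, d_model): what to move

# Composing with the (un)embeddings gives vocabulary-level circuits:
qk_circuit = W_E.T @ W_QK @ W_E  # which tokens prefer attending to which
ov_circuit = W_U @ W_OV @ W_E    # how an attended token shifts the logits

print(np.linalg.matrix_rank(W_QK))  # <= d_head = 16
```

The point of the factorization is that `W_Q`, `W_K`, `W_V`, `W_O` never appear individually in the head's input-output behavior, only through these two low-rank products.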
Induction heads: specific attention heads implement the pattern-completion algorithm [A][B] … [A] → [B], attending from the current [A] token back to earlier occurrences of [A] and copying the representation of the [B] token that followed, so that the model predicts [B] next.
The sudden appearance of these heads is tied to a phase transition in in-context learning.
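To make the copying algorithm concrete, the following deliberately non-neural Python sketch reimplements it as explicit history matching; the function name and toy sequence are hypothetical, and a trained head performs the same computation softly through its QK and OV circuits rather than by exact lookup:

```python
# Non-neural sketch of the induction algorithm [A][B] ... [A] -> [B].
# A trained head does this softly: its QK circuit matches earlier
# occurrences of the current token, and its OV circuit copies the
# token that followed. Names and the toy sequence are illustrative.
from collections import Counter

def induction_predict(tokens):
    """Predict the next token by matching the last token against history."""
    current = tokens[-1]
    votes = Counter(
        tokens[i + 1]                    # the token that followed ...
        for i in range(len(tokens) - 1)
        if tokens[i] == current          # ... each earlier copy of [A]
    )
    return votes.most_common(1)[0][0] if votes else None

seq = ["the", "cat", "sat", "on", "the"]
print(induction_predict(seq))  # -> "cat": [the][cat] ... [the] -> [cat]
```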
Key Equation
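Reconstructed here in the notation of "A Mathematical Framework for Transformer Circuits": the logits $T$ of a one-layer attention-only transformer expand into a direct (zero-layer) path plus one term per attention head,

$$
T \;=\; \mathrm{Id} \otimes W_U W_E \;+\; \sum_{h \in \mathrm{heads}} A^h \otimes \left(W_U W_{OV}^h W_E\right),
\qquad
A^h = \mathrm{softmax}\!\left(t^\top W_E^\top W_{QK}^h\, W_E\, t\right)
$$

where $t$ is the one-hot token sequence, $W_{QK}^h = (W_Q^h)^\top W_K^h$ determines where head $h$ attends, and $W_{OV}^h = W_O^h W_V^h$ determines what it writes. With the attention patterns $A^h$ held fixed, every term is linear in the input, which is what makes circuit-level analysis tractable.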
Why It Matters for Modern Models
- These frameworks, developed on small attention-only transformers, carry over to modern GPT-style models, including Llama-3 and Claude-3, where they are used to identify concrete circuits
- They inform safety research (e.g., locating deception-related circuits) and architecture design
Missing Intuition
What is still poorly explained in textbooks and papers:
- Interactive visualizations of how QK and OV matrices implement algorithms like induction
- Broader taxonomies of circuit motifs beyond a few toy examples