Automated Circuit Discovery: Patching, Attribution & Decomposition at Scale
Canonical Papers
- Towards Automated Circuit Discovery for Mechanistic Interpretability
- Attribution Patching Outperforms Automated Circuit Discovery
- Efficient Automated Circuit Discovery in Transformers using Contextual Decomposition
Core Mathematics
Automated circuit discovery turns mechanistic interpretability into a search/scoring/pruning problem over the model's computational graph, moving from hand-crafted case studies to scalable pipelines.
Activation patching as causal intervention (edge/node ablation):
This measures a causal effect: replace the activation on one edge with its value from a corrupted run, hold everything else at its clean value, and measure the change in loss.
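A minimal sketch of the intervention, assuming a hypothetical two-unit toy model (the weights, inputs, and function names below are illustrative, not taken from any of the papers). Each hidden unit has a single outgoing edge, so node and edge patching coincide here.

```python
import math

def forward(x, patch=None):
    """Toy model with two hidden units; `patch` overrides named activations."""
    patch = patch or {}
    h1 = patch.get("h1", math.tanh(1.5 * x[0] - 0.5 * x[1]))
    h2 = patch.get("h2", math.tanh(0.2 * x[0] + 0.1 * x[1]))
    logit = 2.0 * h1 + 0.1 * h2  # hypothetical readout weights
    return logit, {"h1": h1, "h2": h2}

def loss(logit):
    # Cross-entropy for a positive label: -log(sigmoid(logit))
    return math.log1p(math.exp(-logit))

def patching_effect(node, x_clean, x_corrupt):
    """Change in clean-run loss when `node` takes its corrupted-run value."""
    _, corrupt_acts = forward(x_corrupt)
    clean_logit, _ = forward(x_clean)
    patched_logit, _ = forward(x_clean, patch={node: corrupt_acts[node]})
    return loss(patched_logit) - loss(clean_logit)

x_clean, x_corrupt = (1.0, 0.0), (-1.0, 0.0)
effects = {n: patching_effect(n, x_clean, x_corrupt) for n in ("h1", "h2")}
for node, effect in effects.items():
    print(f"{node}: loss change {effect:.4f}")
```

In this toy, patching `h1` raises the loss far more than patching `h2`, which is the signal a circuit-discovery method would use to keep the first and discard the second. Note the cost: one extra forward pass per patched component.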
Attribution patching (first-order/Taylor approximation):
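A linearized estimate of the same causal effect: much faster than full patching, because two forward passes and one backward pass score every edge at once, but the first-order approximation can be unfaithful.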
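The gap between the two estimates can be seen on a small example. This is a sketch under stated assumptions: the activations, readout weights `W`, and the hand-written gradient are hypothetical stand-ins for what an autodiff framework would supply on a real model.

```python
import math

W = {"h1": 2.0, "h2": 0.1}  # hypothetical readout weights

def loss_from_acts(acts):
    logit = sum(W[n] * a for n, a in acts.items())
    return math.log1p(math.exp(-logit))  # -log(sigmoid(logit))

def grad_loss(acts):
    """Analytic dL/d(activation) for the loss above."""
    logit = sum(W[n] * a for n, a in acts.items())
    dloss_dlogit = -1.0 / (1.0 + math.exp(logit))  # equals -sigmoid(-logit)
    return {n: dloss_dlogit * W[n] for n in acts}

clean = {"h1": 0.9, "h2": 0.2}      # activations on the clean run
corrupt = {"h1": -0.9, "h2": -0.2}  # activations on the corrupted run

grads = grad_loss(clean)  # one backward pass covers every node
exact, approx = {}, {}
for n in clean:
    patched = {**clean, n: corrupt[n]}
    exact[n] = loss_from_acts(patched) - loss_from_acts(clean)
    approx[n] = (corrupt[n] - clean[n]) * grads[n]  # first-order estimate
    print(f"{n}: exact {exact[n]:.4f}  linearized {approx[n]:.4f}")
```

For `h2`, whose patch barely moves the logit, the linearized score is close to the exact one. For `h1`, the patch moves the model far from the clean operating point and the linear estimate falls well short, although the ranking of the two nodes is unchanged.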
Edge scoring + pruning (circuit extraction):
Rank edges by importance score, then prune the lowest-scoring edges until only a minimal circuit that still preserves the behavior remains.
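A simplified greedy version of this loop might look as follows. It is not ACDC's exact procedure (which sweeps the graph from outputs to inputs with a divergence threshold); the edge names, contributions, and the additive `evaluate` function are hypothetical.

```python
def prune_circuit(scores, evaluate, tolerance):
    """Greedily drop low-scoring edges while the kept set preserves behavior."""
    kept = set(scores)
    baseline = evaluate(kept)  # metric of the full, unpruned graph
    # Try the least important edges first.
    for edge in sorted(scores, key=lambda e: abs(scores[e])):
        candidate = kept - {edge}
        if abs(evaluate(candidate) - baseline) <= tolerance:
            kept = candidate  # behavior preserved, so the edge is not needed
    return kept

# Hypothetical toy: every edge writes directly into the output metric.
contrib = {
    "head0->out": 0.02,
    "head1->out": -0.01,
    "mlp0->out": 1.3,
    "mlp1->out": 2.1,
}

def evaluate(edges):
    return sum(contrib[e] for e in edges)

scores = dict(contrib)  # in practice these come from patching or attribution
circuit = prune_circuit(scores, evaluate, tolerance=0.05)
print(sorted(circuit))
```

The tolerance is the key design choice: a looser threshold yields a sparser circuit that reproduces the behavior less exactly.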
Key Equation
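The two scores described above can be written side by side. The notation is an assumption of this sketch rather than the papers' exact formulation: $e(x)$ is the activation carried by edge $e$ on input $x$, $L$ is the loss or task metric, and $\operatorname{do}(\cdot)$ marks the intervention.

```latex
\begin{align*}
\Delta_e^{\text{patch}} &= L\bigl(x_{\text{clean}} \mid \operatorname{do}(e := e(x_{\text{corr}}))\bigr) - L(x_{\text{clean}}) \\
\Delta_e^{\text{attr}} &= \bigl(e(x_{\text{corr}}) - e(x_{\text{clean}})\bigr)^{\top}
  \left.\frac{\partial L}{\partial e}\right|_{x_{\text{clean}}} \approx \Delta_e^{\text{patch}}
\end{align*}
```

The second line is the first-order Taylor expansion of the first around the clean run, so the two agree when the patch is small or $L$ is close to linear in $e$.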
Why It Matters for Modern Models
- Moves mechanistic interpretability from "one-off archaeology" to repeatable pipeline—essential for scaling to frontier models
- 2023-2024 progression: ACDC (slow iterative patching) → EAP (faster gradient-based approximation) → CD-T (contextual decomposition, reported to run in seconds); speed is the bottleneck
- Enables target selection for steering, safety interventions, debugging—you need to find the circuit before you can edit it
- After SAEs (#27), this shows how to build feature-level circuits automatically, not just head/neuron circuits
- Bridges to automated red-teaming and interpretability evaluations at scale—necessary for AI safety pipelines
Missing Intuition
What is still poorly explained in textbooks and papers:
- Activation patching measures a causal effect directly; attribution patching is a first-order linearization of it, and the approximation can still rank components usefully even when individual scores are unfaithful
- Metric choice matters a great deal: some metrics have degenerate gradients (for example, zero gradient at the optimum), and automated methods inherit these failures
- Circuits are not unique: many sparse subgraphs can reproduce a behavior, so pruning reveals *a* mechanism, not *the* mechanism
- The speed-faithfulness tradeoff is fundamental: slow patching is accurate, while fast approximations trade correctness for scalability
- Circuit discovery is hypothesis testing: you are testing "does this subgraph explain behavior X", not discovering ground truth
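The degenerate-gradient point can be made concrete with a small sketch. The setup is hypothetical: a single activation `a` feeds a logit through an assumed weight `W`, and the same patch is scored under a probability metric and a logit metric.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

W = 10.0                         # hypothetical readout weight: logit = W * a
a_clean, a_corrupt = 2.0, -2.0   # the model is very confident on the clean run

def prob_metric(a):
    return sigmoid(W * a)        # probability of the correct answer

def d_prob_metric(a):
    p = sigmoid(W * a)
    return W * p * (1.0 - p)     # vanishes when p saturates near 1

def logit_metric(a):
    return W * a                 # raw logit, linear in the activation

def d_logit_metric(a):
    return W

delta = a_corrupt - a_clean
prob_exact = prob_metric(a_corrupt) - prob_metric(a_clean)
prob_approx = delta * d_prob_metric(a_clean)
logit_exact = logit_metric(a_corrupt) - logit_metric(a_clean)
logit_approx = delta * d_logit_metric(a_clean)

print(f"probability metric: exact {prob_exact:.3f}, linearized {prob_approx:.2e}")
print(f"logit metric:       exact {logit_exact:.1f}, linearized {logit_approx:.1f}")
```

Under the saturated probability metric the linearized score is essentially zero even though patching flips the answer, so a gradient-based method would wrongly discard this component. Under the logit metric the linearization is exact in this toy, because the metric is linear in the activation.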