
🔬 Automated Circuit Discovery: Patching, Attribution & Decomposition at Scale

Canonical Papers

Towards Automated Circuit Discovery for Mechanistic Interpretability

Conmy et al., 2023, NeurIPS

Attribution Patching Outperforms Automated Circuit Discovery

Syed et al., 2024, BlackboxNLP / ACL

Efficient Automated Circuit Discovery in Transformers using Contextual Decomposition

Hsu et al., 2025, ICLR

Core Mathematics

Automated circuit discovery turns mechanistic interpretability into a search/scoring/pruning problem over the model's computational graph, moving from hand-crafted case studies to scalable pipelines.

Activation patching as causal intervention (edge/node ablation):

$$\Delta_E = \left| L(x_{\text{clean}} \mid \text{do}(E = e_{\text{corr}})) - L(x_{\text{clean}}) \right|$$

This measures the causal effect: replace edge $E$'s activation with its value from the corrupted run, then measure the impact on the loss.
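A minimal sketch of this intervention, using a hypothetical 2-layer linear "model" in numpy (not the ACDC implementation): one hidden activation plays the role of an edge value, and we overwrite it with its value from a corrupted run.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=(2, 4))
target = np.array([1.0, -1.0])

def loss(x, patch=None):
    """Squared-error loss; optionally overwrite one hidden unit (an 'edge')."""
    h = W1 @ x                      # hidden activations
    if patch is not None:
        idx, value = patch
        h = h.copy()
        h[idx] = value              # do(E = e_corr): intervene on one edge
    out = W2 @ h
    return float(np.sum((out - target) ** 2))

x_clean = np.array([1.0, 0.5, -0.5])
x_corr = np.array([-1.0, 0.2, 0.9])     # corrupted prompt
e_corr = (W1 @ x_corr)[1]               # edge value from the corrupted run

# Delta_E = | L(x_clean | do(E = e_corr)) - L(x_clean) |
delta_E = abs(loss(x_clean, patch=(1, e_corr)) - loss(x_clean))
```

In a real transformer the same pattern is implemented with forward hooks that swap one activation mid-forward-pass; the loop over all candidate edges is what makes full patching slow.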

Attribution patching (first-order/Taylor approximation):

$$L(x_{\text{clean}} \mid \text{do}(E = e_{\text{corr}})) \approx L(x_{\text{clean}}) + (e_{\text{corr}} - e_{\text{clean}})^\top \frac{\partial L}{\partial e_{\text{clean}}}$$

A linearized causal estimate: far cheaper than full patching (one backward pass scores every edge at once), but the first-order approximation can be unfaithful.
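The gap between the exact patch and its linearization is easy to see on a scalar toy loss (a hypothetical example, not from any of the papers): the attribution estimate uses only the gradient at the clean activation.

```python
def L(e):
    """Toy loss as a function of one edge activation e."""
    return (e - 2.0) ** 2 + 0.1 * e ** 3

def dL(e):
    """Analytic gradient dL/de."""
    return 2.0 * (e - 2.0) + 0.3 * e ** 2

e_clean, e_corr = 1.0, 1.2

exact = L(e_corr) - L(e_clean)              # true causal effect of the patch
approx = (e_corr - e_clean) * dL(e_clean)   # attribution-patching estimate
```

Here `exact` is -0.2872 while `approx` is -0.34: close enough to rank this edge correctly, which is all circuit discovery needs, even though the pointwise value is off.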

Edge scoring + pruning (circuit extraction):

$$s_E = \Delta_E, \quad \text{then keep the top-}k\text{ edges}$$

Rank edges by importance, then prune to the minimal circuit that preserves the behavior.
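The extraction step itself is just a sort-and-truncate over edge scores. A minimal sketch, with hypothetical edge names and fixed scores standing in for computed $\Delta_E$ values:

```python
# Scores Delta_E per candidate edge (here faked; in practice computed
# by activation or attribution patching over the whole graph).
scores = {
    ("h0", "mlp1"): 0.91,
    ("h1", "mlp1"): 0.02,
    ("emb", "h0"): 0.40,
    ("h0", "out"): 0.05,
}

k = 2
# Keep the k highest-scoring edges as the candidate circuit.
circuit = sorted(scores, key=scores.get, reverse=True)[:k]
```

Real pipelines then re-run the model with all other edges ablated to check that the kept subgraph still reproduces the behavior; the threshold $k$ trades circuit size against faithfulness.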

Key Equation
$$\Delta_E = \left| L(x_{\text{clean}} \mid \text{do}(E = e_{\text{corr}})) - L(x_{\text{clean}}) \right|$$

Why It Matters for Modern Models

  • Moves mechanistic interpretability from "one-off archaeology" to repeatable pipeline—essential for scaling to frontier models
  • 2024-2025 trend: ACDC (slow patching) → EAP (faster approximation) → CD-T (decomposition in seconds)—speed is the bottleneck
  • Enables target selection for steering, safety interventions, debugging—you need to find the circuit before you can edit it
  • After SAEs (#27), this shows how to build feature-level circuits automatically, not just head/neuron circuits
  • Bridges to automated red-teaming and interpretability evaluations at scale—necessary for AI safety pipelines

Missing Intuition

What is still poorly explained in textbooks and papers:

  • Activation patching is causal; attribution patching is linearized causal—approximation can be useful even when unfaithful pointwise
  • Metric choice is everything—some metrics create degenerate gradients (zero-gradient at optimum), automated methods inherit these failures
  • Circuits are not unique—many sparse subgraphs can reproduce behavior, pruning reveals *a* mechanism not *the* mechanism
  • Speed-faithfulness tradeoff is fundamental—slow patching is accurate, fast approximations trade correctness for scalability
  • Circuit discovery is hypothesis testing—you're testing "does this subgraph explain behavior X" not discovering ground truth
