
🔬 Automated Circuit Discovery: Patching, Attribution & Decomposition at Scale

Canonical Papers

Towards Automated Circuit Discovery for Mechanistic Interpretability

Conmy et al., 2023, NeurIPS

Attribution Patching Outperforms Automated Circuit Discovery

Syed et al., 2024, BlackboxNLP / ACL

Efficient Automated Circuit Discovery in Transformers using Contextual Decomposition

Hsu et al., 2025, ICLR

Core Mathematics

Automated circuit discovery turns mechanistic interpretability into a search/scoring/pruning problem over the model's computational graph, moving from hand-crafted case studies to scalable pipelines.

Activation patching as causal intervention (edge/node ablation):

$$\Delta_E = \left| L(x_{\text{clean}} \mid \text{do}(E = e_{\text{corr}})) - L(x_{\text{clean}}) \right|$$

This measures the causal effect: replace edge $E$'s activation with its value from the corrupted run, then measure the impact on the loss.
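A minimal sketch of this intervention, using a hypothetical 2-layer linear "model" in numpy (not the ACDC implementation): one hidden activation plays the role of an edge value, and we overwrite it with its value from a corrupted run.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=(2, 4))
target = np.array([1.0, -1.0])

def loss(x, patch=None):
    """Squared-error loss; optionally overwrite one hidden unit (an 'edge')."""
    h = W1 @ x                      # hidden activations
    if patch is not None:
        idx, value = patch
        h = h.copy()
        h[idx] = value              # do(E = e_corr): intervene on one edge
    out = W2 @ h
    return float(np.sum((out - target) ** 2))

x_clean = np.array([1.0, 0.5, -0.5])
x_corr = np.array([-1.0, 0.2, 0.9])     # corrupted prompt
e_corr = (W1 @ x_corr)[1]               # edge value from the corrupted run

# Delta_E = | L(x_clean | do(E = e_corr)) - L(x_clean) |
delta_E = abs(loss(x_clean, patch=(1, e_corr)) - loss(x_clean))
```

In a real transformer the same pattern is implemented with forward hooks that swap one activation mid-forward-pass; the loop over all candidate edges is what makes full patching slow.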

Attribution patching (first-order/Taylor approximation):

$$L(x_{\text{clean}} \mid \text{do}(E = e_{\text{corr}})) \approx L(x_{\text{clean}}) + (e_{\text{corr}} - e_{\text{clean}})^\top \frac{\partial L}{\partial e_{\text{clean}}}$$

A linearized causal estimate: far cheaper than full patching (one backward pass scores every edge at once), but the first-order approximation can be unfaithful.
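The gap between the exact patch and its linearization is easy to see on a scalar toy loss (a hypothetical example, not from any of the papers): the attribution estimate uses only the gradient at the clean activation.

```python
def L(e):
    """Toy loss as a function of one edge activation e."""
    return (e - 2.0) ** 2 + 0.1 * e ** 3

def dL(e):
    """Analytic gradient dL/de."""
    return 2.0 * (e - 2.0) + 0.3 * e ** 2

e_clean, e_corr = 1.0, 1.2

exact = L(e_corr) - L(e_clean)              # true causal effect of the patch
approx = (e_corr - e_clean) * dL(e_clean)   # attribution-patching estimate
```

Here `exact` is -0.2872 while `approx` is -0.34: close enough to rank this edge correctly, which is all circuit discovery needs, even though the pointwise value is off.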

Edge scoring + pruning (circuit extraction):

$$s_E = \Delta_E, \quad \text{then keep the top-}k\text{ edges}$$

Rank edges by importance, then prune to the minimal circuit that preserves the behavior.
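The extraction step itself is just a sort-and-truncate over edge scores. A minimal sketch, with hypothetical edge names and fixed scores standing in for computed $\Delta_E$ values:

```python
# Scores Delta_E per candidate edge (here faked; in practice computed
# by activation or attribution patching over the whole graph).
scores = {
    ("h0", "mlp1"): 0.91,
    ("h1", "mlp1"): 0.02,
    ("emb", "h0"): 0.40,
    ("h0", "out"): 0.05,
}

k = 2
# Keep the k highest-scoring edges as the candidate circuit.
circuit = sorted(scores, key=scores.get, reverse=True)[:k]
```

Real pipelines then re-run the model with all other edges ablated to check that the kept subgraph still reproduces the behavior; the threshold $k$ trades circuit size against faithfulness.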

Key Equation
$$\Delta_E = \left| L(x_{\text{clean}} \mid \text{do}(E = e_{\text{corr}})) - L(x_{\text{clean}}) \right|$$

Why It Matters for Modern Models

  • Moves mechanistic interpretability from "one-off archaeology" to repeatable pipeline—essential for scaling to frontier models
  • 2024-2025 trend: ACDC (slow patching) → EAP (faster approximation) → CD-T (decomposition in seconds)—speed is the bottleneck
  • Enables target selection for steering, safety interventions, debugging—you need to find the circuit before you can edit it
  • After SAEs (#27), this shows how to build feature-level circuits automatically, not just head/neuron circuits
  • Bridges to automated red-teaming and interpretability evaluations at scale—necessary for AI safety pipelines

Missing Intuition

What is still poorly explained in textbooks and papers:

  • Activation patching is causal; attribution patching is linearized causal—approximation can be useful even when unfaithful pointwise
  • Metric choice is everything—some metrics create degenerate gradients (zero-gradient at optimum), automated methods inherit these failures
  • Circuits are not unique—many sparse subgraphs can reproduce behavior, pruning reveals *a* mechanism not *the* mechanism
  • Speed-faithfulness tradeoff is fundamental—slow patching is accurate, fast approximations trade correctness for scalability
  • Circuit discovery is hypothesis testing—you're testing "does this subgraph explain behavior X" not discovering ground truth
