#29 · Representations

🎚️ Activation Steering: Feature-Guided Interventions for Inference-Time Control

Canonical Papers

Activation Scaling for Steering and Interpreting Language Models

Stoehr et al. (2024), EMNLP Findings

Improving Instruction-Following in Language Models through Activation Steering

Stolfo et al. (2025), ICLR (arXiv)

Feature Guided Activation Additions

Soo et al. (2025), OpenReview

Core Mathematics

Activation steering edits hidden states (or SAE latents) at inference time to change behavior without retraining, turning interpretability findings into control knobs.

Classic steering vector addition (one layer/position):

$$h_{\ell,t}^{\text{steer}} = h_{\ell,t} + \alpha v_{\ell,t}$$

where $\alpha$ controls the strength and $v_{\ell,t}$ is the steering direction.
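
A minimal sketch of this addition as a PyTorch forward hook. The layer path and the random placeholder direction are illustrative assumptions, not taken from the papers above:

```python
import torch

# Placeholder direction; in practice v comes from the contrastive recipe
# below or from an SAE feature. Unit-norm so alpha alone sets the scale.
d_model = 768
alpha = 4.0
steering_vector = torch.randn(d_model)
steering_vector /= steering_vector.norm()

def steer_hook(module, inputs, output):
    # h_steer = h + alpha * v, broadcast over batch and positions.
    # Note: some transformer blocks return tuples; unpack before adding.
    return output + alpha * steering_vector.to(output.dtype)

# Hypothetical layer path; adjust for your architecture:
# handle = model.transformer.h[12].register_forward_hook(steer_hook)
# ...generate as usual; the edit applies on every forward pass...
# handle.remove()
```

Returning a non-None value from a PyTorch forward hook replaces the module's output, so the edit lands at exactly one layer without touching the weights.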

Contrastive steering vector (difference of means):

$$v_{\ell,t} = \mathbb{E}[h_{\ell,t} \mid \text{desired}] - \mathbb{E}[h_{\ell,t} \mid \text{undesired}]$$

This is the core idea behind "instruction vectors" and CAA-style (contrastive activation addition) methods: find a direction that separates desired from undesired behaviors.
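
A hedged sketch of the difference-of-means recipe, assuming a Hugging Face-style model that accepts `output_hidden_states=True`; reading the last token position is one common choice, not the only one:

```python
import torch

@torch.no_grad()
def contrastive_vector(model, tokenizer, desired, undesired, layer):
    """v = E[h | desired] - E[h | undesired], read at the last token
    position of each prompt."""
    def mean_act(prompts):
        acts = []
        for p in prompts:
            ids = tokenizer(p, return_tensors="pt").input_ids
            out = model(ids, output_hidden_states=True)
            acts.append(out.hidden_states[layer][0, -1])  # (d_model,)
        return torch.stack(acts).mean(dim=0)

    v = mean_act(desired) - mean_act(undesired)
    return v / v.norm()  # unit-norm direction; strength is set via alpha
```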

SAE-latent steering (feature toggle → decode back):

$$z = f_{\text{enc}}(h_{\ell,t}), \quad h_{\ell,t}^{\text{steer}} = f_{\text{dec}}(z + \delta e_k)$$

with $e_k$ a basis vector selecting feature $k$: interpretable steering via learned feature directions.
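
A minimal sketch with stand-in SAE weights; the shapes and the ReLU encoder are typical but hypothetical here, and a real SAE would be trained on residual-stream activations:

```python
import torch

# Stand-in SAE parameters (illustrative sizes only).
d_model, d_sae = 768, 16384
W_enc = torch.randn(d_sae, d_model) * 0.01
b_enc = torch.zeros(d_sae)
W_dec = torch.randn(d_model, d_sae) * 0.01
b_dec = torch.zeros(d_model)

def sae_steer(h, k, delta):
    """Encode h, bump latent k by delta, decode back to the residual stream."""
    z = torch.relu(h @ W_enc.T + b_enc)   # z = f_enc(h)
    z[..., k] += delta                    # z + delta * e_k
    return z @ W_dec.T + b_dec            # h_steer = f_dec(z + delta * e_k)

h = torch.randn(1, 5, d_model)            # (batch, seq, d_model)
h_steered = sae_steer(h, k=123, delta=8.0)
```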

Key Equation
$$h_{\ell,t}^{\text{steer}} = h_{\ell,t} + \alpha v_{\ell,t}$$

Why It Matters for Modern Models

  • Inference-time control without fine-tuning: enforce format, length, and content constraints, shift style, and reduce or induce refusals at deployment
  • Steering is a causal test: if adding feature k induces the behavior, you've localized the mechanism and can validate circuits (#28)
  • Minimal-interventions trend: activation scaling learns sparse scalars to strengthen or weaken existing directions, which is more interpretable than dense edits (see the sketch after this list)
  • After SAEs (#27) and circuit discovery (#28), steering makes interpretability actionable: debugging and control, not just analysis
  • Bridge to safety: controllable refusal, style, and format constraints as post-training knobs, without expensive retraining loops
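
One hedged reading of the activation-scaling idea: scale only the component of h along an existing direction rather than adding a new vector. This is an illustrative parameterization, not the paper's exact method:

```python
import torch

def scale_direction(h, v_hat, beta):
    """Scale the component of h along unit direction v_hat by beta:
    h + (beta - 1) * <h, v_hat> * v_hat. beta > 1 strengthens an existing
    direction, 0 <= beta < 1 weakens it, beta = 0 ablates it; nothing new
    is added, unlike h + alpha * v."""
    coeff = (h * v_hat).sum(dim=-1, keepdim=True)  # <h, v_hat> per position
    return h + (beta - 1.0) * coeff * v_hat

h = torch.randn(1, 5, 768)     # (batch, seq, d_model)
v_hat = torch.randn(768)
v_hat /= v_hat.norm()
h_weakened = scale_direction(h, v_hat, beta=0.25)
```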

Missing Intuition

What is still poorly explained in textbooks and papers:

  • Steering is geometry on the model's activation manifold: small α nudges stay within-distribution, large α throws you off-manifold, causing incoherence and capability loss (a sketch for monitoring this follows the list)
  • Interpretable steering isn't just about *what direction*, it's about *which basis*: SAE features give human-readable handles, but the decoder can entangle their effects
  • Compositionality isn't guaranteed: adding two "instruction vectors" can cancel or amplify depending on where they write in the residual stream
  • The strength-capability tradeoff is fundamental: too weak has no effect, too strong breaks generation quality
  • Steering reveals what the model "knows": if you can steer it into a behavior, the capability already exists in the weights, it just isn't naturally expressed
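
To make the on/off-manifold intuition measurable, one can track the size of the edit relative to the hidden state's own norm. The thresholds below are heuristics assumed for illustration, not results from the cited papers:

```python
import torch

def edit_ratio(h, v, alpha):
    """||alpha * v|| / ||h|| per position: how large the edit is relative
    to the hidden state's own scale. Ratios well below 1 tend to stay
    on-distribution; near or above 1, coherence often degrades."""
    return (alpha * v).norm() / h.norm(dim=-1)  # (batch, seq)

# Sweep alpha and watch where generation quality breaks down:
# for alpha in (0.5, 1.0, 2.0, 4.0, 8.0, 16.0):
#     print(alpha, edit_ratio(h, steering_vector, alpha).mean().item())
```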
