#29 · Representations

🎚️ Activation Steering: Feature-Guided Interventions for Inference-Time Control

Canonical Papers

Activation Scaling for Steering and Interpreting Language Models

Stoehr et al. (2024), EMNLP Findings

Improving Instruction-Following in Language Models through Activation Steering

Stolfo et al. (2025), ICLR (arXiv)

Feature Guided Activation Additions

Soo et al. (2025), OpenReview

Core Mathematics

Activation steering edits hidden states (or SAE latents) at inference time to change behavior without retraining, turning interpretability findings into control knobs.

Classic steering vector addition (one layer/position):

$$h_{\ell,t}^{\text{steer}} = h_{\ell,t} + \alpha v_{\ell,t}$$

where $\alpha$ controls the strength and $v_{\ell,t}$ is the steering direction.
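
A minimal sketch of this addition as a PyTorch forward hook. The layer path and the random placeholder direction are illustrative assumptions, not taken from the papers above:

```python
import torch

# Placeholder direction; in practice v comes from the contrastive recipe
# below or from an SAE feature. Unit-norm so alpha alone sets the scale.
d_model = 768
alpha = 4.0
steering_vector = torch.randn(d_model)
steering_vector /= steering_vector.norm()

def steer_hook(module, inputs, output):
    # h_steer = h + alpha * v, broadcast over batch and positions.
    # Note: some transformer blocks return tuples; unpack before adding.
    return output + alpha * steering_vector.to(output.dtype)

# Hypothetical layer path; adjust for your architecture:
# handle = model.transformer.h[12].register_forward_hook(steer_hook)
# ...generate as usual; the edit applies on every forward pass...
# handle.remove()
```

Returning a non-None value from a PyTorch forward hook replaces the module's output, so the edit lands at exactly one layer without touching the weights.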

Contrastive steering vector (difference of means):

$$v_{\ell,t} = \mathbb{E}[h_{\ell,t} \mid \text{desired}] - \mathbb{E}[h_{\ell,t} \mid \text{undesired}]$$

This is the core idea behind "instruction vectors" and CAA-style (contrastive activation addition) methods: find a direction that separates desired from undesired behaviors.
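
A hedged sketch of the difference-of-means recipe, assuming a Hugging Face-style model that accepts `output_hidden_states=True`; reading the last token position is one common choice, not the only one:

```python
import torch

@torch.no_grad()
def contrastive_vector(model, tokenizer, desired, undesired, layer):
    """v = E[h | desired] - E[h | undesired], read at the last token
    position of each prompt."""
    def mean_act(prompts):
        acts = []
        for p in prompts:
            ids = tokenizer(p, return_tensors="pt").input_ids
            out = model(ids, output_hidden_states=True)
            acts.append(out.hidden_states[layer][0, -1])  # (d_model,)
        return torch.stack(acts).mean(dim=0)

    v = mean_act(desired) - mean_act(undesired)
    return v / v.norm()  # unit-norm direction; strength is set via alpha
```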

SAE-latent steering (feature toggle → decode back):

$$z = f_{\text{enc}}(h_{\ell,t}), \quad h_{\ell,t}^{\text{steer}} = f_{\text{dec}}(z + \delta e_k)$$

with $e_k$ a basis vector selecting feature $k$: interpretable steering via learned feature directions.
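
A minimal sketch with stand-in SAE weights; the shapes and the ReLU encoder are typical but hypothetical here, and a real SAE would be trained on residual-stream activations:

```python
import torch

# Stand-in SAE parameters (illustrative sizes only).
d_model, d_sae = 768, 16384
W_enc = torch.randn(d_sae, d_model) * 0.01
b_enc = torch.zeros(d_sae)
W_dec = torch.randn(d_model, d_sae) * 0.01
b_dec = torch.zeros(d_model)

def sae_steer(h, k, delta):
    """Encode h, bump latent k by delta, decode back to the residual stream."""
    z = torch.relu(h @ W_enc.T + b_enc)   # z = f_enc(h)
    z[..., k] += delta                    # z + delta * e_k
    return z @ W_dec.T + b_dec            # h_steer = f_dec(z + delta * e_k)

h = torch.randn(1, 5, d_model)            # (batch, seq, d_model)
h_steered = sae_steer(h, k=123, delta=8.0)
```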

Key Equation
$$h_{\ell,t}^{\text{steer}} = h_{\ell,t} + \alpha v_{\ell,t}$$

Why It Matters for Modern Models

  • Inference-time control without fine-tuning: enforce format, length, and content constraints, shift style, and reduce or induce refusals at deployment
  • Steering is a causal test: if adding feature k induces the behavior, you've localized the mechanism and can validate circuits (#28)
  • Minimal-interventions trend: activation scaling learns sparse scalars to strengthen or weaken existing directions, which is more interpretable than dense edits (see the sketch after this list)
  • After SAEs (#27) and circuit discovery (#28), steering makes interpretability actionable: debugging and control, not just analysis
  • Bridge to safety: controllable refusal, style, and format constraints as post-training knobs, without expensive retraining loops
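
One hedged reading of the activation-scaling idea: scale only the component of h along an existing direction rather than adding a new vector. This is an illustrative parameterization, not the paper's exact method:

```python
import torch

def scale_direction(h, v_hat, beta):
    """Scale the component of h along unit direction v_hat by beta:
    h + (beta - 1) * <h, v_hat> * v_hat. beta > 1 strengthens an existing
    direction, 0 <= beta < 1 weakens it, beta = 0 ablates it; nothing new
    is added, unlike h + alpha * v."""
    coeff = (h * v_hat).sum(dim=-1, keepdim=True)  # <h, v_hat> per position
    return h + (beta - 1.0) * coeff * v_hat

h = torch.randn(1, 5, 768)     # (batch, seq, d_model)
v_hat = torch.randn(768)
v_hat /= v_hat.norm()
h_weakened = scale_direction(h, v_hat, beta=0.25)
```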

Missing Intuition

What is still poorly explained in textbooks and papers:

  • Steering is geometry on the model's activation manifold: small α nudges stay within-distribution, large α throws you off-manifold, causing incoherence and capability loss (a sketch for monitoring this follows the list)
  • Interpretable steering isn't just about *what direction*, it's about *which basis*: SAE features give human-readable handles, but the decoder can entangle their effects
  • Compositionality isn't guaranteed: adding two "instruction vectors" can cancel or amplify depending on where they write in the residual stream
  • The strength-capability tradeoff is fundamental: too weak has no effect, too strong breaks generation quality
  • Steering reveals what the model "knows": if you can steer it into a behavior, the capability already exists in the weights, it just isn't naturally expressed
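
To make the on/off-manifold intuition measurable, one can track the size of the edit relative to the hidden state's own norm. The thresholds below are heuristics assumed for illustration, not results from the cited papers:

```python
import torch

def edit_ratio(h, v, alpha):
    """||alpha * v|| / ||h|| per position: how large the edit is relative
    to the hidden state's own scale. Ratios well below 1 tend to stay
    on-distribution; near or above 1, coherence often degrades."""
    return (alpha * v).norm() / h.norm(dim=-1)  # (batch, seq)

# Sweep alpha and watch where generation quality breaks down:
# for alpha in (0.5, 1.0, 2.0, 4.0, 8.0, 16.0):
#     print(alpha, edit_ratio(h, steering_vector, alpha).mean().item())
```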
