Activation Steering: Feature-Guided Interventions for Inference-Time Control
Canonical Papers
- Activation Scaling for Steering and Interpreting Language Models
- Improving Instruction-Following in Language Models through Activation Steering
- Feature Guided Activation Additions

Core Mathematics
Activation steering edits hidden states (or SAE latents) at inference time to change model behavior without retraining, turning interpretability findings into control knobs for deployment.
Classic steering vector addition (one layer/position):

$$h'_\ell = h_\ell + \alpha\, v$$

where $\alpha$ controls the strength and $v$ is the steering direction.
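A minimal PyTorch sketch of this addition, assuming a decoder-only model whose blocks can be wrapped with forward hooks; the module path in the usage comment is a hypothetical placeholder, not a specific library API.

```python
import torch

def make_steering_hook(v: torch.Tensor, alpha: float):
    """Forward hook that adds alpha * v to the hidden state at every position."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output  # (batch, seq, d_model)
        steered = hidden + alpha * v.to(device=hidden.device, dtype=hidden.dtype)
        if isinstance(output, tuple):
            return (steered,) + output[1:]
        return steered
    return hook

# Hypothetical usage on a decoder-only transformer (the layer path is an assumption):
# handle = model.model.layers[12].register_forward_hook(make_steering_hook(v, alpha=4.0))
# output_ids = model.generate(**inputs)
# handle.remove()
```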
Contrastive steering vector (difference of means):

$$v = \frac{1}{|D_{+}|} \sum_{x \in D_{+}} h_\ell(x) \;-\; \frac{1}{|D_{-}|} \sum_{x \in D_{-}} h_\ell(x)$$

Core idea behind "instruction vectors" and CAA-style methods: find the direction that separates desired from undesired behaviors.
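A sketch of the difference-of-means construction, assuming you have already cached per-example hidden states from one layer (for instance at the final token of each contrastive prompt); the unit-normalization is an optional convenience, not part of the definition.

```python
import torch

def contrastive_steering_vector(pos_acts: torch.Tensor, neg_acts: torch.Tensor,
                                normalize: bool = True) -> torch.Tensor:
    """Difference of means between activations on desired vs. undesired prompts.

    pos_acts, neg_acts: (num_examples, d_model) hidden states cached at one layer.
    """
    v = pos_acts.mean(dim=0) - neg_acts.mean(dim=0)
    return v / v.norm() if normalize else v
```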
SAE-latent steering (feature toggle → decode back):

$$h' = h + \alpha\, W_{\text{dec}}\, e_k$$

with $e_k$ a basis vector selecting feature $k$: interpretable steering via learned feature directions.
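A sketch of decoding a one-hot feature bump through an SAE decoder and adding it to the residual stream; here W_dec stands for a (num_features, d_model) decoder matrix from a trained SAE, and normalizing the direction is an assumption for convenience rather than part of the method.

```python
import torch

def sae_feature_steer(hidden: torch.Tensor, W_dec: torch.Tensor,
                      k: int, alpha: float) -> torch.Tensor:
    """Add alpha times SAE feature k's decoder direction to the hidden state.

    hidden: (batch, seq, d_model) residual-stream activations
    W_dec:  (num_features, d_model) SAE decoder matrix; W_dec[k] is the
            direction selected by e_k in the equation above
    """
    d_k = W_dec[k] / W_dec[k].norm()  # feature k's (unit) decoder direction
    return hidden + alpha * d_k.to(device=hidden.device, dtype=hidden.dtype)
```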
Why It Matters for Modern Models
- Inference-time control without fine-tuning: enforce format and length constraints, shift style, reduce or induce refusals at deployment
- Steering is a causal test: if adding feature k induces the behavior, you have localized a mechanism and can validate circuits (#28)
- Minimal-interventions trend: activation scaling learns sparse scalars that strengthen or weaken existing directions, more interpretable than dense edits (see the sketch after this list)
- After SAEs (#27) and circuit discovery (#28), steering makes interpretability actionable—debugging & control, not just analysis
- Bridge to safety: controllable refusal/style/format constraints as post-training knobs, without expensive retraining loops
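As a toy illustration of the minimal-interventions idea referenced above (not the cited paper's exact objective), one can learn a scalar per cached activation that rescales it, with a penalty pulling scalars back toward 1 so only a few directions end up strengthened or weakened; the task loss and activation caching are left abstract here.

```python
import torch

def fit_activation_scalars(cached_acts, task_loss, steps: int = 200,
                           lr: float = 1e-2, l1: float = 1e-3) -> torch.Tensor:
    """Learn one scalar per cached activation vector, initialized at 1.

    cached_acts: list of tensors (one per layer/position) that feed task_loss
    task_loss:   callable mapping the rescaled activations to a scalar loss
    The L1 term keeps most scalars at 1, so the intervention stays sparse.
    """
    scalars = torch.ones(len(cached_acts), requires_grad=True)
    opt = torch.optim.Adam([scalars], lr=lr)
    for _ in range(steps):
        scaled = [s * a for s, a in zip(scalars, cached_acts)]
        loss = task_loss(scaled) + l1 * (scalars - 1.0).abs().sum()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return scalars.detach()
```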
Missing Intuition
What is still poorly explained in textbooks and papers:
- Steering is geometry on the model's activation manifold: small α nudges stay within-distribution; large α throws you off-manifold, producing incoherence and capability loss
- Interpretable steering isn't just about *what direction*, it's about *which basis*: SAE features give human handles, but the decoder can entangle effects
- Compositionality isn't guaranteed: adding two "instruction vectors" can cancel or amplify depending on where they write in residual space
- The strength-capability tradeoff is fundamental: too weak an intervention has no effect; too strong breaks generation quality
- Steering reveals what the model "knows": if you can steer into a behavior, the capability exists in the weights, just not naturally expressed