Reward Hacking & Overoptimization: Goodhart's Law in Preference Optimization
Canonical Papers
- Reward Model Ensembles Help Mitigate Overoptimization
- InfoRM: Mitigating Reward Hacking in RLHF via Information-Theoretic Reward Modeling
- Reward Model Overoptimisation in Iterated RLHF
Core Mathematics
Reward hacking occurs when a policy optimizes a proxy reward (learned from preferences) under distribution shift and finite-data noise. The core structure, shared across RLHF and preference optimization:
Proxy reward optimization with trust-region:

$$\max_{\pi}\;\mathbb{E}_{x\sim\mathcal{D},\,y\sim\pi(\cdot\mid x)}\!\left[r_\phi(x,y)\right]\;-\;\beta\,\mathrm{KL}\!\left(\pi(\cdot\mid x)\,\big\|\,\pi_{\mathrm{ref}}(\cdot\mid x)\right)$$
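A minimal PyTorch sketch of how this objective is typically implemented in RLHF training loops: the sequence-level proxy score is credited at the final token, and a per-token KL penalty enforces the trust region. All names and shapes here are illustrative assumptions.

```python
import torch

def shaped_rewards(proxy_reward: torch.Tensor,   # (batch,) sequence-level proxy score, assumed detached
                   logp_policy: torch.Tensor,    # (batch, seq_len) log pi(y_t | x, y_<t)
                   logp_ref: torch.Tensor,       # (batch, seq_len) log pi_ref(y_t | x, y_<t)
                   beta: float = 0.1) -> torch.Tensor:
    """KL-shaped per-token rewards for KL-regularized RLHF."""
    # Per-token sample-based KL estimate keeps the policy near pi_ref (the trust region).
    kl_per_token = logp_policy - logp_ref
    rewards = -beta * kl_per_token
    # The proxy reward is added only at the final token of each response.
    rewards[:, -1] = rewards[:, -1] + proxy_reward
    return rewards
```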
Conservative optimization (ensemble lower bound):

With ensemble $\{r_{\phi_1},\dots,r_{\phi_K}\}$:

$$\bar r(x,y)=\frac{1}{K}\sum_{k=1}^{K} r_{\phi_k}(x,y),\qquad \sigma(x,y)=\operatorname{std}_k\!\left(r_{\phi_k}(x,y)\right)$$

Then optimize the lower-confidence bound:

$$r_{\mathrm{LCB}}(x,y)=\bar r(x,y)-\lambda\,\sigma(x,y)$$
This is the "anti-Goodhart move": prefer high reward you're confident about.
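A short sketch of the lower-confidence-bound reward, assuming the $K$ reward-model scores for a batch are already stacked into one tensor (names are illustrative):

```python
import torch

def lcb_reward(ensemble_scores: torch.Tensor,  # (K, batch) scores from K reward models
               lam: float = 1.0):
    """Conservative reward: ensemble mean minus lambda times ensemble std."""
    mean = ensemble_scores.mean(dim=0)
    std = ensemble_scores.std(dim=0)   # disagreement; also useful as a monitoring signal
    return mean - lam * std, std

# Hypothetical usage with a list of reward models `reward_models`:
# scores = torch.stack([rm(prompts, responses) for rm in reward_models])
# r_conservative, disagreement = lcb_reward(scores, lam=1.0)
```

Larger $\lambda$ trades raw reward for robustness: samples the ensemble disagrees on are penalized even when their mean score is high.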
InfoRM information bottleneck (filter spurious features):

$$\max_{\theta}\; I(Z;Y)\;-\;\beta\, I(X;Z)$$

where $Y$ is the preference label and $Z$ is the bottleneck representation, which drops "shortcut" features of the input $X$ that cause misgeneralization.
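A schematic sketch of an information-bottleneck reward head in this spirit: a variational bottleneck with a standard-normal prior plus a Bradley-Terry preference loss. The architecture, sizes, and coefficient are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class IBRewardHead(nn.Module):
    """Reward head that scores a compressed latent z instead of the raw LM features."""
    def __init__(self, hidden_dim: int, bottleneck_dim: int = 64):
        super().__init__()
        self.to_mu = nn.Linear(hidden_dim, bottleneck_dim)
        self.to_logvar = nn.Linear(hidden_dim, bottleneck_dim)
        self.scorer = nn.Linear(bottleneck_dim, 1)

    def forward(self, h: torch.Tensor):
        # h: (batch, hidden_dim) pooled LM representation of (prompt, response)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterized sample
        reward = self.scorer(z).squeeze(-1)
        # KL(q(z|h) || N(0, I)): the compression term that limits I(Z; X)
        kl = 0.5 * (mu.pow(2) + logvar.exp() - 1.0 - logvar).sum(dim=-1)
        return reward, kl

def ib_preference_loss(head: IBRewardHead,
                       h_chosen: torch.Tensor,
                       h_rejected: torch.Tensor,
                       beta: float = 1e-3) -> torch.Tensor:
    r_c, kl_c = head(h_chosen)
    r_r, kl_r = head(h_rejected)
    bt_loss = -F.logsigmoid(r_c - r_r).mean()      # preference term: keeps I(Z; Y) high
    return bt_loss + beta * (kl_c + kl_r).mean()   # bottleneck term: squeezes out shortcut features
```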
Why It Matters for Modern Models
- Frontier post-training assumes reward hacking is expected, not rare—conservative/ensemble objectives are standard practice
- KL doesn't "solve" reward hacking, it just slows it down—proxy can be wrong within trust region or policy can exploit loopholes without drifting far
- Offline preference optimization can make models worse—sparse/noisy labels can amplify bad options (Type I) or suppress good ones (Type II)
- Uncertainty is a safety signal: ensembles turn reward hacking into a measurable phenomenon, where "high reward + high disagreement" means distrust (see the sketch after this list)
- After DPO/KTO (#24-25), this explains why alignment is fragile, and why frontier practice is about robust objectives plus monitoring, not just picking an algorithm
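A small monitoring sketch of the "high reward + high disagreement" heuristic, assuming per-rollout ensemble scores and illustrative quantile thresholds:

```python
import torch

def flag_suspect_rollouts(ensemble_scores: torch.Tensor,  # (K, batch) reward-model scores
                          reward_quantile: float = 0.9,
                          std_quantile: float = 0.9) -> torch.Tensor:
    """Flag rollouts with high mean reward AND high ensemble disagreement."""
    mean = ensemble_scores.mean(dim=0)
    std = ensemble_scores.std(dim=0)
    high_reward = mean >= torch.quantile(mean, reward_quantile)
    high_disagreement = std >= torch.quantile(std, std_quantile)
    return high_reward & high_disagreement   # boolean mask of rollouts to audit or down-weight
```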
Missing Intuition
What is still poorly explained in textbooks and papers:
- Goodhart's law is the alignment tax—when learned score becomes target, optimization will find edge cases where score is wrong
- Reward hacking ≠ "model is evil"—often just distribution shift: policy explores outputs reward model never saw, extrapolates incorrectly
- Offline preference optimization is dataset-coverage constrained—if preference data never contains safety-critical edge cases, DPO won't invent them
- Sparsity/noise creates two failure modes: Type I (bad looks good) gets amplified, Type II (good looks bad) gets suppressed
- Ensemble disagreement is early warning system—high mean reward + high variance = region where proxy is likely broken