#26 Scaling & Alignment

⚠️ Reward Hacking & Overoptimization: Goodhart's Law in Preference Optimization

Canonical Papers

Reward Model Ensembles Help Mitigate Overoptimization

Coste et al., ICLR 2024

InfoRM: Mitigating Reward Hacking in RLHF via Information-Theoretic Reward Modeling

Miao et al., NeurIPS 2024

Reward Model Overoptimisation in Iterated RLHF

Wolf et al., arXiv 2025

Core Mathematics

Reward hacking occurs when a proxy reward (learned from preferences) is optimized under distribution shift and finite-data noise: the policy finds outputs that the proxy scores highly but the true objective does not. The core objective shapes across RLHF and preference optimization:

Proxy reward optimization with trust-region:

\max_{\pi_\theta} \mathbb{E}_{y\sim \pi_\theta(\cdot|x)}[\hat r_\phi(x,y)] - \beta\,\text{KL}\!\left(\pi_\theta(\cdot|x) \| \pi_{\text{ref}}(\cdot|x)\right)
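A minimal sketch of how this objective is typically realized during the RL step: the sequence-level proxy score is combined with a Monte Carlo KL penalty against the reference policy. The function name kl_shaped_reward, the tensor shapes, and the PyTorch framing are assumptions for illustration, not a specific library's API.

```python
import torch

def kl_shaped_reward(proxy_reward, logprobs_policy, logprobs_ref, beta=0.1):
    """Combine a sequence-level proxy reward with a per-token KL penalty.

    proxy_reward:    (batch,)    reward-model score \hat{r}_phi(x, y)
    logprobs_policy: (batch, T)  log pi_theta(y_t | x, y_<t) for sampled tokens
    logprobs_ref:    (batch, T)  log pi_ref(y_t | x, y_<t) for the same tokens
    beta:            trust-region strength (larger = stay closer to pi_ref)
    """
    # Monte Carlo estimate of KL(pi_theta || pi_ref) along the sampled sequence.
    kl_per_token = logprobs_policy - logprobs_ref          # (batch, T)
    kl_sequence = kl_per_token.sum(dim=-1)                 # (batch,)

    # Shaped reward that the RL step (e.g. a PPO-style update) actually maximizes.
    return proxy_reward - beta * kl_sequence
```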

Conservative optimization (ensemble lower bound):

With an ensemble \{r_{\phi_j}\}_{j=1}^K of reward models:

\mu(x,y)=\frac{1}{K}\sum_{j=1}^K r_{\phi_j}(x,y), \quad \sigma(x,y)=\sqrt{\frac{1}{K}\sum_{j=1}^K\left(r_{\phi_j}(x,y)-\mu(x,y)\right)^2}

Then optimize lower-confidence bound:

r_{\text{LCB}}(x,y)=\mu(x,y)-\lambda\,\sigma(x,y)

This is the "anti-Goodhart move": prefer high reward you're confident about.
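A sketch of the conservative scoring step, assuming K separately trained reward models that each return a scalar score per (prompt, response); lcb_score and the reward_models interface are illustrative names, not any paper's released code.

```python
import torch

@torch.no_grad()
def lcb_score(reward_models, prompts, responses, lam=1.0):
    """Lower-confidence-bound reward over a reward-model ensemble.

    reward_models: list of K models, each returning a (batch,) score tensor.
    lam:           pessimism strength; lam = 0 recovers the plain ensemble mean.
    """
    # Stack per-model scores: (K, batch)
    scores = torch.stack([rm(prompts, responses) for rm in reward_models])

    mu = scores.mean(dim=0)                    # ensemble mean reward
    sigma = scores.std(dim=0, unbiased=False)  # ensemble disagreement

    # Penalize responses the ensemble disagrees on: r_LCB = mu - lambda * sigma.
    return mu - lam * sigma
```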

InfoRM information bottleneck (filter spurious features):

\max_{\phi} \mathbb{E}[\log p_\phi(\ell|z)] - \alpha\,\text{KL}\!\left(q_\phi(z|x,y) \| p(z)\right)

where \ell is the preference label and z is the bottleneck representation, which drops "shortcut" features that cause reward misgeneralization.
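A sketch of an information-bottleneck reward-model loss in the spirit of InfoRM, assuming a Gaussian encoder q_\phi(z|x,y) trained with a Bradley-Terry preference term (the \log p_\phi(\ell|z) piece for pairwise labels) plus a KL term to a standard normal prior. The class layout and module names are assumptions for illustration, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class IBRewardModel(nn.Module):
    """Reward model with a variational information bottleneck (InfoRM-style sketch)."""

    def __init__(self, encoder, hidden_dim, latent_dim):
        super().__init__()
        self.encoder = encoder                      # maps (x, y) features -> hidden_dim
        self.to_mu = nn.Linear(hidden_dim, latent_dim)
        self.to_logvar = nn.Linear(hidden_dim, latent_dim)
        self.reward_head = nn.Linear(latent_dim, 1)

    def encode(self, features):
        h = self.encoder(features)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
        return z, mu, logvar

    def loss(self, feats_chosen, feats_rejected, alpha=0.1):
        z_c, mu_c, lv_c = self.encode(feats_chosen)
        z_r, mu_r, lv_r = self.encode(feats_rejected)
        r_c = self.reward_head(z_c).squeeze(-1)
        r_r = self.reward_head(z_r).squeeze(-1)

        # Bradley-Terry preference likelihood: chosen should outscore rejected.
        pref_loss = -F.logsigmoid(r_c - r_r).mean()

        # KL(q_phi(z|x,y) || N(0, I)) compresses away shortcut features.
        kl = lambda mu, lv: 0.5 * (mu.pow(2) + lv.exp() - 1.0 - lv).sum(-1).mean()
        return pref_loss + alpha * (kl(mu_c, lv_c) + kl(mu_r, lv_r))
```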

Key Equation
r_{\text{LCB}}(x,y)=\mu(x,y)-\lambda\,\sigma(x,y)


Why It Matters for Modern Models

  • Frontier post-training treats reward hacking as expected, not rare; conservative and ensemble objectives are standard practice.
  • The KL penalty doesn't "solve" reward hacking, it only slows it down: the proxy can be wrong inside the trust region, and the policy can exploit loopholes without drifting far from the reference.
  • Offline preference optimization can make models worse: sparse or noisy labels can amplify bad options (Type I) or suppress good ones (Type II).
  • Uncertainty is a safety signal: ensembles turn reward hacking into a measurable phenomenon, where "high reward + high disagreement" means "don't trust the score" (see the monitoring sketch after this list).
  • After DPO/KTO (#24-25), this explains why alignment is fragile, and why frontier practice centers on robust objectives plus monitoring rather than just the choice of algorithm.
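A minimal sketch of the "high reward + high disagreement" monitor referenced above, reusing the ensemble mean and standard deviation from the LCB sketch; the quantile thresholds are arbitrary assumptions that would be tuned on held-out data in practice.

```python
import torch

def flag_suspect_samples(mu, sigma, mu_quantile=0.9, sigma_quantile=0.9):
    """Flag generations whose ensemble-mean reward AND disagreement are both high.

    mu, sigma: (batch,) ensemble mean and std, e.g. from the LCB sketch above.
    Returns a boolean mask: True where the proxy reward should not be trusted.
    """
    high_reward = mu >= torch.quantile(mu, mu_quantile)
    high_disagreement = sigma >= torch.quantile(sigma, sigma_quantile)
    return high_reward & high_disagreement
```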

Missing Intuition

What is still poorly explained in textbooks and papers:

  • Goodhart's law is the alignment tax: once a learned score becomes the optimization target, the optimizer will find the edge cases where that score is wrong (a toy simulation of this effect follows the list).
  • Reward hacking does not mean the model is "evil"; it is usually just distribution shift: the policy explores outputs the reward model never saw, and the reward model extrapolates incorrectly.
  • Offline preference optimization is constrained by dataset coverage: if the preference data never contains safety-critical edge cases, DPO will not invent them.
  • Sparsity and noise create two failure modes: bad options that look good (Type I) get amplified, and good options that look bad (Type II) get suppressed.
  • Ensemble disagreement is an early-warning system: high mean reward combined with high variance marks regions where the proxy is likely broken.
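A toy simulation of the Goodhart effect under best-of-n selection against a noisy proxy: the harder we select on the proxy, the more the winner's proxy score overstates its true (gold) reward. The Gaussian gold/proxy setup is synthetic and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def best_of_n_gap(n, num_trials=2000, noise=1.0):
    """Average (proxy - gold) reward of the proxy-selected winner among n candidates."""
    gold = rng.normal(size=(num_trials, n))                   # true reward of each candidate
    proxy = gold + noise * rng.normal(size=(num_trials, n))   # noisy learned proxy
    winner = proxy.argmax(axis=1)                             # optimize the proxy
    rows = np.arange(num_trials)
    return (proxy[rows, winner] - gold[rows, winner]).mean()

# Selection pressure grows with n; so does the proxy's overestimate of the winner.
for n in (1, 4, 16, 64, 256):
    print(f"n={n:4d}  proxy overestimates gold by {best_of_n_gap(n):.2f}")
```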

