#26 Scaling & Alignment

⚠️ Reward Hacking & Overoptimization: Goodhart's Law in Preference Optimization

Canonical Papers

Reward Model Ensembles Help Mitigate Overoptimization

Coste et al., ICLR 2024

InfoRM: Mitigating Reward Hacking in RLHF via Information-Theoretic Reward Modeling

Miao et al., NeurIPS 2024

Reward Model Overoptimisation in Iterated RLHF

Wolf et al., arXiv 2025

Core Mathematics

Reward hacking occurs when a proxy reward (learned from preferences) is optimized under distribution shift and finite-data noise: the policy finds outputs that the proxy scores highly but the true objective does not. The core objective shapes across RLHF and preference optimization:

Proxy reward optimization with trust-region:

\max_{\pi_\theta} \mathbb{E}_{y\sim \pi_\theta(\cdot|x)}[\hat r_\phi(x,y)] - \beta\,\text{KL}\!\left(\pi_\theta(\cdot|x) \| \pi_{\text{ref}}(\cdot|x)\right)
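A minimal sketch of how this objective is typically realized during the RL step: the sequence-level proxy score is combined with a Monte Carlo KL penalty against the reference policy. The function name kl_shaped_reward, the tensor shapes, and the PyTorch framing are assumptions for illustration, not a specific library's API.

```python
import torch

def kl_shaped_reward(proxy_reward, logprobs_policy, logprobs_ref, beta=0.1):
    """Combine a sequence-level proxy reward with a per-token KL penalty.

    proxy_reward:    (batch,)    reward-model score \hat{r}_phi(x, y)
    logprobs_policy: (batch, T)  log pi_theta(y_t | x, y_<t) for sampled tokens
    logprobs_ref:    (batch, T)  log pi_ref(y_t | x, y_<t) for the same tokens
    beta:            trust-region strength (larger = stay closer to pi_ref)
    """
    # Monte Carlo estimate of KL(pi_theta || pi_ref) along the sampled sequence.
    kl_per_token = logprobs_policy - logprobs_ref          # (batch, T)
    kl_sequence = kl_per_token.sum(dim=-1)                 # (batch,)

    # Shaped reward that the RL step (e.g. a PPO-style update) actually maximizes.
    return proxy_reward - beta * kl_sequence
```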

Conservative optimization (ensemble lower bound):

With an ensemble \{r_{\phi_j}\}_{j=1}^K of reward models:

\mu(x,y)=\frac{1}{K}\sum_{j=1}^K r_{\phi_j}(x,y), \quad \sigma(x,y)=\sqrt{\frac{1}{K}\sum_{j=1}^K\left(r_{\phi_j}(x,y)-\mu(x,y)\right)^2}

Then optimize lower-confidence bound:

r_{\text{LCB}}(x,y)=\mu(x,y)-\lambda\,\sigma(x,y)

This is the "anti-Goodhart move": prefer high reward you're confident about.
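A sketch of the conservative scoring step, assuming K separately trained reward models that each return a scalar score per (prompt, response); lcb_score and the reward_models interface are illustrative names, not any paper's released code.

```python
import torch

@torch.no_grad()
def lcb_score(reward_models, prompts, responses, lam=1.0):
    """Lower-confidence-bound reward over a reward-model ensemble.

    reward_models: list of K models, each returning a (batch,) score tensor.
    lam:           pessimism strength; lam = 0 recovers the plain ensemble mean.
    """
    # Stack per-model scores: (K, batch)
    scores = torch.stack([rm(prompts, responses) for rm in reward_models])

    mu = scores.mean(dim=0)                    # ensemble mean reward
    sigma = scores.std(dim=0, unbiased=False)  # ensemble disagreement

    # Penalize responses the ensemble disagrees on: r_LCB = mu - lambda * sigma.
    return mu - lam * sigma
```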

InfoRM information bottleneck (filter spurious features):

\max_{\phi} \mathbb{E}[\log p_\phi(\ell|z)] - \alpha\,\text{KL}\!\left(q_\phi(z|x,y) \| p(z)\right)

where \ell is the preference label and z is the bottleneck representation, which drops "shortcut" features that cause reward misgeneralization.
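A sketch of an information-bottleneck reward-model loss in the spirit of InfoRM, assuming a Gaussian encoder q_\phi(z|x,y) trained with a Bradley-Terry preference term (the \log p_\phi(\ell|z) piece for pairwise labels) plus a KL term to a standard normal prior. The class layout and module names are assumptions for illustration, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class IBRewardModel(nn.Module):
    """Reward model with a variational information bottleneck (InfoRM-style sketch)."""

    def __init__(self, encoder, hidden_dim, latent_dim):
        super().__init__()
        self.encoder = encoder                      # maps (x, y) features -> hidden_dim
        self.to_mu = nn.Linear(hidden_dim, latent_dim)
        self.to_logvar = nn.Linear(hidden_dim, latent_dim)
        self.reward_head = nn.Linear(latent_dim, 1)

    def encode(self, features):
        h = self.encoder(features)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
        return z, mu, logvar

    def loss(self, feats_chosen, feats_rejected, alpha=0.1):
        z_c, mu_c, lv_c = self.encode(feats_chosen)
        z_r, mu_r, lv_r = self.encode(feats_rejected)
        r_c = self.reward_head(z_c).squeeze(-1)
        r_r = self.reward_head(z_r).squeeze(-1)

        # Bradley-Terry preference likelihood: chosen should outscore rejected.
        pref_loss = -F.logsigmoid(r_c - r_r).mean()

        # KL(q_phi(z|x,y) || N(0, I)) compresses away shortcut features.
        kl = lambda mu, lv: 0.5 * (mu.pow(2) + lv.exp() - 1.0 - lv).sum(-1).mean()
        return pref_loss + alpha * (kl(mu_c, lv_c) + kl(mu_r, lv_r))
```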

Key Equation
r_{\text{LCB}}(x,y)=\mu(x,y)-\lambda\,\sigma(x,y)


Why It Matters for Modern Models

  • Frontier post-training treats reward hacking as expected, not rare; conservative and ensemble objectives are standard practice.
  • The KL penalty doesn't "solve" reward hacking, it only slows it down: the proxy can be wrong inside the trust region, and the policy can exploit loopholes without drifting far from the reference.
  • Offline preference optimization can make models worse: sparse or noisy labels can amplify bad options (Type I) or suppress good ones (Type II).
  • Uncertainty is a safety signal: ensembles turn reward hacking into a measurable phenomenon, where "high reward + high disagreement" means "don't trust the score" (see the monitoring sketch after this list).
  • After DPO/KTO (#24-25), this explains why alignment is fragile, and why frontier practice centers on robust objectives plus monitoring rather than just the choice of algorithm.
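A minimal sketch of the "high reward + high disagreement" monitor referenced above, reusing the ensemble mean and standard deviation from the LCB sketch; the quantile thresholds are arbitrary assumptions that would be tuned on held-out data in practice.

```python
import torch

def flag_suspect_samples(mu, sigma, mu_quantile=0.9, sigma_quantile=0.9):
    """Flag generations whose ensemble-mean reward AND disagreement are both high.

    mu, sigma: (batch,) ensemble mean and std, e.g. from the LCB sketch above.
    Returns a boolean mask: True where the proxy reward should not be trusted.
    """
    high_reward = mu >= torch.quantile(mu, mu_quantile)
    high_disagreement = sigma >= torch.quantile(sigma, sigma_quantile)
    return high_reward & high_disagreement
```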

Missing Intuition

What is still poorly explained in textbooks and papers:

  • Goodhart's law is the alignment tax: once a learned score becomes the optimization target, the optimizer will find the edge cases where that score is wrong (a toy simulation of this effect follows the list).
  • Reward hacking does not mean the model is "evil"; it is usually just distribution shift: the policy explores outputs the reward model never saw, and the reward model extrapolates incorrectly.
  • Offline preference optimization is constrained by dataset coverage: if the preference data never contains safety-critical edge cases, DPO will not invent them.
  • Sparsity and noise create two failure modes: bad options that look good (Type I) get amplified, and good options that look bad (Type II) get suppressed.
  • Ensemble disagreement is an early-warning system: high mean reward combined with high variance marks regions where the proxy is likely broken.
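A toy simulation of the Goodhart effect under best-of-n selection against a noisy proxy: the harder we select on the proxy, the more the winner's proxy score overstates its true (gold) reward. The Gaussian gold/proxy setup is synthetic and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def best_of_n_gap(n, num_trials=2000, noise=1.0):
    """Average (proxy - gold) reward of the proxy-selected winner among n candidates."""
    gold = rng.normal(size=(num_trials, n))                   # true reward of each candidate
    proxy = gold + noise * rng.normal(size=(num_trials, n))   # noisy learned proxy
    winner = proxy.argmax(axis=1)                             # optimize the proxy
    rows = np.arange(num_trials)
    return (proxy[rows, winner] - gold[rows, winner]).mean()

# Selection pressure grows with n; so does the proxy's overestimate of the winner.
for n in (1, 4, 16, 64, 256):
    print(f"n={n:4d}  proxy overestimates gold by {best_of_n_gap(n):.2f}")
```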

