15. Scaling & Alignment

Preference-Based Alignment: RLHF, Reward Modeling, Constitutional AI

Canonical Papers

Deep Reinforcement Learning from Human Preferences

Christiano et al., 2017, NeurIPS

Training Language Models to Follow Instructions with Human Feedback

Ouyang et al., 2022, NeurIPS (InstructGPT)

Constitutional AI: Harmlessness from AI Feedback

Bai et al., 2022, Anthropic

Core Mathematics

Reward modeling from preferences: Given human comparisons between outputs y_a and y_b for a prompt x, learn a reward model r_\phi via the Bradley–Terry model:

P(y_a \succ y_b \mid x) = \frac{\exp\big(r_\phi(x, y_a)\big)}{\exp\big(r_\phi(x, y_a)\big) + \exp\big(r_\phi(x, y_b)\big)}
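As a concrete anchor, here is a minimal PyTorch sketch of the pairwise Bradley–Terry loss. The tiny `reward_model` and the random feature tensors are illustrative stand-ins for a transformer that scores (prompt, response) pairs; nothing here is tied to a specific library beyond PyTorch itself.

```python
# Minimal sketch of the Bradley–Terry reward-modeling loss (PyTorch).
# `reward_model` is a toy stand-in for a transformer + scalar head
# that maps (prompt, response) features to a score r_phi(x, y).
import torch
import torch.nn as nn

reward_model = nn.Sequential(
    nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 1)
)

def bradley_terry_loss(x_ya: torch.Tensor, x_yb: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood that y_a is preferred over y_b.

    x_ya, x_yb: feature tensors for (prompt, chosen) and (prompt, rejected),
    shape (batch, 16) in this toy setup.
    """
    r_a = reward_model(x_ya).squeeze(-1)   # r_phi(x, y_a)
    r_b = reward_model(x_yb).squeeze(-1)   # r_phi(x, y_b)
    # -log sigmoid(r_a - r_b) == -log [exp(r_a) / (exp(r_a) + exp(r_b))]
    return -torch.nn.functional.logsigmoid(r_a - r_b).mean()

# One optimization step on a random "preference batch".
opt = torch.optim.Adam(reward_model.parameters(), lr=1e-3)
x_chosen, x_rejected = torch.randn(8, 16), torch.randn(8, 16)
opt.zero_grad()
loss = bradley_terry_loss(x_chosen, x_rejected)
loss.backward()
opt.step()
```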

RLHF objective: Fine-tune the policy \pi_\theta(y \mid x) to maximize reward while staying close to the reference model \pi_0:

\max_\theta \, \mathbb{E}_{x,\, y \sim \pi_\theta}\big[r_\phi(x,y)\big] - \beta\, \mathrm{KL}\big(\pi_\theta(\cdot \mid x) \,\|\, \pi_0(\cdot \mid x)\big)
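A minimal sketch of how the KL penalty typically enters this objective in PPO-style RLHF trainers: the sampled response's reward is shaped by the log-probability gap between the policy and the frozen reference model. The tensor values and the coefficient beta = 0.1 below are made up for illustration.

```python
# Sketch of the KL-regularized RLHF reward used by PPO-style trainers.
# The names (logp_policy, logp_ref, reward) are illustrative; in practice
# they come from the policy, the frozen reference model, and the reward model.
import torch

beta = 0.1  # KL coefficient (illustrative value)

def shaped_reward(reward: torch.Tensor,
                  logp_policy: torch.Tensor,
                  logp_ref: torch.Tensor) -> torch.Tensor:
    """Per-sample term: r_phi(x, y) - beta * [log pi_theta(y|x) - log pi_0(y|x)].

    For y sampled from pi_theta, the bracketed term is a single-sample
    Monte Carlo estimate of KL(pi_theta(.|x) || pi_0(.|x)).
    """
    return reward - beta * (logp_policy - logp_ref)

# Toy usage: log-probs of two sampled responses under policy and reference.
logp_policy = torch.tensor([-12.3, -8.7])
logp_ref    = torch.tensor([-13.1, -8.2])
reward      = torch.tensor([  0.9,  0.4])
print(shaped_reward(reward, logp_policy, logp_ref))
```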

Constitutional AI: the "labeler" is another model guided by a constitution (a set of natural-language principles), so preference labels for harmlessness come from AI feedback rather than from humans.
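A structural sketch of that AI-feedback labeling loop, under the assumption that `generate` and `judge` wrap calls to an assistant model and a feedback model. Both are stubbed out here, and the principles are paraphrases rather than quotes from any published constitution.

```python
# Structural sketch of AI-feedback preference labeling (the RLAIF-style stage
# of Constitutional AI). `generate` and `judge` are hypothetical stand-ins for
# model calls; here they return placeholder values so the script runs.
import random

CONSTITUTION = [
    "Choose the response that is more helpful, honest, and harmless.",
    "Choose the response that avoids encouraging illegal or dangerous activity.",
]

def generate(prompt: str) -> str:
    # Placeholder: sample a response from the assistant model.
    return f"response-{random.randint(0, 999)} to: {prompt}"

def judge(prompt: str, resp_a: str, resp_b: str, principle: str) -> str:
    # Placeholder: the feedback model picks the response that better
    # satisfies the principle. Here the choice is random.
    return random.choice(["A", "B"])

def label_pair(prompt: str) -> dict:
    resp_a, resp_b = generate(prompt), generate(prompt)
    principle = random.choice(CONSTITUTION)
    choice = judge(prompt, resp_a, resp_b, principle)
    chosen, rejected = (resp_a, resp_b) if choice == "A" else (resp_b, resp_a)
    # These (chosen, rejected) pairs then train a preference/reward model,
    # exactly as in the Bradley–Terry setup above, but with AI labels.
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}

print(label_pair("How do I secure my home WiFi network?"))
```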

Key Equation
\max_\theta \, \mathbb{E}_{\pi_\theta}\big[r_\phi(x,y)\big] - \beta\, \mathrm{KL}\big(\pi_\theta \,\|\, \pi_0\big)


Why It Matters for Modern Models

  • GPT-4, Claude 3, and Gemini rely on RLHF-style post-training procedures to be helpful, honest, and harmless
  • Constitutional AI ideas are central to Anthropic's Claude models

Missing Intuition

What is still poorly explained in textbooks and papers:

  • Conceptual explanation of RLHF as a KL-regularized Bayesian update on behavior (see the toy sketch after this list)
  • How over-optimization of learned reward leads to reward hacking and distribution shift
  • Interactive visualizations of policy distributions before/after RLHF
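To make the first two bullets concrete, here is a toy numeric sketch (the rewards and reference probabilities are invented). The KL-regularized objective is maximized by the tilted distribution \pi^*(y \mid x) \propto \pi_0(y \mid x)\, \exp(r_\phi(x,y)/\beta), so RLHF acts like a Bayesian update of the reference policy; shrinking \beta shows how over-optimization concentrates probability on a few high-reward outputs.

```python
# Toy illustration of RLHF as a KL-regularized Bayesian update: for the
# objective max E[r] - beta * KL(pi || pi_0), the optimum over all
# distributions is pi*(y|x) proportional to pi_0(y|x) * exp(r(x,y) / beta).
# The rewards and reference distribution below are invented for illustration.
import numpy as np

pi_0 = np.array([0.50, 0.30, 0.15, 0.05])   # reference policy over 4 candidate responses
r    = np.array([0.0,  1.0,  2.0,  4.0])    # learned reward for each response

def tilted_policy(pi_ref: np.ndarray, reward: np.ndarray, beta: float) -> np.ndarray:
    """Closed-form optimum of the KL-regularized objective."""
    w = pi_ref * np.exp(reward / beta)
    return w / w.sum()

for beta in [10.0, 1.0, 0.1]:
    print(beta, tilted_policy(pi_0, r, beta).round(3))
# Large beta: the policy stays near pi_0. Small beta: probability mass
# collapses onto the highest-reward response, the regime where reward
# hacking and distribution shift become a concern.
```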

Connections

Next Moves

Explore this concept from different angles — like a mathematician would.