Alignment

RLHF: Reward Modeling + KL-Regularized Policy Optimization

RLHF trains a reward model from pairwise preferences, then reweights a reference policy toward high learned reward while a KL penalty limits distribution shift.


01

Intuition


Canonical sources: Christiano et al., "Deep Reinforcement Learning from Human Preferences", and Ouyang et al., "Training language models to follow instructions with human feedback".

A pretrained or supervised-finetuned language model gives a distribution over completions. In InstructGPT-style RLHF, the reference policy is often the supervised-finetuned model trained from demonstrations. RLHF asks:

When humans prefer one completion over another, how should that preference move probability mass?

The mechanism has two stages.

First, train a reward model from comparisons. It does not learn an absolute moral score; it learns reward differences that make the preferred completion more likely under a pairwise preference model.

Second, optimize a policy against that learned reward while penalizing movement away from a reference model. In the finite-action picture:

RLHF multiplies the reference probability of each completion by an exponential reward bonus, then renormalizes.

High reward pulls probability upward. The KL penalty controls how far the new policy may drift. If the reward model is a proxy with exploitable errors, optimizing too aggressively can move probability mass toward outputs that score well under the proxy but are worse under the real target.

02

Math


The two moving pieces are the preference-trained reward gap and the KL-shaped policy update:

$$\Delta_\phi(x,y_w,y_\ell) = r_\phi(x,y_w)-r_\phi(x,y_\ell), \qquad P_\phi(y_w\succ y_\ell\mid x)=\sigma(\Delta_\phi).$$

$$J_x(\pi)= \sum_y \pi(y\mid x)\,r_\phi(x,y) - \beta\,\mathrm{KL}\big(\pi(\cdot\mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot\mid x)\big), \qquad \pi^*(y\mid x)= \frac{\pi_{\mathrm{ref}}(y\mid x)\exp(r_\phi(x,y)/\beta)}{\sum_{y'}\pi_{\mathrm{ref}}(y'\mid x)\exp(r_\phi(x,y')/\beta)}.$$

The rest of this section unpacks those two witnesses.

Preference data and reward-model likelihood

Let a preference datum be $(x,y_w,y_\ell)$, where $y_w$ is preferred to $y_\ell$ for prompt $x$. Define the reward gap

$$\Delta = r_\phi(x,y_w)-r_\phi(x,y_\ell).$$

A reward model $r_\phi(x,y)\in\mathbb R$ predicts

$$P_\phi(y_w\succ y_\ell\mid x)=\sigma(\Delta).$$

The negative log-likelihood is

$$\mathcal L_{\mathrm{RM}}(\phi) = -\frac{1}{N} \sum_{i=1}^N \log\sigma(\Delta_i),$$

where

$$\Delta_i = r_\phi(x_i,y_{w,i}) - r_\phi(x_i,y_{\ell,i}).$$
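As a minimal sketch of this loss, the snippet below evaluates the mean negative log-likelihood from arrays of winner and loser rewards. The function name `reward_model_nll` and the reward values are illustrative assumptions, not part of the page; the only substantive choice is computing $-\log\sigma(\Delta)$ as $\log(1+e^{-\Delta})$ via `np.logaddexp`, which stays finite for large gaps.

```python
import numpy as np

def reward_model_nll(rewards_w, rewards_l):
    """Mean of -log(sigmoid(delta)) over pairs, with delta = r_w - r_l."""
    delta = np.asarray(rewards_w, dtype=float) - np.asarray(rewards_l, dtype=float)
    # -log(sigmoid(d)) = log(1 + exp(-d)); logaddexp avoids overflow for large |d|.
    return float(np.mean(np.logaddexp(0.0, -delta)))

# Hypothetical reward-model outputs for three comparisons (illustrative values).
r_w = np.array([1.2, 0.1, -0.3])   # rewards of the preferred completions
r_l = np.array([0.4, 0.5, -2.0])   # rewards of the rejected completions

nll = reward_model_nll(r_w, r_l)
nll_tied = reward_model_nll(np.zeros(3), np.zeros(3))  # every gap is zero

print("NLL on the toy pairs:", round(nll, 4))
print("NLL with zero gaps:  ", round(nll_tied, 4))  # equals log(2)
```

A zero gap gives probability one half to each ordering, so the loss per pair is $\log 2$; gaps that agree with the labels push the loss toward zero.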

Only reward differences inside the same prompt are observed. Therefore $r_\phi(x,y)+c(x)$ gives the same pairwise probabilities as $r_\phi(x,y)$.
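The per-prompt shift invariance can be checked numerically. This is a small sketch with made-up rewards for two prompts and three completions each; the offsets in `c` play the role of $c(x)$ and are arbitrary.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pairwise_probs(rewards):
    # P[x, i, j] = sigmoid(r[x, i] - r[x, j]) for completions i, j of prompt x.
    return sigmoid(rewards[:, :, None] - rewards[:, None, :])

# Hypothetical rewards: rows are prompts, columns are candidate completions.
r = np.array([[0.5, -1.0, 2.0],
              [1.5,  0.0, 0.3]])
c = np.array([[7.0], [-3.0]])  # one arbitrary offset c(x) per prompt
r_shifted = r + c              # broadcasts each offset across that prompt's row

same = np.allclose(pairwise_probs(r), pairwise_probs(r_shifted))
print("pairwise probabilities unchanged by c(x):", same)
```

Because the offset differs between prompts, rewards for different prompts are not directly comparable unless a convention fixes $c(x)$.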

Reward scale also needs a convention. With a fixed logistic noise model, reward gaps are measured in that model's log-odds units. With perfectly separable hard preferences, an unregularized reward model can drive gaps toward infinity. Practical systems therefore normalize, regularize, or otherwise choose a usable reward scale before policy optimization.
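To illustrate the separable case, the sketch below reduces the problem to a single pair whose winner is always preferred and treats the reward gap itself as the only parameter (a simplifying assumption). Without regularization the gap keeps growing with more gradient steps; with a small L2 penalty (the value 0.05 is illustrative) it settles at a finite value.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_gap(l2, steps, lr=0.5):
    """Gradient descent on -log(sigmoid(gap)) + 0.5 * l2 * gap**2 for one pair."""
    gap = 0.0
    for _ in range(steps):
        grad = (sigmoid(gap) - 1.0) + l2 * gap
        gap -= lr * grad
    return gap

gap_short = fit_gap(l2=0.0, steps=500)    # unregularized, fewer steps
gap_long = fit_gap(l2=0.0, steps=5000)    # unregularized, more steps
gap_reg = fit_gap(l2=0.05, steps=5000)    # L2-regularized

print("unregularized gap after 500 steps: ", round(gap_short, 3))
print("unregularized gap after 5000 steps:", round(gap_long, 3))
print("regularized gap after 5000 steps:  ", round(gap_reg, 3))
```

The unregularized gap never converges; it only grows more slowly. Since the policy update divides reward by $\beta$, an unbounded reward scale would make any fixed $\beta$ meaningless.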

KL-regularized policy optimization

For a fixed prompt $x$, reference policy $\pi_{\mathrm{ref}}$, and candidate policy $\pi$, define expected reward

$$R_x(\pi) = \sum_y \pi(y\mid x)\,r_\phi(x,y)$$

and the KL term

$$K_x(\pi)=\mathrm{KL}\big(\pi(\cdot\mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot\mid x)\big).$$

The RLHF objective is

$$J_x(\pi)=R_x(\pi)-\beta K_x(\pi).$$

Equivalently, if

$$q_y = \log \frac{\pi(y\mid x)}{\pi_{\mathrm{ref}}(y\mid x)},$$

then

$$J_x(\pi) = \sum_y \pi(y\mid x)\{r_\phi(x,y)-\beta q_y\}.$$

Assume $\pi_{\mathrm{ref}}(y\mid x)>0$ on the candidate support. Optimize $J_x$ over the finite distribution $\pi(\cdot\mid x)$ subject to $\sum_y \pi(y\mid x)=1$. The Lagrange stationarity condition is

$$r_\phi(x,y)-\beta(q_y+1)+\lambda=0.$$

Solving for $\pi(y\mid x)$ and normalizing gives

$$\pi^*(y\mid x) = \frac{\pi_{\mathrm{ref}}(y\mid x)\exp(r_\phi(x,y)/\beta)}{Z(x)}.$$

Here

$$Z(x) = \sum_{y'} \pi_{\mathrm{ref}}(y'\mid x) \exp(r_\phi(x,y')/\beta).$$
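For readers who want the intermediate algebra, the steps from the stationarity condition to the normalized policy can be written out as follows. This is a worked expansion using the same symbols as above; the last line, the optimal value of the objective, is a standard consequence of substituting $\pi^*$ back into $J_x$ and is included as a sanity check.

```latex
% Stationarity, with q_y = \log\frac{\pi(y\mid x)}{\pi_{\mathrm{ref}}(y\mid x)}:
\begin{align*}
r_\phi(x,y) - \beta(q_y + 1) + \lambda &= 0 \\
q_y &= \frac{r_\phi(x,y)}{\beta} + \frac{\lambda}{\beta} - 1 \\
\pi(y\mid x) &= \pi_{\mathrm{ref}}(y\mid x)\,\exp\!\big(r_\phi(x,y)/\beta\big)\,\exp\!\big(\lambda/\beta - 1\big).
\end{align*}
% The constraint \sum_y \pi(y\mid x) = 1 fixes the multiplier:
\begin{align*}
\exp\!\big(\lambda/\beta - 1\big) &= \frac{1}{Z(x)}.
\end{align*}
% Substituting \pi^* back into the objective:
\begin{align*}
J_x(\pi^*) &= \sum_y \pi^*(y\mid x)\Big\{ r_\phi(x,y) - \beta\Big(\frac{r_\phi(x,y)}{\beta} - \log Z(x)\Big)\Big\} = \beta \log Z(x).
\end{align*}
```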

Larger $\beta$ keeps the policy closer to the reference. Smaller $\beta$ lets learned reward dominate.
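A quick sweep makes the role of $\beta$ concrete. The sketch below uses an illustrative reference distribution and reward vector (both assumptions) and reports the KL divergence from the reference for several values of $\beta$.

```python
import numpy as np

def kl_regularized_policy(pi_ref, reward, beta):
    # pi*(y) is proportional to pi_ref(y) * exp(reward(y) / beta).
    logits = np.log(pi_ref) + reward / beta
    logits = logits - logits.max()   # stabilize the exponentials
    w = np.exp(logits)
    return w / w.sum()

def kl(p, q):
    mask = p > 0                     # treat 0 * log(0) as 0
    return float(np.sum(p[mask] * (np.log(p[mask]) - np.log(q[mask]))))

pi_ref = np.array([0.35, 0.30, 0.20, 0.15])
reward = np.array([0.2, -0.8, 0.1, 0.9])   # illustrative learned rewards

betas = [0.1, 0.5, 1.0, 5.0, 50.0]
kls = [kl(kl_regularized_policy(pi_ref, reward, b), pi_ref) for b in betas]

for b, k in zip(betas, kls):
    print(f"beta={b:>5}: KL(pi* || pi_ref) = {k:.5f}")
```

The divergence shrinks monotonically as $\beta$ grows, and the policy approaches the reference in the large-$\beta$ limit.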

Rearranging gives the bridge to DPO:

$$r_\phi(x,y) = \beta\log \frac{\pi^*(y\mid x)}{\pi_{\mathrm{ref}}(y\mid x)} + \beta\log Z(x).$$

The final term depends only on the prompt, so it cancels in preference differences.
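The cancellation can be verified on a toy distribution. In this sketch (values are illustrative assumptions), `implicit` is the log-ratio term $\beta\log(\pi^*/\pi_{\mathrm{ref}})$; it differs from the reward by the prompt-level constant $\beta\log Z(x)$, so all pairwise differences agree.

```python
import numpy as np

pi_ref = np.array([0.35, 0.30, 0.20, 0.15])
reward = np.array([0.2, -0.8, 0.1, 0.9])   # illustrative r_phi(x, y) for one prompt
beta = 0.7

# Optimal policy in log space: log pi* = log pi_ref + r / beta - log Z.
logits = np.log(pi_ref) + reward / beta
log_Z = np.log(np.sum(np.exp(logits)))
log_pi_star = logits - log_Z

# Implicit reward recovered from the policy alone.
implicit = beta * (log_pi_star - np.log(pi_ref))

# Matrices of pairwise differences, entry [i, j] = value[i] - value[j].
diff_reward = reward[:, None] - reward[None, :]
diff_implicit = implicit[:, None] - implicit[None, :]

print("constant offset beta*log Z:", round(float(beta * log_Z), 4))
print("differences match:", np.allclose(diff_reward, diff_implicit))
```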

PPO is the optimizer, not the definition

In language-model RLHF, the candidate space is enormous. InstructGPT-style RLHF uses PPO to optimize the learned reward with a KL penalty to the supervised-finetuned or reference model. The clean finite-action formula above is the exact optimum of a finite-action regularized objective. PPO is a stochastic optimizer for the language-model version; it is not this closed-form update, and a parametric policy with per-token KL details need not trace the toy optimum exactly.
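To make the distinction between objective and optimizer concrete, here is a minimal sketch that reaches the finite-action optimum iteratively instead of by the closed form. It is plain gradient ascent on softmax logits with exact gradients, not PPO: there is no sampling, clipping, value function, or per-token KL. The reward values, step size, and iteration count are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

pi_ref = np.array([0.35, 0.30, 0.20, 0.15])
reward = np.array([1.0, -0.5, 0.3, 0.0])   # illustrative learned rewards
beta = 0.7

# Closed-form optimum of the finite-action objective.
pi_star = softmax(np.log(pi_ref) + reward / beta)

# Iterative route: parametrize pi = softmax(theta) and ascend J.
theta = np.log(pi_ref).copy()   # start at the reference policy
lr = 0.5
for _ in range(5000):
    pi = softmax(theta)
    # KL-penalized reward per action: r - beta * log(pi / pi_ref).
    a = reward - beta * (np.log(pi) - np.log(pi_ref))
    # Exact gradient of J with respect to the logits.
    grad = pi * (a - np.dot(pi, a))
    theta += lr * grad

pi_gd = softmax(theta)
print("closed form:    ", np.round(pi_star, 4))
print("gradient ascent:", np.round(pi_gd, 4))
```

In this tiny setting the two routes agree. A language-model policy has no enumerable action set, so only the iterative route is available, and its result depends on the optimizer and parametrization.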

Reward hacking

If $r_\phi(x,y)$ differs from the real target $u(x,y)$, then low $\beta$ can concentrate the policy on outputs with high proxy reward and low true utility. KL regularization reduces this pressure but does not make the proxy correct.
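A toy example shows the effect. In the sketch below, `proxy` stands in for $r_\phi$ and `true_u` for $u$; both vectors are invented for illustration, with the last completion scoring highest under the proxy and lowest under the true utility.

```python
import numpy as np

def kl_regularized_policy(pi_ref, reward, beta):
    logits = np.log(pi_ref) + reward / beta
    logits = logits - logits.max()
    w = np.exp(logits)
    return w / w.sum()

pi_ref = np.array([0.4, 0.3, 0.2, 0.1])
proxy = np.array([0.0, 1.0, 2.0, 5.0])     # hypothetical learned reward r_phi
true_u = np.array([0.0, 1.0, 2.0, -3.0])   # hypothetical true utility u; last entry is the exploit

results = {}
for beta in [5.0, 1.0, 0.2]:
    pi = kl_regularized_policy(pi_ref, proxy, beta)
    # Store (expected proxy reward, expected true utility) under pi*.
    results[beta] = (float(pi @ proxy), float(pi @ true_u))
    print(f"beta={beta}: proxy={results[beta][0]:.3f}, true={results[beta][1]:.3f}")
```

As $\beta$ decreases, expected proxy reward rises while expected true utility falls, because probability mass moves onto the exploitable completion.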

03

Code


This witness keeps the pieces visible: pairwise reward-model fitting, reward-shift invariance, and the KL-regularized policy that reweights a reference distribution.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(logits):
    logits = logits - logits.max()
    e = np.exp(logits)
    return e / e.sum()

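# One prompt, four candidate completions.
# Shape: rewards, pi_ref, pi_star are all (K,).
# Each row of pairs is [winner, loser]: the first index was preferred to the second.
pairs = np.array([
    [0, 1],
    [0, 2],
    [2, 1],
    [3, 1],
    [0, 3],
])

K = 4
r = np.zeros(K)  # reward estimates, one per completion
lr = 0.3         # gradient-descent step size
l2 = 0.05        # L2 penalty keeps reward gaps finite

for _ in range(800):
    grad = l2 * r  # gradient of the L2 penalty 0.5 * l2 * ||r||^2

    for winner, loser in pairs:
        margin = r[winner] - r[loser]
        p = sigmoid(margin)
        g = p - 1.0  # d/d(margin) of -log(sigmoid(margin))
        grad[winner] += g
        grad[loser] -= g

    r -= lr * grad / len(pairs)
    r -= r.mean()  # choose one representative of the shift-equivalence class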

pi_ref = np.array([0.35, 0.30, 0.20, 0.15])
beta = 0.7

def kl_regularized_policy(reward):
    logits = np.log(pi_ref) + reward / beta
    return softmax(logits)

pi_star = kl_regularized_policy(r)
pi_shifted = kl_regularized_policy(r + 10.0)

kl = np.sum(pi_star * (np.log(pi_star) - np.log(pi_ref)))

print("learned reward representative:", np.round(r, 3))
print("reference policy:             ", np.round(pi_ref, 3))
print("KL-regularized policy:        ", np.round(pi_star, 3))
print("KL(pi* || pi_ref):            ", round(float(kl), 4))

assert np.allclose(pi_star, pi_shifted)

Adding a constant to every reward changes neither pairwise preference probabilities nor the KL-regularized policy. The policy depends on reward differences and on how strongly $\beta$ anchors it to the reference.

04

Interactive Demo


Use the demo as a probability-shaping machine. Change $\beta$, toggle a proxy reward gap, and add a reward shift. Watch the reference policy get reweighted, and notice that shifting every reward leaves the policy unchanged.


Source Grounding

Canonical references for the mechanism on this page.

Christiano et al. (2017), "Deep reinforcement learning from human preferences" (paper).
Grounds preference-labeled reward models as a way to train behavior from comparative human feedback.

Ouyang et al. (2022), "Training language models to follow instructions with human feedback" (paper).
Grounds the modern instruction-following RLHF pipeline: demonstrations, preference rankings, reward model, and PPO.

Claim Review

RLHF trains a reward model from pairwise preferences, then reweights a reference policy toward high learned reward while a KL penalty limits distribution shift.

Status: 1 substantive review recorded

Claims without a substantive review badge still need exact source-support review.

Sources: 2 references

christiano-2017-human-preferences, ouyang-2022-instructgpt

Witnesses: 4 local objects

Use equation, code, and demo objects to check whether the source support is operational.

Substantively reviewed: Preference-based RLHF learns a reward model from human comparisons and optimizes a policy against it; InstructGPT-style RLHF adds SFT and PPO with a KL penalty to the SFT/reference policy. Claim metadata: source checked.

Christiano trains a reward predictor from trajectory-segment comparisons and optimizes a policy on predicted reward. Ouyang describes demonstrations -> SFT, rankings -> reward model, and PPO against RM reward with per-token KL to the SFT policy. Local math/code/demo instantiate reward gaps, sigmoid preferences, pi_ref*exp(r/beta), KL, shift invariance, and proxy-gap warnings.

Sources: Deep reinforcement learning from human preferences; Training language models to follow instructions with human feedback.

Reviews only preference-modeling and KL-regularized optimization mechanics. It does not certify reward as true human objective, PPO exact attainment of the finite-action optimum, PPO-ptx/pretraining-gradient details, reward-hacking prevention, or broad alignment guarantees.

A bounded review summary is present; still check caveats and exact source scope.

Christiano supports learning a reward predictor from pairwise trajectory preferences and optimizing a policy on predicted reward. Ouyang supports InstructGPT demonstrations -> SFT, rankings -> RM, and PPO against RM with per-token KL to SFT. Local math/code/demo are toy witnesses for sigmoid preferences, KL probability shaping, shift invariance, and proxy-gap caveats.

Reviewer: codex+oracle; reviewed 2026-05-07
