Alignment

RLHF: Reward Modeling + KL-Regularized Policy Optimization

RLHF trains a reward model from pairwise preferences, then reweights a reference policy toward high learned reward while a KL penalty limits distribution shift.

status: published | importance: important | difficulty: 4/5 | math: undergraduate | read: 20m | live demo


Editorial alignment illustration of human preference feedback shaping a reward landscape and updating a policy distribution.




Intuition

Build the mental picture first so the rest of the page has something to attach to.


Canonical sources: Christiano et al., "Deep Reinforcement Learning from Human Preferences", and Ouyang et al., "Training language models to follow instructions with human feedback".

A pretrained or supervised-finetuned language model gives a distribution over completions. In InstructGPT-style RLHF, the reference policy is often the supervised-finetuned model trained from demonstrations. RLHF asks:

When humans prefer one completion over another, how should that preference move probability mass?

The mechanism has two stages.

First, train a reward model from comparisons. It does not learn an absolute moral score; it learns differences that make preferred completions more likely under a pairwise preference model.

Second, optimize a policy against that learned reward while penalizing movement away from a reference model. In the finite-action picture:

RLHF multiplies the reference probability of each completion by an exponential reward bonus, then renormalizes.

High reward pulls probability upward. The KL penalty controls how far the new policy may drift. If the reward model is a proxy with exploitable errors, optimizing too aggressively can move probability mass toward outputs that score well under the proxy but are worse under the real target.

Math

Translate the story into symbols, assumptions, and a derivation you can inspect.


The two moving pieces are the preference-trained reward gap and the KL-shaped policy update:

\Delta_\phi(x,y_w,y_\ell) = r_\phi(x,y_w)-r_\phi(x,y_\ell), \qquad P_\phi(y_w\succ y_\ell\mid x)=\sigma(\Delta_\phi).

J_x(\pi)= \sum_y \pi(y\mid x)r_\phi(x,y) - \beta\mathrm{KL}(\pi(\cdot\mid x)\|\pi_{\mathrm{ref}}(\cdot\mid x)), \qquad \pi^*(y\mid x)= \frac{\pi_{\mathrm{ref}}(y\mid x)\exp(r_\phi(x,y)/\beta)} {\sum_{y'}\pi_{\mathrm{ref}}(y'\mid x)\exp(r_\phi(x,y')/\beta)}.

The rest of this section unpacks those two witnesses.

Preference data and reward-model likelihood

Let a preference datum be $(x,y_w,y_\ell)$, where $y_w$ is preferred to $y_\ell$ for prompt $x$. Define the reward gap

\Delta = r_\phi(x,y_w)-r_\phi(x,y_\ell).

A reward model $r_\phi(x,y)\in\mathbb R$ predicts

P_\phi(y_w\succ y_\ell\mid x)=\sigma(\Delta).

The negative log-likelihood is

\mathcal L_{\mathrm{RM}}(\phi) = -\frac{1}{N} \sum_{i=1}^N \log\sigma(\Delta_i),

where

\Delta_i = r_\phi(x_i,y_{w,i}) - r_\phi(x_i,y_{\ell,i}).

Only reward differences inside the same prompt are observed. Therefore, for any prompt-dependent offset $c(x)$, the shifted reward $r_\phi(x,y)+c(x)$ gives the same pairwise probabilities as $r_\phi(x,y)$.
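As a small check of the likelihood and the shift invariance just stated, the sketch below evaluates the pairwise negative log-likelihood on made-up rewards and then again after adding a constant. The reward values, the pair list, and the helper names `log_sigmoid` and `reward_model_nll` are illustrative assumptions, not part of the sources.

```python
import numpy as np

def log_sigmoid(z):
    # Numerically stable log(sigmoid(z)) = -log(1 + exp(-z)).
    return -np.logaddexp(0.0, -z)

def reward_model_nll(rewards, pairs):
    """Mean negative log-likelihood of (winner, loser) index pairs."""
    winners, losers = pairs[:, 0], pairs[:, 1]
    gaps = rewards[winners] - rewards[losers]
    return -np.mean(log_sigmoid(gaps))

# Hypothetical rewards for four completions of a single prompt.
rewards = np.array([1.2, -0.8, 0.3, -0.1])
# Each row is (winner index, loser index).
pairs = np.array([[0, 1], [0, 2], [2, 1], [3, 1], [0, 3]])

nll = reward_model_nll(rewards, pairs)
nll_shifted = reward_model_nll(rewards + 5.0, pairs)  # c(x) = 5 for this prompt

print("NLL:            ", nll)
print("NLL after shift:", nll_shifted)
```

The two printed values agree up to floating-point error because the loss only ever sees gaps between completions of the same prompt.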

Reward scale also needs a convention. With a fixed logistic noise model, reward gaps are measured in that model's log-odds units. With perfectly separable hard preferences, an unregularized reward model can drive gaps toward infinity. Practical systems therefore normalize, regularize, or otherwise choose a usable reward scale before policy optimization.
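The scale issue can be seen in the smallest possible case: one pair in which the same completion always wins. The sketch below runs plain gradient descent on the gap itself, once without regularization and once with an L2 penalty. The learning rate, the penalty strength, and the helper `fit_gap` are assumptions chosen only for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_gap(l2, steps, lr=0.5):
    """Gradient descent on -log(sigmoid(gap)) + 0.5 * l2 * gap**2.

    Models a single, perfectly separable preference pair.
    """
    gap = 0.0
    for _ in range(steps):
        grad = (sigmoid(gap) - 1.0) + l2 * gap
        gap -= lr * grad
    return gap

for steps in (100, 1_000, 10_000):
    unregularized = fit_gap(0.0, steps)
    regularized = fit_gap(0.05, steps)
    print(f"steps={steps:>6}  gap(l2=0)={unregularized:.3f}  gap(l2=0.05)={regularized:.3f}")
```

Without the penalty the gap keeps growing with training length (slowly, but without bound); with the penalty it settles at a finite value that then fixes the reward scale seen by the policy step.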

KL-regularized policy optimization

For a fixed prompt $x$, reference policy $\pi_{\mathrm{ref}}$, and candidate policy $\pi$, define expected reward

R_x(\pi) = \sum_y \pi(y\mid x)r_\phi(x,y)

and the KL term

K_x(\pi)=\mathrm{KL}(\pi(\cdot\mid x)\|\pi_{\mathrm{ref}}(\cdot\mid x)).

The RLHF objective is

J_x(\pi)=R_x(\pi)-\beta K_x(\pi).

Equivalently, if

q_y = \log \frac{\pi(y\mid x)} {\pi_{\mathrm{ref}}(y\mid x)},

then

J_x(\pi) = \sum_y \pi(y\mid x)\{r_\phi(x,y)-\beta q_y\}.

Assume $\pi_{\mathrm{ref}}(y\mid x)>0$ on the candidate support. Optimize $J_x$ over the finite distribution $\pi(\cdot\mid x)$ subject to $\sum_y \pi(y\mid x)=1$. With a Lagrange multiplier $\lambda$ for the normalization constraint, the stationarity condition in each $\pi(y\mid x)$ is

r_\phi(x,y)-\beta(q_y+1)+\lambda=0.

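Solving for $q_y$ gives $q_y=(r_\phi(x,y)+\lambda)/\beta-1$, so $\pi(y\mid x)\propto\pi_{\mathrm{ref}}(y\mid x)\exp(r_\phi(x,y)/\beta)$. Because $J_x$ is strictly concave in $\pi$ for $\beta>0$, this stationary point is the unique maximizer. Normalizing gives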

\pi^*(y\mid x) = \frac{ \pi_{\mathrm{ref}}(y\mid x)\exp(r_\phi(x,y)/\beta) }{ Z(x) }.

Here

Z(x) = \sum_{y'} \pi_{\mathrm{ref}}(y'\mid x) \exp(r_\phi(x,y')/\beta).
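A quick numerical sanity check of the closed form, under made-up rewards and a made-up reference policy: the sketch below compares $J_x(\pi^*)$ with $J_x$ at many random distributions, and also checks the identity $J_x(\pi^*)=\beta\log Z(x)$ that follows from substituting $\pi^*$ into the objective. All numbers and the helper `objective` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

r = np.array([1.0, -0.5, 0.3, 0.0])          # hypothetical rewards r_phi(x, y)
pi_ref = np.array([0.35, 0.30, 0.20, 0.15])  # hypothetical reference policy
beta = 0.7

def objective(pi):
    """J_x(pi) = sum_y pi(y) * (r(y) - beta * log(pi(y) / pi_ref(y)))."""
    return np.sum(pi * (r - beta * (np.log(pi) - np.log(pi_ref))))

# Closed-form optimum: reweight the reference and renormalize.
weights = pi_ref * np.exp(r / beta)
Z = weights.sum()
pi_star = weights / Z

j_star = objective(pi_star)
# Dirichlet samples are strictly positive distributions on the same support.
j_random = np.array([objective(p) for p in rng.dirichlet(np.ones(4), size=1000)])

print("J(pi*):          ", j_star)
print("beta * log Z:    ", beta * np.log(Z))
print("best random J:   ", j_random.max())
```

No random candidate beats the closed form, which is what strict concavity of the objective predicts.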

Larger $\beta$ keeps the policy closer to the reference. Smaller $\beta$ lets learned reward dominate.
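The role of $\beta$ can be read off a sweep. The sketch below, again with illustrative rewards and reference probabilities, computes the optimal policy at several values of $\beta$ and reports its KL divergence from the reference; `optimal_policy` and `kl_to_ref` are hypothetical helper names.

```python
import numpy as np

r = np.array([1.0, -0.5, 0.3, 0.0])          # hypothetical rewards
pi_ref = np.array([0.35, 0.30, 0.20, 0.15])  # hypothetical reference policy

def optimal_policy(beta):
    """pi*(y) proportional to pi_ref(y) * exp(r(y) / beta), computed in log space."""
    logits = np.log(pi_ref) + r / beta
    logits -= logits.max()  # stabilize the exponentials
    w = np.exp(logits)
    return w / w.sum()

def kl_to_ref(pi):
    return np.sum(pi * (np.log(pi) - np.log(pi_ref)))

betas = [0.1, 0.5, 1.0, 5.0, 50.0]
kls = [kl_to_ref(optimal_policy(b)) for b in betas]

for b, k in zip(betas, kls):
    print(f"beta={b:>5}: pi*={np.round(optimal_policy(b), 3)}  KL={k:.4f}")
```

As $\beta$ grows the KL shrinks toward zero and the policy approaches the reference; as $\beta$ shrinks the policy concentrates on the highest-reward completion.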

Rearranging gives the bridge to DPO:

r_\phi(x,y) = \beta\log \frac{\pi^*(y\mid x)} {\pi_{\mathrm{ref}}(y\mid x)} + \beta\log Z(x).

The final term depends only on the prompt, so it cancels in preference differences.
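The cancellation can be checked directly on a toy instance. The sketch below builds $\pi^*$ from assumed rewards, forms the implicit reward $\beta\log(\pi^*/\pi_{\mathrm{ref}})$, and compares reward gaps. The numbers are illustrative assumptions; this is only the finite-action identity, not a DPO training loop.

```python
import numpy as np

r = np.array([1.0, -0.5, 0.3, 0.0])          # hypothetical rewards
pi_ref = np.array([0.35, 0.30, 0.20, 0.15])  # hypothetical reference policy
beta = 0.7

weights = pi_ref * np.exp(r / beta)
Z = weights.sum()
pi_star = weights / Z

# Implicit reward recovered from the policy ratio alone.
implicit = beta * (np.log(pi_star) - np.log(pi_ref))

# The mismatch with the true reward is the same constant for every completion.
offset = r - implicit

gap_reward = r[0] - r[1]
gap_implicit = implicit[0] - implicit[1]

print("offset per completion:", np.round(offset, 6))
print("beta * log Z:         ", round(float(beta * np.log(Z)), 6))
print("reward gap vs implicit gap:", gap_reward, gap_implicit)
```

Because the offset is constant across completions, any pairwise preference probability computed from `implicit` matches the one computed from `r`.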

PPO is the optimizer, not the definition

In language-model RLHF, the candidate space is enormous. InstructGPT-style RLHF uses PPO to optimize the learned reward with a KL penalty to the supervised-finetuned or reference model. The clean finite-action formula above is the exact optimum of a finite-action regularized objective. PPO is a stochastic optimizer for the language-model version; it is not this closed-form update, and a parametric policy with per-token KL details need not trace the toy optimum exactly.
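To make the distinction between objective and optimizer concrete, the sketch below reaches the same toy optimum iteratively, using exact gradient ascent on softmax logits. This is deliberately not PPO: there is no sampling, clipping, or value function, and the step size and iteration count are arbitrary assumptions. It only illustrates that an iterative optimizer of $J_x$ and the closed form agree in the finite-action case.

```python
import numpy as np

r = np.array([1.0, -0.5, 0.3, 0.0])          # hypothetical rewards
pi_ref = np.array([0.35, 0.30, 0.20, 0.15])  # hypothetical reference policy
beta = 0.7

def softmax(logits):
    logits = logits - logits.max()
    e = np.exp(logits)
    return e / e.sum()

# Closed-form target for comparison.
pi_star = softmax(np.log(pi_ref) + r / beta)

theta = np.log(pi_ref)  # start the parametric policy at the reference
lr = 1.0
for _ in range(5000):
    pi = softmax(theta)
    # KL-adjusted reward per completion: r(y) - beta * log(pi(y) / pi_ref(y)).
    a = r - beta * (np.log(pi) - np.log(pi_ref))
    # Exact gradient of J_x with respect to the softmax logits.
    grad = pi * (a - np.sum(pi * a))
    theta += lr * grad

pi_gd = softmax(theta)
print("closed form:    ", np.round(pi_star, 4))
print("gradient ascent:", np.round(pi_gd, 4))
```

In a language model the sum over completions is intractable, so the exact gradient above is replaced by sampled estimates, which is where methods such as PPO come in.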

Reward hacking

If $r_\phi(x,y)$ differs from the real target $u(x,y)$, then low $\beta$ can concentrate the policy on outputs with high proxy reward and low true utility. KL regularization reduces this pressure but does not make the proxy correct.
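A toy version of this failure, with invented numbers: the proxy reward below overrates one completion that the true utility rates as bad. The sketch sweeps $\beta$ and reports expected proxy reward and expected true utility under the optimal policy for the proxy. The arrays `proxy` and `true_u` and the helper `optimal_policy` are assumptions made up for this illustration.

```python
import numpy as np

pi_ref = np.array([0.35, 0.30, 0.20, 0.15])
proxy = np.array([1.0, 0.0, 0.5, 2.0])    # hypothetical r_phi: overrates completion 3
true_u = np.array([1.0, 0.0, 0.5, -1.0])  # hypothetical u: completion 3 is actually bad

def optimal_policy(reward, beta):
    """KL-regularized optimum for the given reward, computed in log space."""
    logits = np.log(pi_ref) + reward / beta
    logits -= logits.max()
    w = np.exp(logits)
    return w / w.sum()

results = {}
for beta in (10.0, 1.0, 0.3, 0.1):
    pi = optimal_policy(proxy, beta)
    results[beta] = (float(pi @ proxy), float(pi @ true_u))
    print(f"beta={beta:>4}: E[proxy]={results[beta][0]:.3f}  "
          f"E[true]={results[beta][1]:.3f}  pi={np.round(pi, 3)}")
```

Lowering $\beta$ raises expected proxy reward while expected true utility falls, because probability mass moves onto the completion where the proxy is wrong.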

Code

Keep the implementation aligned with the notation so the algorithm is legible.


This witness keeps the pieces visible: pairwise reward-model fitting, reward-shift invariance, and the KL-regularized policy that reweights a reference distribution.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(logits):
    logits = logits - logits.max()
    e = np.exp(logits)
    return e / e.sum()

# One prompt, four candidate completions.
# Each row of pairs is (winner, loser): indices of the preferred and rejected completion.
# Shape: rewards, pi_ref, pi_star are all (K,).
pairs = np.array([
    [0, 1],
    [0, 2],
    [2, 1],
    [3, 1],
    [0, 3],
])

K = 4
r = np.zeros(K)
lr = 0.3
l2 = 0.05

for _ in range(800):
    grad = l2 * r

    for winner, loser in pairs:
        margin = r[winner] - r[loser]
        p = sigmoid(margin)
        g = p - 1.0  # derivative of -log(sigmoid(margin)) with respect to margin
        grad[winner] += g
        grad[loser] -= g

    r -= lr * grad / len(pairs)
    r -= r.mean()  # choose one representative of the shift-equivalence class

pi_ref = np.array([0.35, 0.30, 0.20, 0.15])
beta = 0.7

def kl_regularized_policy(reward):
    logits = np.log(pi_ref) + reward / beta
    return softmax(logits)

pi_star = kl_regularized_policy(r)
pi_shifted = kl_regularized_policy(r + 10.0)

kl = np.sum(pi_star * (np.log(pi_star) - np.log(pi_ref)))

print("learned reward representative:", np.round(r, 3))
print("reference policy:             ", np.round(pi_ref, 3))
print("KL-regularized policy:        ", np.round(pi_star, 3))
print("KL(pi* || pi_ref):            ", round(float(kl), 4))

assert np.allclose(pi_star, pi_shifted)

Adding a constant to every reward changes neither pairwise preference probabilities nor the KL-regularized policy. The policy depends on reward differences and on how strongly $\beta$ anchors it to the reference.

Interactive Demo

Use direct manipulation to connect the explanation to a moving system.


Use the demo as a probability-shaping machine. Change $\beta$, toggle a proxy reward gap, and add a reward shift. Watch the reference policy get reweighted, and notice that shifting every reward leaves the policy unchanged.


Source Grounding

Canonical references for the mechanism on this page.

Christiano et al. (2017). Deep reinforcement learning from human preferences. Paper.

Grounds preference-labeled reward models as a way to train behavior from comparative human feedback.

Ouyang et al. (2022). Training language models to follow instructions with human feedback. Paper.

Grounds the modern instruction-following RLHF pipeline: demonstrations, preference rankings, reward model, and PPO.

Claim Review


Status: 1 substantive review recorded

Claims without a substantive review badge still need exact source-support review.

Sources: 2 references (christiano-2017-human-preferences, ouyang-2022-instructgpt)

Witnesses: 4 local objects

Use equation, code, and demo objects to check whether the source support is operational.

Substantively reviewed: Preference-based RLHF learns a reward model from human comparisons and optimizes a policy against it; InstructGPT-style RLHF adds SFT and PPO with a KL penalty to the SFT/reference policy. Claim metadata: source checked

Christiano trains a reward predictor from trajectory-segment comparisons and optimizes a policy on predicted reward. Ouyang describes demonstrations -> SFT, rankings -> reward model, and PPO against RM reward with per-token KL to the SFT policy. Local math/code/demo instantiate reward gaps, sigmoid preferences, pi_ref*exp(r/beta), KL, shift invariance, and proxy-gap warnings.

Sources: Deep reinforcement learning from human preferences; Training language models to follow instructions with human feedback.

Scope: Reviews only preference-modeling and KL-regularized optimization mechanics. It does not certify reward as true human objective, PPO exact attainment of the finite-action optimum, PPO-ptx/pretraining-gradient details, reward-hacking prevention, or broad alignment guarantees.

A bounded review summary is present; still check caveats and exact source scope.

Christiano supports learning a reward predictor from pairwise trajectory preferences and optimizing a policy on predicted reward. Ouyang supports InstructGPT demonstrations -> SFT, rankings -> RM, and PPO against RM with per-token KL to SFT. Local math/code/demo are toy witnesses for sigmoid preferences, KL probability shaping, shift invariance, and proxy-gap caveats.

Reviewer: codex+oracle; reviewed 2026-05-07





Domain

Alignment

alignment, rlhf, reward-modeling, preferences, bradley-terry, kl-regularization

Prerequisites

Maximum Likelihood; Cross-Entropy; KL Divergence (Relative Entropy)

Leads To

Direct Preference Optimization; Reward Hacking: Overoptimizing Preference Proxies; Process Reward Models: Step-Level Verifiers for Reasoning

Kahneman-Tversky Optimization

Within this domain