Bring the mental model from Maximum Likelihood; this page will reuse it instead of restarting from zero.
Alignment
RLHF: Reward Modeling + KL-Regularized Policy Optimization
RLHF trains a reward model from pairwise preferences, then reweights a reference policy toward high learned reward while a KL penalty limits distribution shift.

01
Intuition
Build the mental picture first so the rest of the page has something to attach to.
Canonical sources: Christiano et al., "Deep Reinforcement Learning from Human Preferences", and Ouyang et al., "Training language models to follow instructions with human feedback".
A pretrained or supervised-finetuned language model gives a distribution over completions. In InstructGPT-style RLHF, the reference policy is often the supervised-finetuned model trained from demonstrations. RLHF asks:
When humans prefer one completion over another, how should that preference move probability mass?
The mechanism has two stages.
First, train a reward model from comparisons. It does not learn an absolute moral score; it learns differences that make preferred completions more likely under a pairwise preference model.
Second, optimize a policy against that learned reward while penalizing movement away from a reference model. In the finite-action picture:
RLHF multiplies the reference probability of each completion by an exponential reward bonus, then renormalizes.
High reward pulls probability upward. The KL penalty controls how far the new policy may drift. If the reward model is a proxy with exploitable errors, optimizing too aggressively can move probability mass toward outputs that score well under the proxy but are worse under the real target.
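The finite-action picture can be tried on three completions. The numbers below are hypothetical and chosen only for illustration; the sketch multiplies each reference probability by an exponential reward bonus and renormalizes, as described above.

```python
import numpy as np

# Hypothetical values, not taken from any trained model.
pi_ref = np.array([0.5, 0.3, 0.2])   # reference probabilities of three completions
reward = np.array([0.0, 1.0, 2.0])   # learned reward of each completion
beta = 1.0                           # strength of the KL penalty

weights = pi_ref * np.exp(reward / beta)  # exponential reward bonus
pi_new = weights / weights.sum()          # renormalize to a distribution

print("reference:", np.round(pi_ref, 3))
print("reweighted:", np.round(pi_new, 3))
```

The highest-reward completion gains probability and the lowest-reward one loses it, even though the reference ranked them the other way.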
02
Math
Translate the story into symbols, assumptions, and a derivation you can inspect.
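The two moving pieces are the preference-trained reward gap and the KL-shaped policy update:
$$P(y_w \succ y_l \mid x) = \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big), \qquad \pi^*(y \mid x) \propto \pi_{\mathrm{ref}}(y \mid x)\, \exp\big(r(x, y)/\beta\big).$$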
The rest of this section unpacks those two witnesses.
Preference data and reward-model likelihood
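Let a preference datum be $(x, y_w, y_l)$, where $y_w$ is preferred to $y_l$ for prompt $x$. Define the reward gap
$$\Delta_\phi(x, y_w, y_l) = r_\phi(x, y_w) - r_\phi(x, y_l).$$
A reward model predicts
$$P_\phi(y_w \succ y_l \mid x) = \sigma\big(\Delta_\phi(x, y_w, y_l)\big).$$
The negative log-likelihood is
$$\mathcal{L}(\phi) = -\sum_{(x, y_w, y_l) \in \mathcal{D}} \log \sigma\big(\Delta_\phi(x, y_w, y_l)\big),$$
where
$$\sigma(z) = \frac{1}{1 + e^{-z}}.$$
Only reward differences inside the same prompt are observed. Therefore $r_\phi(x, y) + c(x)$ gives the same pairwise probabilities as $r_\phi(x, y)$ for any prompt-dependent constant $c(x)$.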
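A quick numerical check of the shift invariance, as a sketch with hypothetical rewards for one prompt. The helper name `pairwise_nll` is an illustration, not part of any library.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pairwise_nll(rewards, pairs):
    """Negative log-likelihood of (winner, loser) index pairs for one prompt."""
    gaps = rewards[pairs[:, 0]] - rewards[pairs[:, 1]]
    return -np.sum(np.log(sigmoid(gaps)))

rewards = np.array([1.2, -0.4, 0.3])        # hypothetical rewards, three completions
pairs = np.array([[0, 1], [0, 2], [2, 1]])  # each row is (winner, loser)

nll = pairwise_nll(rewards, pairs)
nll_shifted = pairwise_nll(rewards + 5.0, pairs)  # add a prompt-level constant

print(nll, nll_shifted)
```

The two values agree because the constant cancels inside every reward gap.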
Reward scale also needs a convention. With a fixed logistic noise model, reward gaps are measured in that model's log-odds units. With perfectly separable hard preferences, an unregularized reward model can drive gaps toward infinity. Practical systems therefore normalize, regularize, or otherwise choose a usable reward scale before policy optimization.
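The scale problem shows up already with a single perfectly separable pair. The sketch below runs plain gradient descent on the reward gap itself, with and without an L2 penalty. The step size, step counts, and penalty weight are arbitrary assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_gap(l2, steps, lr=0.5):
    """Gradient descent on -log sigmoid(gap) + 0.5 * l2 * gap**2 for one pair."""
    gap = 0.0
    for _ in range(steps):
        grad = (sigmoid(gap) - 1.0) + l2 * gap
        gap -= lr * grad
    return gap

gap_short = fit_gap(l2=0.0, steps=100)     # unregularized, few steps
gap_long = fit_gap(l2=0.0, steps=10000)    # unregularized, many steps
gap_reg = fit_gap(l2=0.1, steps=10000)     # regularized

print(gap_short, gap_long, gap_reg)
```

Without the penalty the gap keeps growing as training continues; with it the gap settles at a finite value, which is one way to fix a usable reward scale.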
KL-regularized policy optimization
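For a fixed prompt $x$, reference policy $\pi_{\mathrm{ref}}(\cdot \mid x)$, and candidate policy $\pi(\cdot \mid x)$, define expected reward
$$J_r(\pi) = \sum_y \pi(y \mid x)\, r(x, y)$$
and the KL term
$$\mathrm{KL}(\pi \,\|\, \pi_{\mathrm{ref}}) = \sum_y \pi(y \mid x) \log \frac{\pi(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}.$$
The RLHF objective is
$$\max_{\pi} \; J_r(\pi) - \beta\, \mathrm{KL}(\pi \,\|\, \pi_{\mathrm{ref}}).$$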
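Equivalently, if
$$F(\pi) = \sum_y \pi(y \mid x) \left[ r(x, y) - \beta \log \frac{\pi(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} \right]$$
then
$$\pi^* = \arg\max_{\pi} F(\pi).$$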
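Assume $\pi_{\mathrm{ref}}(y \mid x) > 0$ on the candidate support. Optimize over the finite distribution $\pi(\cdot \mid x)$ subject to $\sum_y \pi(y \mid x) = 1$. With multiplier $\lambda$ on the normalization constraint, the Lagrange stationarity condition is
$$r(x, y) - \beta \left( \log \frac{\pi(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + 1 \right) - \lambda = 0.$$
Solving for $\pi(y \mid x)$ and normalizing gives
$$\pi^*(y \mid x) = \frac{\pi_{\mathrm{ref}}(y \mid x)\, \exp\big(r(x, y)/\beta\big)}{Z_\beta(x)}.$$
Here
$$Z_\beta(x) = \sum_{y'} \pi_{\mathrm{ref}}(y' \mid x)\, \exp\big(r(x, y')/\beta\big).$$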
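One way to sanity-check the closed form is to compare its objective value against many random candidate policies. This is a sketch with hypothetical rewards; the random search is only a check, not an optimization method anyone would use in practice.

```python
import numpy as np

rng = np.random.default_rng(0)

pi_ref = np.array([0.35, 0.30, 0.20, 0.15])
reward = np.array([1.0, -0.5, 0.2, 0.0])   # hypothetical rewards
beta = 0.7

def objective(pi):
    """Expected reward minus beta * KL(pi || pi_ref)."""
    return np.sum(pi * reward) - beta * np.sum(pi * np.log(pi / pi_ref))

# Closed form: reference times exponential reward bonus, normalized by Z.
weights = pi_ref * np.exp(reward / beta)
Z = weights.sum()
pi_star = weights / Z

# Best objective value among random points on the probability simplex.
best_random = max(objective(rng.dirichlet(np.ones(4))) for _ in range(2000))

print("objective at closed form:", objective(pi_star))
print("best random candidate:   ", best_random)
print("beta * log Z:            ", beta * np.log(Z))
```

No random candidate beats the closed-form policy, and the optimal value equals beta times the log of the normalizer, which follows from substituting the closed form back into the objective.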
Larger $\beta$ keeps the policy closer to the reference. Smaller $\beta$ lets learned reward dominate.
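The effect of the penalty strength can be seen by sweeping it and measuring the KL divergence from the reference. The rewards and the grid of values below are hypothetical.

```python
import numpy as np

pi_ref = np.array([0.35, 0.30, 0.20, 0.15])
reward = np.array([1.0, -0.5, 0.2, 0.0])   # hypothetical rewards

def policy(beta):
    """KL-regularized policy: softmax of log pi_ref + reward / beta."""
    logits = np.log(pi_ref) + reward / beta
    logits -= logits.max()                 # numerical stability
    p = np.exp(logits)
    return p / p.sum()

def kl_to_ref(pi):
    return float(np.sum(pi * (np.log(pi) - np.log(pi_ref))))

betas = [0.1, 0.5, 1.0, 5.0, 50.0]
kls = [kl_to_ref(policy(b)) for b in betas]

for b, k in zip(betas, kls):
    print(f"beta={b:>5}: KL={k:.5f}  policy={np.round(policy(b), 3)}")
```

As the penalty strength grows, the KL shrinks toward zero and the policy approaches the reference; as it shrinks, the policy concentrates on the highest-reward completion.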
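Rearranging gives the bridge to DPO:
$$r(x, y) = \beta \log \frac{\pi^*(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z_\beta(x).$$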
The final term depends only on the prompt, so it cancels in preference differences.
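The cancellation can be checked numerically: the scaled log-ratio of the optimal policy to the reference recovers every reward gap, up to one shared offset. A sketch with hypothetical rewards:

```python
import numpy as np

pi_ref = np.array([0.35, 0.30, 0.20, 0.15])
reward = np.array([1.0, -0.5, 0.2, 0.0])   # hypothetical rewards
beta = 0.7

weights = pi_ref * np.exp(reward / beta)
pi_star = weights / weights.sum()

# Implied reward from the policy ratio; differs from reward by a constant.
implied = beta * np.log(pi_star / pi_ref)
offset = reward - implied                  # same value in every entry

gap_true = reward[0] - reward[1]
gap_implied = implied[0] - implied[1]

print("offset per completion:", np.round(offset, 6))
print("reward gap:", gap_true, "implied gap:", gap_implied)
```

Because the offset is identical across completions, a pairwise preference probability can be written with policy log-ratios alone.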
PPO is the optimizer, not the definition
In language-model RLHF, the candidate space is enormous. InstructGPT-style RLHF uses PPO to optimize the learned reward with a KL penalty to the supervised-finetuned or reference model. The clean finite-action formula above is the exact optimum of a finite-action regularized objective. PPO is a stochastic optimizer for the language-model version; it is not this closed-form update, and a parametric policy with per-token KL details need not trace the toy optimum exactly.
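To make the distinction concrete, the sketch below optimizes a softmax-parameterized policy by plain gradient ascent on the same finite-action objective and compares the result with the closed form. This is exact-gradient ascent on a toy problem, not PPO: there is no sampling, clipping, or value function. The learning rate and step count are assumptions.

```python
import numpy as np

def softmax(logits):
    logits = logits - logits.max()
    e = np.exp(logits)
    return e / e.sum()

pi_ref = np.array([0.35, 0.30, 0.20, 0.15])
reward = np.array([1.0, -0.5, 0.2, 0.0])   # hypothetical rewards
beta = 0.7

# Closed-form optimum for comparison.
pi_star = softmax(np.log(pi_ref) + reward / beta)

# Parametric policy pi = softmax(theta), initialized at the reference.
theta = np.log(pi_ref).copy()
lr = 0.5
for _ in range(5000):
    pi = softmax(theta)
    # Per-completion value of reward minus the KL log-ratio penalty.
    adv = reward - beta * np.log(pi / pi_ref)
    # Exact gradient of the objective with respect to the logits.
    theta += lr * pi * (adv - np.sum(pi * adv))

pi_gd = softmax(theta)
print("closed form:    ", np.round(pi_star, 4))
print("gradient ascent:", np.round(pi_gd, 4))
```

In this tiny setting the iterative optimizer lands on the closed form. With a neural policy, sampled completions, and per-token KL terms, no such guarantee carries over.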
Reward hacking
If the learned proxy $r_\phi$ differs from the real target $r^\star$, then low $\beta$ can concentrate the policy on outputs with high proxy reward and low true utility. KL regularization reduces this pressure but does not make the proxy correct.
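A toy illustration of that pressure, assuming a hypothetical proxy that overrates one completion the true target dislikes. All numbers are invented for the example.

```python
import numpy as np

pi_ref = np.array([0.4, 0.3, 0.2, 0.1])
r_true = np.array([1.0, 0.5, 0.0, -2.0])    # hypothetical real target
r_proxy = np.array([1.0, 0.5, 0.0, 3.0])    # proxy overrates the last completion

def policy(reward, beta):
    """KL-regularized policy for a given reward vector and penalty strength."""
    logits = np.log(pi_ref) + reward / beta
    logits -= logits.max()
    p = np.exp(logits)
    return p / p.sum()

betas = [10.0, 1.0, 0.2]
proxy_scores = []
true_scores = []
for beta in betas:
    pi = policy(r_proxy, beta)              # optimize against the proxy only
    proxy_scores.append(float(pi @ r_proxy))
    true_scores.append(float(pi @ r_true))
    print(f"beta={beta:>4}: proxy={proxy_scores[-1]:.3f}  true={true_scores[-1]:.3f}")
```

As the penalty weakens, expected proxy reward rises while expected true utility falls, because probability mass flows to the completion where the two rewards disagree.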
03
Code
Keep the implementation aligned with the notation so the algorithm is legible.
This witness keeps the pieces visible: pairwise reward-model fitting, reward-shift invariance, and the KL-regularized policy that reweights a reference distribution.
```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def softmax(logits):
    logits = logits - logits.max()  # subtract the max for numerical stability
    e = np.exp(logits)
    return e / e.sum()


# One prompt, four candidate completions.
# Shape: rewards, pi_ref, pi_star are all (K,).
# Each row of pairs is (winner, loser).
pairs = np.array([
    [0, 1],
    [0, 2],
    [2, 1],
    [3, 1],
    [0, 3],
])
K = 4
r = np.zeros(K)
lr = 0.3
l2 = 0.05

# Fit rewards by gradient descent on the L2-regularized pairwise NLL.
for _ in range(800):
    grad = l2 * r
    for winner, loser in pairs:
        margin = r[winner] - r[loser]
        p = sigmoid(margin)
        g = p - 1.0  # derivative of -log(sigmoid(margin)) w.r.t. margin
        grad[winner] += g
        grad[loser] -= g
    r -= lr * grad / len(pairs)

r -= r.mean()  # choose one representative of the shift-equivalence class

pi_ref = np.array([0.35, 0.30, 0.20, 0.15])
beta = 0.7


def kl_regularized_policy(reward):
    logits = np.log(pi_ref) + reward / beta
    return softmax(logits)


pi_star = kl_regularized_policy(r)
pi_shifted = kl_regularized_policy(r + 10.0)
kl = np.sum(pi_star * (np.log(pi_star) - np.log(pi_ref)))

print("learned reward representative:", np.round(r, 3))
print("reference policy:             ", np.round(pi_ref, 3))
print("KL-regularized policy:        ", np.round(pi_star, 3))
print("KL(pi* || pi_ref):            ", round(float(kl), 4))

# Shifting every reward by a constant leaves the policy unchanged.
assert np.allclose(pi_star, pi_shifted)
```
Adding a constant to every reward changes neither pairwise preference probabilities nor the KL-regularized policy. The policy depends on reward differences and on how strongly $\beta$ anchors it to the reference.
04
Interactive Demo
Use direct manipulation to connect the explanation to a moving system.
Use the demo as a probability-shaping machine. Change $\beta$, toggle a proxy reward gap, and add a reward shift. Watch the reference policy get reweighted, and notice that shifting every reward leaves the policy unchanged.
Source Grounding
Canonical references for the mechanism on this page.
Christiano et al. (2017), "Deep reinforcement learning from human preferences" (christiano-2017-human-preferences): grounds preference-labeled reward models as a way to train behavior from comparative human feedback.
Ouyang et al. (2022), "Training language models to follow instructions with human feedback" (ouyang-2022-instructgpt): grounds the modern instruction-following RLHF pipeline: demonstrations, preference rankings, reward model, and PPO.
Claim Review
Support: Christiano trains a reward predictor from trajectory-segment comparisons and optimizes a policy on predicted reward. Ouyang describes demonstrations -> SFT, rankings -> reward model, and PPO against RM reward with per-token KL to the SFT policy. The local math, code, and demo are toy witnesses that instantiate reward gaps, sigmoid preferences, pi_ref*exp(r/beta), KL, shift invariance, and proxy-gap warnings.
Scope: The review covers only preference-modeling and KL-regularized optimization mechanics. It does not certify reward as the true human objective, PPO's exact attainment of the finite-action optimum, PPO-ptx/pretraining-gradient details, reward-hacking prevention, or broad alignment guarantees.
Reviewer: codex+oracle; reviewed 2026-05-07