Bring the mental model from Cross-Entropy; this page will reuse it instead of restarting from zero.
Alignment
Direct Preference Optimization
DPO turns pairwise preferences into binary cross-entropy on reference-relative log odds, using the KL-regularized RLHF optimum to make the policy itself an implicit reward model.

01
Intuition
Build the mental picture first so the rest of the page has something to attach to.
Suppose a prompt $x$ has two candidate completions:
- $y_w$: the preferred completion
- $y_l$: the rejected completion
RLHF usually trains a reward model $r(x, y)$, then runs an RL algorithm to make the policy choose higher-reward outputs while staying close to a reference model.
DPO asks a sharper question: can the policy itself act as the reward model?
Under the KL-regularized RLHF objective, the answer is yes, but only up to a prompt-only constant. If a policy $\pi_\theta$ has moved away from the reference model $\pi_{\mathrm{ref}}$, that movement gives a convenient representative of an implicit reward class:
$$\hat{r}_\theta(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}$$
The exact reward can differ from this representative by a term that depends only on $x$. Only reward differences matter for preference pairs, so that prompt-only term cancels, and DPO compares how much the current policy has changed the winner-vs-loser odds relative to the reference policy.
The core idea is:
A preferred response should beat a rejected response by more than the reference model already made it beat it.
That "more than the reference" phrase is the whole mechanism. DPO is not just "make the winner more likely." It is binary cross-entropy on reference-relative log odds.
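A small numeric sketch of the "more than the reference" idea. The probabilities and `beta = 0.5` are made-up illustration values, and the helper name `relative_margin` is hypothetical rather than taken from the page.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def relative_margin(pi_w, pi_l, ref_w, ref_l, beta):
    """beta * [(log pi_w - log pi_l) - (log ref_w - log ref_l)]."""
    policy_log_odds = math.log(pi_w) - math.log(pi_l)
    ref_log_odds = math.log(ref_w) - math.log(ref_l)
    return beta * (policy_log_odds - ref_log_odds)

beta = 0.5  # illustrative value only

# Case A: the policy prefers the winner, but exactly as much as the reference did.
m_same = relative_margin(0.6, 0.2, 0.6, 0.2, beta)
# Case B: the policy widened the winner-vs-loser odds beyond the reference.
m_wider = relative_margin(0.7, 0.1, 0.6, 0.2, beta)
# Case C: the winner is still more likely than the loser, yet the odds shrank
# relative to the reference, so the margin is negative.
m_narrower = relative_margin(0.5, 0.3, 0.6, 0.2, beta)

for name, m in [("same", m_same), ("wider", m_wider), ("narrower", m_narrower)]:
    print(f"{name:9s} margin={m:+.3f}  DPO preference prob={sigmoid(m):.3f}")
```

Case C is the one worth pausing on: "winner more likely than loser" is not enough to earn a positive margin.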
02
Math
Translate the story into symbols, assumptions, and a derivation you can inspect.
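The representative used in the source paper is the $\beta$-scaled policy/reference log-ratio:
$$\hat{r}_\theta(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}$$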
Setup and shapes
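For a prompt $x$, $\pi(y \mid x)$ is the full sequence probability of completion $y = (y_1, \dots, y_T)$. In an autoregressive language model,
$$\log \pi(y \mid x) = \sum_{t=1}^{T} \log \pi(y_t \mid x, y_{<t})$$
A preference datum is a triple $(x, y_w, y_l)$. All compared completions must have positive probability under the reference policy for the log ratio to be finite.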
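As a minimal sketch of the sequence-level quantity, the block below sums per-token log-probabilities and rejects zero-probability tokens, which would make the log ratio infinite. The token probabilities are hypothetical, and `sequence_log_prob` is an illustrative helper name, not an API from any library.

```python
import numpy as np

def sequence_log_prob(token_probs):
    """log pi(y | x) = sum_t log pi(y_t | x, y_<t) for one completion."""
    token_probs = np.asarray(token_probs, dtype=float)
    if np.any(token_probs <= 0.0):
        raise ValueError("every token needs positive probability for a finite log")
    return float(np.log(token_probs).sum())

# Hypothetical conditional probabilities of the three tokens of one completion.
tokens = [0.5, 0.4, 0.9]
log_p = sequence_log_prob(tokens)

# The sum of token log-probs equals the log of the product of token probs.
print("sequence log-prob:", round(log_p, 4))
print("log of product:   ", round(float(np.log(np.prod(tokens))), 4))
```

Longer completions accumulate more negative log-probability, which is one reason DPO compares each completion to the reference on the same completion rather than comparing raw log-probabilities.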
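Define the policy and reference log-odds:
$$\Delta_\theta = \log \pi_\theta(y_w \mid x) - \log \pi_\theta(y_l \mid x)$$
and
$$\Delta_{\mathrm{ref}} = \log \pi_{\mathrm{ref}}(y_w \mid x) - \log \pi_{\mathrm{ref}}(y_l \mid x)$$
The reference-relative margin is
$$m = \beta \, (\Delta_\theta - \Delta_{\mathrm{ref}})$$
DPO predicts the preference probability as
$$P_\theta(y_w \succ y_l \mid x) = \sigma(m)$$
For one hard winner/loser label, the DPO loss is
$$\mathcal{L}_{\mathrm{DPO}} = -\log \sigma(m)$$
Let $q \in (0, 1)$ be a soft target probability that $y_w$ is preferred over $y_l$. The browser demo below uses a soft target so the finite target log-odds is visible:
$$\mathcal{L}_q = -q \log \sigma(m) - (1 - q) \log \sigma(-m)$$
Ordinary hard-label DPO is the edge case $q = 1$, where an isolated two-response target diverges.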
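The loss can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the page's demo code: `dpo_loss` takes the beta-scaled reference-relative margin directly, `q` is the soft target probability (with `q = 1.0` the hard-label case), and the function names are hypothetical.

```python
import numpy as np

def log_sigmoid(z):
    # Numerically stable log(sigmoid(z)) = -log(1 + exp(-z)).
    return -np.logaddexp(0.0, -z)

def margin_from_log_probs(logp_w, logp_l, ref_logp_w, ref_logp_l, beta):
    """beta * [(policy winner-loser log-odds) - (reference winner-loser log-odds)]."""
    return beta * ((logp_w - logp_l) - (ref_logp_w - ref_logp_l))

def dpo_loss(margin, q=1.0):
    """Binary cross-entropy between target q and sigmoid(margin)."""
    return -(q * log_sigmoid(margin) + (1.0 - q) * log_sigmoid(-margin))

margins = np.linspace(-5.0, 5.0, 1001)
hard = dpo_loss(margins, q=1.0)   # keeps decreasing as the margin grows
soft = dpo_loss(margins, q=0.8)   # has a finite minimizer
soft_argmin = float(margins[np.argmin(soft)])

print("hard-label loss at margin 0:", round(float(dpo_loss(0.0)), 4))
print("soft-target minimizer on the grid:", round(soft_argmin, 2))
print("log(q / (1 - q)) for q = 0.8:", round(float(np.log(0.8 / 0.2)), 4))
```

The hard-label curve has no minimum on any finite grid, while the soft-target curve bottoms out near the target log-odds.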
From KL-regularized RLHF to DPO
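For a fixed prompt $x$, write $\mathrm{KL}_x(\pi)$ for the distribution-level KL from a candidate policy $\pi$ to the reference policy $\pi_{\mathrm{ref}}$ at that prompt. The direction is $\pi$ to $\pi_{\mathrm{ref}}$:
$$\mathrm{KL}_x(\pi) = \sum_y \pi(y \mid x) \log \frac{\pi(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}$$
Also write the expected reward as
$$R_x(\pi) = \sum_y \pi(y \mid x) \, r(x, y)$$
The KL-regularized RLHF objective is to maximize
$$J_x(\pi) = R_x(\pi) - \beta \, \mathrm{KL}_x(\pi)$$
Its optimizer has the reweighted form
$$\pi^*(y \mid x) = \frac{1}{Z(x)} \, \pi_{\mathrm{ref}}(y \mid x) \exp\!\big(r(x, y) / \beta\big), \qquad Z(x) = \sum_{y'} \pi_{\mathrm{ref}}(y' \mid x) \exp\!\big(r(x, y') / \beta\big)$$
Rearrange it:
$$r(x, y) = \beta \log \frac{\pi^*(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x)$$
The partition term $\beta \log Z(x)$ depends on the prompt, not on which completion won. Bradley-Terry preferences use reward differences, so the prompt-only term cancels between $y_w$ and $y_l$.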
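A quick numerical sanity check of the closed form, reusing the three-completion reference, reward, and beta from the Code section. It compares the reweighted policy (reference times exp(reward / beta), normalized by Z) against randomly sampled policies and confirms that the reward is recovered from the log-ratio plus the prompt-only term beta * log Z. The random search with a fixed seed is only a spot check, not a proof.

```python
import numpy as np

rng = np.random.default_rng(0)
ref = np.array([0.15, 0.70, 0.15])
reward = np.array([1.0, 0.2, -0.5])
beta = 0.5

def objective(pi):
    """Expected reward minus beta * KL(pi || ref) for one prompt."""
    kl = np.sum(pi * (np.log(pi) - np.log(ref)))
    return float(np.sum(pi * reward) - beta * kl)

weights = ref * np.exp(reward / beta)
Z = weights.sum()            # partition term, depends on the prompt only
pi_star = weights / Z

# Spot check: no randomly drawn policy should beat the closed-form optimum.
best_random = max(objective(rng.dirichlet(np.ones(3))) for _ in range(2000))

# Rearranged optimum: reward = beta * log(pi*/ref) + beta * log Z.
recovered = beta * (np.log(pi_star) - np.log(ref)) + beta * np.log(Z)

print("J(pi*):            ", round(objective(pi_star), 6))
print("beta * log Z:      ", round(float(beta * np.log(Z)), 6))
print("best random policy:", round(best_random, 6))
print("recovered reward:  ", np.round(recovered, 6))
```

The objective value at the optimum equals beta * log Z, which is a handy identity for checking an implementation.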
Bradley-Terry likelihood
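Bradley-Terry models a preference as
$$P(y_w \succ y_l \mid x) = \sigma\big(r(x, y_w) - r(x, y_l)\big)$$
Substituting the implicit reward gives the DPO probability. With the margin from above:
$$P_\theta(y_w \succ y_l \mid x) = \sigma\!\left(\beta \left[\log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right]\right)$$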
Maximum likelihood on preference labels gives the DPO loss. This is why DPO is a supervised objective even though it comes from a KL-regularized RLHF derivation.
What beta does
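Here $\beta$ follows the DPO paper's convention: it is the coefficient on the KL penalty in the KL-regularized RLHF objective.
- Larger $\beta$ means a stronger reference anchor in the underlying RLHF objective.
- Smaller $\beta$ means a given preference probability requires a larger policy/reference log-ratio gap.
- In the DPO loss, $\beta$ also scales the sigmoid margin, so it affects optimization dynamics. Do not read "larger $\beta$" as simply "trust preferences more."
For a soft preference probability $q$, matching the DPO probability to $q$ requires
$$\log \frac{\pi_\theta(y_w \mid x)}{\pi_\theta(y_l \mid x)} - \log \frac{\pi_{\mathrm{ref}}(y_w \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} = \frac{1}{\beta} \log \frac{q}{1 - q}$$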
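To make the role of beta concrete, the sketch below computes the reference-relative log-odds gap at which sigmoid(beta * gap) equals a soft target q, namely log(q / (1 - q)) / beta. The target q = 0.8 and the beta grid are arbitrary illustration values, and `required_log_ratio_gap` is a hypothetical helper name.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def required_log_ratio_gap(q, beta):
    """Reference-relative winner-vs-loser log-odds where sigmoid(beta * gap) == q."""
    return math.log(q / (1.0 - q)) / beta

q = 0.8  # illustrative soft preference probability
gaps = {}
for beta in (0.1, 0.5, 1.0):
    gaps[beta] = required_log_ratio_gap(q, beta)
    check = sigmoid(beta * gaps[beta])
    print(f"beta={beta:<4} required gap={gaps[beta]:7.3f}  sigmoid(beta*gap)={check:.3f}")
```

The same target preference demands a ten times larger move away from the reference at beta = 0.1 than at beta = 1.0, which is the second bullet above in numbers.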
Limits of the mechanism
DPO removes the explicit reward-model training stage and the RL loop. It does not remove the assumptions behind preference learning.
It still relies on a pairwise preference model such as Bradley-Terry. It only identifies rewards up to prompt-dependent constants. It needs a usable reference policy. And in an isolated two-response hard-label setting, the logistic loss can keep increasing the winner-over-loser log-ratio rather than settling at a finite target.
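The last limitation is easy to see in a toy run. The sketch below treats the reference-relative log-odds gap of a single winner/loser pair as the only free parameter and runs plain gradient descent on the cross-entropy loss. The step size, beta, and step counts are arbitrary choices for illustration; real DPO training updates model weights, not this scalar.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_gap(q, beta=0.5, lr=1.0, checkpoints=(100, 1000, 10000)):
    """Gradient descent on BCE(q, sigmoid(beta * gap)) starting at the reference."""
    gap = 0.0  # policy equals reference, so the relative log-odds start at zero
    snapshots = {}
    for step in range(1, max(checkpoints) + 1):
        grad = beta * (sigmoid(beta * gap) - q)  # d loss / d gap
        gap -= lr * grad
        if step in checkpoints:
            snapshots[step] = gap
    return snapshots

hard_run = train_gap(q=1.0)   # hard label: no finite minimizer
soft_run = train_gap(q=0.8)   # soft label: settles at log(q/(1-q)) / beta

for step in (100, 1000, 10000):
    print(f"step {step:>5}: hard gap={hard_run[step]:7.3f}  soft gap={soft_run[step]:7.3f}")
print("soft target gap:", round(math.log(0.8 / 0.2) / 0.5, 3))
```

With the hard label the gap grows slowly but without bound, while the soft-label run stops at its finite target.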
03
Code
Keep the implementation aligned with the notation so the algorithm is legible.
This finite-action witness verifies the derivation directly in a clean realizable toy setting. A synthetic reward defines the KL-regularized target policy. DPO sees full-support, exact soft Bradley-Terry preference probabilities and recovers the same unconstrained categorical policy through reference-relative log-ratio cross-entropy. Sampled hard labels, missing pairs, or a restricted neural parameterization would not make the equality exact in finite data.
import numpy as np


def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def kl(p, q):
    return float(np.sum(p * (np.log(p) - np.log(q))))


# One prompt, three possible completions.
# Shapes: ref, reward, logits, pi are all (K,).
ref = np.array([0.15, 0.70, 0.15], dtype=float)
reward = np.array([1.0, 0.2, -0.5], dtype=float)
beta = 0.5
log_ref = np.log(ref)

# KL-regularized RLHF optimum:
# pi*(y|x) is proportional to pi_ref(y|x) exp(r(x,y)/beta).
pi_star = ref * np.exp(reward / beta)
pi_star = pi_star / pi_star.sum()

# DPO sees pairwise preferences. We use exact Bradley-Terry
# probabilities instead of sampled hard labels so the equality is visible.
pairs = [(0, 1), (0, 2), (1, 2)]

# Initialize policy at the reference.
logits = log_ref.copy()
lr = 0.2

for _ in range(3000):
    grad = np.zeros_like(logits)
    for i, j in pairs:
        rel_log_odds = (logits[i] - logits[j]) - (log_ref[i] - log_ref[j])
        pred = sigmoid(beta * rel_log_odds)
        target = sigmoid(reward[i] - reward[j])
        # Binary cross-entropy gradient for the soft preference target.
        g = beta * (pred - target)
        grad[i] += g
        grad[j] -= g
    logits -= lr * grad / len(pairs)
    logits -= logits.mean()  # logits are identifiable only up to a constant

pi = softmax(logits)

print("reference policy:   ", np.round(ref, 3))
print("KL-reg optimum:     ", np.round(pi_star, 3))
print("DPO learned policy: ", np.round(pi, 3))
print("KL(pi || ref):      ", round(kl(pi, ref), 4))

# The learned policy matches the KL-regularized optimum.
assert np.allclose(pi, pi_star, atol=1e-6)

# Implicit reward gaps match the true reward gaps.
for i, j in pairs:
    implicit_gap = beta * (
        (np.log(pi[i]) - np.log(ref[i]))
        - (np.log(pi[j]) - np.log(ref[j]))
    )
    reward_gap = reward[i] - reward[j]
    assert abs(implicit_gap - reward_gap) < 1e-6

# DPO preference probabilities match the Bradley-Terry probabilities.
for i, j in pairs:
    dpo_pref = sigmoid(beta * (
        (np.log(pi[i]) - np.log(ref[i]))
        - (np.log(pi[j]) - np.log(ref[j]))
    ))
    bt_pref = sigmoid(reward[i] - reward[j])
    assert abs(dpo_pref - bt_pref) < 1e-6
04
Interactive Demo
Use direct manipulation to connect the explanation to a moving system.
Use the demo as a ratio machine. It collapses the world to two completions, $y_w$ and $y_l$, so its KL is the binary KL over this toy pair, not the full language-model KL over all completions. Change the reference winner probability, the soft target preference $q$, and $\beta$. Watch how the required policy log-odds moves relative to the reference, how the DPO probability changes, and when the two-action KL from the reference spikes.
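For readers without the interactive stage, here is a rough offline stand-in. It is an assumption about what such a two-completion demo computes, not the page's actual demo code: given a reference winner probability, a soft target q, and beta, it shifts the reference log-odds by log(q / (1 - q)) / beta and reports the resulting policy winner probability and the binary KL from the reference. The name `demo_state` is hypothetical.

```python
import math

def logit(p):
    return math.log(p / (1.0 - p))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def kl_term(a, b):
    # a * log(a / b), with the usual convention that the term is 0 when a == 0.
    return 0.0 if a == 0.0 else a * math.log(a / b)

def demo_state(ref_winner_prob, q, beta):
    """Policy log-odds, policy winner probability, and binary KL(pi || ref)."""
    policy_log_odds = logit(ref_winner_prob) + logit(q) / beta
    pi_w = sigmoid(policy_log_odds)
    kl = kl_term(pi_w, ref_winner_prob) + kl_term(1.0 - pi_w, 1.0 - ref_winner_prob)
    return policy_log_odds, pi_w, kl

states = {beta: demo_state(ref_winner_prob=0.5, q=0.8, beta=beta)
          for beta in (1.0, 0.5, 0.1)}
for beta, (log_odds, pi_w, kl) in states.items():
    print(f"beta={beta:<4} policy log-odds={log_odds:7.3f}  pi(winner)={pi_w:.4f}  KL={kl:.4f}")
```

Shrinking beta with the target held fixed pushes the policy toward putting almost all mass on the winner, and the two-action KL climbs toward its ceiling of -log(reference winner probability).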
Source Grounding
Canonical references for the mechanism on this page.
rafailov-2023-dpo: Direct Preference Optimization: Your Language Model is Secretly a Reward Model. Grounds DPO as a closed-form preference objective derived from KL-regularized RLHF.
Claim Review
Claim: DPO turns pairwise preferences into binary cross-entropy on reference-relative log odds, using the KL-regularized RLHF optimum to make the policy itself an implicit reward model.
Source support: Rafailov et al. derive the KL-regularized optimum pi_r proportional to pi_ref exp(r/beta), rearrange it as r = beta log(pi_r/pi_ref) + beta log Z(x), then substitute into Bradley-Terry pairwise preferences so Z(x) cancels, yielding the DPO hard-label logistic/BCE loss on beta times the policy/reference winner-loser log-ratio difference. The page's math, code, and demo instantiate this margin as teaching witnesses.
Scope of the review: checks the pairwise Bradley-Terry DPO derivation plus finite-action/two-completion intuition only; not finite-sample convergence, noisy hard labels, neural realizability, ranking variants, empirical superiority to PPO/RLHF, reward-hacking prevention, or broader alignment guarantees.
Reviewer: codex+oracle; reviewed 2026-05-07