Alignment

Direct Preference Optimization

DPO turns pairwise preferences into binary cross-entropy on reference-relative log odds, using the KL-regularized RLHF optimum to make the policy itself an implicit reward model.

status: published · importance: important · difficulty: 4/5 · math: undergraduate · read: 18m · live demo


Intuition

Build the mental picture first so the rest of the page has something to attach to.


Suppose a prompt $x$ has two candidate completions:

$y_w$: the preferred completion
$y_\ell$: the rejected completion

RLHF usually trains a reward model $r_\phi(x,y)$, then runs an RL algorithm to make the policy choose higher-reward outputs while staying close to a reference model.

DPO asks a sharper question: can the policy itself act as the reward model?

Under the KL-regularized RLHF objective, the answer is yes, but only up to a prompt-only constant. If a policy has moved away from the reference model, that movement gives a convenient representative of an implicit reward class:

\hat r_\theta(x,y) = \beta\log \frac{\pi_\theta(y\mid x)} {\pi_{\mathrm{ref}}(y\mid x)}.

The exact reward can differ from this representative by a term that depends only on $x$. Only reward differences matter for preference pairs, so that prompt-only term cancels, and DPO compares how much the current policy has changed the winner-vs-loser odds relative to the reference policy.

The core idea is:

A preferred response should beat a rejected response by a wider margin under the policy than it already does under the reference model.

That "more than the reference" phrase is the whole mechanism. DPO is not just "make the winner more likely." It is binary cross-entropy on reference-relative log odds.
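A small numeric sketch of the "more than the reference" idea. The log-probabilities below are made up for illustration, and the helper name `reference_relative_margin` is hypothetical rather than something the page defines.

```python
import math

def reference_relative_margin(logp_w, logp_l, ref_logp_w, ref_logp_l):
    """Policy winner-vs-loser log-odds minus the reference's log-odds."""
    return (logp_w - logp_l) - (ref_logp_w - ref_logp_l)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

beta = 0.5  # illustrative value

# Hypothetical sequence log-probabilities for one prompt.
# The policy prefers the winner (log-odds +1.0), but the reference
# already preferred it by more (log-odds +2.0).
margin = reference_relative_margin(-10.0, -11.0, -9.0, -11.0)
pref_prob = sigmoid(beta * margin)

print("reference-relative margin:", margin)
print("DPO preference probability:", round(pref_prob, 4))
```

The margin is negative here, so the DPO preference probability falls below one half even though the policy assigns the winner a higher probability than the loser.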

Math

Translate the story into symbols, assumptions, and a derivation you can inspect.


The source-level representative is the beta-scaled policy/reference log-ratio:

\hat r_\theta(x,y) = \beta\log\frac{\pi_\theta(y\mid x)}{\pi_{\mathrm{ref}}(y\mid x)}.

Setup and shapes

For a prompt $x$, $\pi_\theta(y\mid x)$ is the full sequence probability of completion $y$. In an autoregressive language model,

\log \pi_\theta(y\mid x) = \sum_t \log \pi_\theta(y_t\mid x,y_{<t}).
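As a sketch of this sum, the snippet below computes a sequence log-probability from per-position logits. It assumes the logits are already conditioned on the prompt and the preceding completion tokens, as a real model's forward pass would provide. The function name `sequence_logprob` and the toy numbers are illustrative.

```python
import numpy as np

def sequence_logprob(token_logits, token_ids):
    """Sum of per-token log-probabilities for one completion.

    token_logits: (T, V) next-token logits at each completion position.
    token_ids:    (T,) the tokens that actually appear in the completion.
    """
    token_logits = np.asarray(token_logits, dtype=float)
    token_ids = np.asarray(token_ids)
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = token_logits - token_logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Pick the log-probability of the observed token at each position.
    picked = log_probs[np.arange(len(token_ids)), token_ids]
    return float(picked.sum())

# Toy example: 2 completion tokens, vocabulary of size 3.
toy_logits = np.array([[2.0, 0.0, 0.0],
                       [0.0, 1.0, 0.0]])
toy_tokens = np.array([0, 1])
seq_lp = sequence_logprob(toy_logits, toy_tokens)
print("log pi(y | x):", round(seq_lp, 4))
```

Exponentiating `seq_lp` gives the product of the two per-token probabilities, which is the full sequence probability.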

A preference datum is $(x,y_w,y_\ell)$. All compared completions must have positive probability under the reference policy for the log ratio to be finite.

Define the policy and reference log-odds:

a_\theta = \log\pi_\theta(y_w\mid x) - \log\pi_\theta(y_\ell\mid x),

and

a_{\mathrm{ref}} = \log\pi_{\mathrm{ref}}(y_w\mid x) - \log\pi_{\mathrm{ref}}(y_\ell\mid x).

The reference-relative margin is

m_\theta=a_\theta-a_{\mathrm{ref}}.

DPO predicts the preference probability as

P_\theta(y_w\succ y_\ell\mid x) = \sigma(\beta m_\theta).

For one hard winner/loser label, the DPO loss is

\mathcal L_{\mathrm{DPO}} = -\log\sigma(\beta m_\theta).
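A minimal sketch of this loss on a batch of pairs, assuming the four sequence log-probabilities have already been computed. The function name `dpo_loss` and the example numbers are illustrative. `np.logaddexp` is used because $-\log\sigma(z)=\log(1+e^{-z})$.

```python
import numpy as np

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta):
    """Per-pair hard-label DPO loss, -log sigmoid(beta * m_theta)."""
    a_theta = policy_logp_w - policy_logp_l   # policy log-odds
    a_ref = ref_logp_w - ref_logp_l           # reference log-odds
    m = a_theta - a_ref                       # reference-relative margin
    # -log sigmoid(z) = log(1 + exp(-z)), computed stably.
    return np.logaddexp(0.0, -beta * m)

# Two hypothetical pairs: the first has margin -1, the second margin 0.
policy_w = np.array([-10.0, -8.0])
policy_l = np.array([-11.0, -8.0])
ref_w = np.array([-9.0, -8.0])
ref_l = np.array([-11.0, -8.0])

losses = dpo_loss(policy_w, policy_l, ref_w, ref_l, beta=0.5)
print("per-pair losses:", np.round(losses, 4))
```

A policy equal to the reference has zero margin on every pair, so its loss is $\log 2$ regardless of how strongly the reference itself prefers the winner.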

Let $s_\theta=\sigma(\beta m_\theta)$. The browser demo below uses a soft target $p^*$ so the finite target log-odds is visible:

\mathcal L_{\mathrm{soft}} = -p^*\log s_\theta -(1-p^*)\log(1-s_\theta).

Ordinary hard-label DPO is the edge case $p^*\to 1$, where the loss-minimizing margin $\operatorname{logit}(p^*)/\beta$ for an isolated two-response pair diverges to $+\infty$.
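The following sketch evaluates the soft loss on a grid of margins to show the contrast. The grid, the values of $\beta$ and $p^*$, and the function name `soft_dpo_loss` are illustrative choices.

```python
import numpy as np

def soft_dpo_loss(m, p_star, beta):
    """Soft-label binary cross-entropy on the scaled margin beta * m."""
    z = beta * m
    # -log sigmoid(z) = logaddexp(0, -z); -log(1 - sigmoid(z)) = logaddexp(0, z)
    return p_star * np.logaddexp(0.0, -z) + (1.0 - p_star) * np.logaddexp(0.0, z)

beta = 0.5
p_star = 0.8
margins = np.linspace(-10.0, 10.0, 2001)  # grid step of 0.01

# Soft target: the loss has a finite minimizer on the grid.
soft_losses = soft_dpo_loss(margins, p_star, beta)
m_best = margins[np.argmin(soft_losses)]
m_closed_form = np.log(p_star / (1.0 - p_star)) / beta

# Hard label (p* = 1): the loss keeps decreasing as the margin grows.
hard_losses = soft_dpo_loss(margins, 1.0, beta)

print("grid minimizer:       ", round(float(m_best), 2))
print("logit(p*) / beta:     ", round(float(m_closed_form), 4))
print("hard loss is monotone:", bool(np.all(np.diff(hard_losses) < 0)))
```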

From KL-regularized RLHF to DPO

For a fixed prompt $x$, write $K_x(\pi)$ for the distribution-level KL from a candidate policy to the reference policy at that prompt. The direction is $\pi(\cdot\mid x)$ to $\pi_{\mathrm{ref}}(\cdot\mid x)$:

K_x(\pi) = \mathrm{KL}_x(\pi\|\pi_{\mathrm{ref}}).

Also write the expected reward as

R_x(\pi) = \mathbb E_{y\sim\pi(\cdot\mid x)}[r(x,y)].

The KL-regularized RLHF objective is to maximize

J_x(\pi) = R_x(\pi)-\beta K_x(\pi).

Its optimizer has the reweighted form

\pi_r(y\mid x) = \frac{\pi_{\mathrm{ref}}(y\mid x)\exp(r(x,y)/\beta)} {Z(x)}.

Rearrange it:

r(x,y) = \beta\log \frac{\pi_r(y\mid x)} {\pi_{\mathrm{ref}}(y\mid x)} + \beta\log Z(x).

The partition term $Z(x)$ depends on the prompt, not on which completion won. Bradley-Terry preferences use reward differences, so the prompt-only term cancels between $y_w$ and $y_\ell$.
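A numeric sanity check of the closed form on a three-completion toy problem, reusing the reference, reward, and $\beta$ values from the page's code witness. The random comparison policies and the helper name `objective` are assumptions made for this sketch. It is a spot check, not a proof of optimality.

```python
import numpy as np

rng = np.random.default_rng(0)

ref = np.array([0.15, 0.70, 0.15])
reward = np.array([1.0, 0.2, -0.5])
beta = 0.5

def objective(pi):
    """J_x(pi) = E_pi[r] - beta * KL(pi || ref) for a single prompt."""
    return float(pi @ reward - beta * np.sum(pi * (np.log(pi) - np.log(ref))))

# Closed-form optimizer: reweight the reference by exp(r / beta).
weights = ref * np.exp(reward / beta)
Z = weights.sum()
pi_r = weights / Z

# Random full-support candidate policies (softmax of Gaussian noise).
noise = rng.normal(size=(1000, 3))
candidates = np.exp(noise) / np.exp(noise).sum(axis=1, keepdims=True)
best_random = max(objective(p) for p in candidates)

# Rearranged form: r = beta * log(pi_r / ref) + beta * log Z.
recovered_reward = beta * np.log(pi_r / ref) + beta * np.log(Z)

print("J at closed-form optimum:", round(objective(pi_r), 6))
print("best J among random:     ", round(best_random, 6))
print("recovered reward:        ", np.round(recovered_reward, 6))
```

At the optimizer the objective equals $\beta\log Z(x)$, which follows from substituting the closed form into $J_x$.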

Bradley-Terry likelihood

Bradley-Terry models a preference as

P(y_w\succ y_\ell\mid x) = \sigma(r(x,y_w)-r(x,y_\ell)).

Substituting the implicit reward gives the DPO probability. With the margin $m_\theta=a_\theta-a_{\mathrm{ref}}$ from above:

P_\theta(y_w\succ y_\ell\mid x) =\sigma(\beta m_\theta).

Maximum likelihood on preference labels gives the DPO loss. This is why DPO is a supervised objective even though it comes from a KL-regularized RLHF derivation.
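The substitution step can be written out in the page's notation. This is a worked sketch in which the trained policy $\pi_\theta$ plays the role of $\pi_r$. Adding $\beta\log Z(x)$ to both rewards would leave the difference unchanged.

```latex
\begin{aligned}
\hat r_\theta(x,y_w)-\hat r_\theta(x,y_\ell)
&= \beta\log\frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)}
 - \beta\log\frac{\pi_\theta(y_\ell\mid x)}{\pi_{\mathrm{ref}}(y_\ell\mid x)} \\
&= \beta\Big[\big(\log\pi_\theta(y_w\mid x)-\log\pi_\theta(y_\ell\mid x)\big)
 - \big(\log\pi_{\mathrm{ref}}(y_w\mid x)-\log\pi_{\mathrm{ref}}(y_\ell\mid x)\big)\Big] \\
&= \beta\,(a_\theta-a_{\mathrm{ref}}) \\
&= \beta\, m_\theta .
\end{aligned}
```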

What beta does

Here $\beta$ follows the DPO paper's convention: it is the coefficient on the KL penalty in the KL-regularized RLHF objective.

Larger $\beta$ means a stronger reference anchor in the underlying RLHF objective.
Smaller $\beta$ means a given preference probability requires a larger policy/reference log-ratio gap.
In the DPO loss, $\beta$ also scales the sigmoid margin, so it affects optimization dynamics. Do not read "larger $\beta$" as simply "trust preferences more."

For a soft preference probability $p^*$, matching $p^*$ requires

m^*=\frac{\operatorname{logit}(p^*)}{\beta}.
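A short sketch of how the required margin scales with $\beta$ at a fixed target. The target $p^*=0.8$, the $\beta$ values, and the helper name `target_margin` are illustrative.

```python
import math

def target_margin(p_star, beta):
    """Margin m* at which sigmoid(beta * m*) equals p_star."""
    return math.log(p_star / (1.0 - p_star)) / beta

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

p_star = 0.8
required = {beta: target_margin(p_star, beta) for beta in (0.1, 0.5, 1.0)}

for beta, m_star in required.items():
    # Plugging m* back in recovers the target preference probability.
    print(f"beta={beta}: m*={m_star:.3f}, sigmoid(beta*m*)={sigmoid(beta * m_star):.3f}")
```

Halving $\beta$ doubles the log-ratio gap the policy must open up relative to the reference to express the same preference probability.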

Limits of the mechanism

DPO removes the explicit reward-model training stage and the RL loop. It does not remove the assumptions behind preference learning.

It still relies on a pairwise preference model such as Bradley-Terry. It only identifies rewards up to prompt-dependent constants. It needs a usable reference policy. And in an isolated two-response hard-label setting, the logistic loss can keep increasing the winner-over-loser log-ratio rather than settling at a finite target.
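The last limitation can be seen with plain gradient descent on a single margin. This is a sketch under simplifying assumptions: one isolated pair, the margin treated as a free scalar parameter, and an arbitrary learning rate and step count. The function name `run_descent` is hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def run_descent(p_star, beta=0.5, lr=1.0, steps=5000):
    """Gradient descent on the soft-label loss with respect to the margin m."""
    m = 0.0
    history = []
    for _ in range(steps):
        # d/dm of the soft BCE loss is beta * (sigmoid(beta * m) - p_star).
        grad = beta * (sigmoid(beta * m) - p_star)
        m -= lr * grad
        history.append(m)
    return history

hard = run_descent(p_star=1.0)   # hard label: gradient never reaches zero
soft = run_descent(p_star=0.8)   # soft label: settles at logit(p*) / beta

print("hard-label margin after 2500 steps:", round(hard[2499], 3))
print("hard-label margin after 5000 steps:", round(hard[-1], 3))
print("soft-label margin after 5000 steps:", round(soft[-1], 3))
```

With the hard label the margin is still growing at the end of the run, although more slowly, while the soft label converges to a finite value.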

Code

Keep the implementation aligned with the notation so the algorithm is legible.


This finite-action witness verifies the derivation directly in a clean realizable toy setting. A synthetic reward defines the KL-regularized target policy. DPO sees full-support, exact soft Bradley-Terry preference probabilities and recovers the same unconstrained categorical policy through reference-relative log-ratio cross-entropy. Sampled hard labels, missing pairs, or a restricted neural parameterization would not make the equality exact in finite data.

import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def kl(p, q):
    return float(np.sum(p * (np.log(p) - np.log(q))))

# One prompt, three possible completions.
# Shapes: ref, reward, logits, pi are all (K,).
ref = np.array([0.15, 0.70, 0.15], dtype=float)
reward = np.array([1.0, 0.2, -0.5], dtype=float)
beta = 0.5
log_ref = np.log(ref)

# KL-regularized RLHF optimum:
# pi*(y|x) is proportional to pi_ref(y|x) exp(r(x,y)/beta).
pi_star = ref * np.exp(reward / beta)
pi_star = pi_star / pi_star.sum()

# DPO sees pairwise preferences. We use exact Bradley-Terry
# probabilities instead of sampled hard labels so the equality is visible.
pairs = [(0, 1), (0, 2), (1, 2)]

# Initialize policy at the reference.
logits = log_ref.copy()
lr = 0.2

for _ in range(3000):
    grad = np.zeros_like(logits)

    for i, j in pairs:
        rel_log_odds = (logits[i] - logits[j]) - (log_ref[i] - log_ref[j])
        pred = sigmoid(beta * rel_log_odds)

        target = sigmoid(reward[i] - reward[j])

        # Binary cross-entropy gradient for the soft preference target.
        g = beta * (pred - target)
        grad[i] += g
        grad[j] -= g

    logits -= lr * grad / len(pairs)
    logits -= logits.mean()  # logits are identifiable only up to a constant

pi = softmax(logits)

print("reference policy:      ", np.round(ref, 3))
print("KL-reg optimum:        ", np.round(pi_star, 3))
print("DPO learned policy:    ", np.round(pi, 3))
print("KL(pi || ref):         ", round(kl(pi, ref), 4))

assert np.allclose(pi, pi_star, atol=1e-6)

for i, j in pairs:
    implicit_gap = beta * (
        (np.log(pi[i]) - np.log(ref[i]))
        - (np.log(pi[j]) - np.log(ref[j]))
    )
    reward_gap = reward[i] - reward[j]
    assert abs(implicit_gap - reward_gap) < 1e-6

for i, j in pairs:
    dpo_pref = sigmoid(beta * (
        (np.log(pi[i]) - np.log(ref[i]))
        - (np.log(pi[j]) - np.log(ref[j]))
    ))
    bt_pref = sigmoid(reward[i] - reward[j])
    assert abs(dpo_pref - bt_pref) < 1e-6

Interactive Demo

Use direct manipulation to connect the explanation to a moving system.


Use the demo as a ratio machine. It collapses the world to two completions, $y_w$ and $y_\ell$, so its KL is the binary KL over this toy pair, not the full language-model KL over all completions. Change the reference winner probability, the soft target preference, and $\beta$. Watch how the required policy log-odds moves relative to the reference, how the DPO probability changes, and when the two-action KL from the reference spikes.
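For readers without the live demo, the two-completion quantities can be sketched offline. This is not the page's demo code. It is one plausible reading of the controls described above, and the function name `demo_state` and its dictionary keys are hypothetical.

```python
import math

def logit(p):
    return math.log(p / (1.0 - p))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def binary_kl(p, q):
    """KL between two-outcome distributions (p, 1-p) and (q, 1-q)."""
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))

def demo_state(ref_winner_prob, p_star, beta):
    """Quantities for a world with only two completions, y_w and y_l."""
    a_ref = logit(ref_winner_prob)        # reference log-odds
    m_star = logit(p_star) / beta         # required reference-relative margin
    a_policy = a_ref + m_star             # required policy log-odds
    policy_winner_prob = sigmoid(a_policy)
    return {
        "required_margin": m_star,
        "policy_log_odds": a_policy,
        "policy_winner_prob": policy_winner_prob,
        "kl_from_reference": binary_kl(policy_winner_prob, ref_winner_prob),
    }

loose = demo_state(ref_winner_prob=0.5, p_star=0.8, beta=1.0)
tight = demo_state(ref_winner_prob=0.5, p_star=0.8, beta=0.25)
print("beta=1.0 :", {k: round(v, 4) for k, v in loose.items()})
print("beta=0.25:", {k: round(v, 4) for k, v in tight.items()})
```

With the same target preference, the smaller $\beta$ demands a larger margin and pushes the two-action policy further from the reference, which shows up as a larger binary KL.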


Source Grounding

Canonical references for the mechanism on this page.

paper · 2023 · Direct Preference Optimization: Your Language Model is Secretly a Reward Model · Rafailov et al.

Grounds DPO as a closed-form preference objective derived from KL-regularized RLHF.


Claim Review

Status: 1 substantive review recorded. Source: rafailov-2023-dpo.

Reviewed claim (source checked): DPO reparameterizes the KL-regularized RLHF reward as beta times the policy/reference log-ratio, up to a prompt-only constant, then fits preference pairs with binary cross-entropy on the beta-scaled winner-loser reference-relative log-odds margin.

Rafailov et al. derive the KL-regularized optimum pi_r proportional to pi_ref exp(r/beta), rearrange it as r=beta log(pi_r/pi_ref)+beta log Z(x), then substitute into Bradley-Terry pairwise preferences so Z(x) cancels, yielding the DPO hard-label logistic/BCE loss on beta times the policy/reference winner-loser log-ratio difference. The page's math, code, and demo instantiate this margin as teaching witnesses.

Sources: Direct Preference Optimization: Your Language Model is Secretly a Reward Model.

Checks pairwise Bradley-Terry DPO derivation plus finite-action/two-completion intuition only; not finite-sample convergence, noisy hard labels, neural realizability, ranking variants, empirical superiority to PPO/RLHF, reward-hacking prevention, or broader alignment guarantees. A bounded review summary is present; still check caveats and exact source scope.


Reviewer: codex+oracle; reviewed 2026-05-07


Domain

Alignment

alignment, dpo, preferences, bradley-terry, log-odds, kl-regularization

Prerequisites

Cross-Entropy
KL Divergence (Relative Entropy)
RLHF: Reward Modeling + KL-Regularized Policy Optimization

Leads To

Kahneman-Tversky Optimization

Reward Hacking: Overoptimizing Preference Proxies
