Alignment

Direct Preference Optimization

DPO turns pairwise preferences into binary cross-entropy on reference-relative log odds, using the KL-regularized RLHF optimum to make the policy itself an implicit reward model.

01

Intuition

Build the mental picture first so the rest of the page has something to attach to.

Suppose a prompt $x$ has two candidate completions:

  • $y_w$: the preferred completion
  • $y_\ell$: the rejected completion

RLHF usually trains a reward model $r_\phi(x,y)$, then runs an RL algorithm to make the policy choose higher-reward outputs while staying close to a reference model.

DPO asks a sharper question: can the policy itself act as the reward model?

Under the KL-regularized RLHF objective, the answer is yes, but only up to a prompt-only constant. If a policy has moved away from the reference model, that movement gives a convenient representative of an implicit reward class:

$$\hat r_\theta(x,y) = \beta\log \frac{\pi_\theta(y\mid x)}{\pi_{\mathrm{ref}}(y\mid x)}.$$

The exact reward can differ from this representative by a term that depends only on $x$. Only reward differences matter for preference pairs, so that prompt-only term cancels, and DPO compares how much the current policy has changed the winner-vs-loser odds relative to the reference policy.

The core idea is:

The policy should favor the preferred response over the rejected one by a wider margin than the reference model already does.

That "more than the reference" phrase is the whole mechanism. DPO is not just "make the winner more likely." It is binary cross-entropy on reference-relative log odds.
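A minimal numeric sketch of that comparison, assuming made-up sequence log-probabilities for a single prompt (the numbers, the value of `beta`, and the helper names are illustrative, not from the source). It computes the implicit rewards and checks that adding any prompt-only constant to both of them leaves the preference probability unchanged.

```python
import math

beta = 0.5  # assumed KL coefficient, chosen only for illustration

# Hypothetical sequence log-probabilities log pi(y | x) for one prompt.
logp_policy = {"winner": -12.0, "loser": -15.0}
logp_ref = {"winner": -13.0, "loser": -14.0}

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def implicit_reward(name, prompt_constant=0.0):
    # beta * log(pi_theta / pi_ref), plus an optional prompt-only offset.
    return beta * (logp_policy[name] - logp_ref[name]) + prompt_constant

def preference_probability(prompt_constant=0.0):
    # Bradley-Terry probability from the difference of implicit rewards.
    gap = (implicit_reward("winner", prompt_constant)
           - implicit_reward("loser", prompt_constant))
    return sigmoid(gap)

policy_log_odds = logp_policy["winner"] - logp_policy["loser"]
ref_log_odds = logp_ref["winner"] - logp_ref["loser"]
margin = policy_log_odds - ref_log_odds  # reference-relative margin

p_plain = preference_probability()
p_shifted = preference_probability(prompt_constant=7.3)
print("margin:", margin)
print("preference probability:", p_plain, "with offset:", p_shifted)
```

The policy prefers the winner by 3 nats and the reference by 1 nat, so only the extra 2 nats enter the sigmoid, scaled by `beta`.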

02

Math

Translate the story into symbols, assumptions, and a derivation you can inspect.

The source-level representative is the beta-scaled policy/reference log-ratio:

$$\hat r_\theta(x,y) = \beta\log\frac{\pi_\theta(y\mid x)}{\pi_{\mathrm{ref}}(y\mid x)}.$$

Setup and shapes

For a prompt $x$, $\pi_\theta(y\mid x)$ is the full sequence probability of completion $y$. In an autoregressive language model,

$$\log \pi_\theta(y\mid x) = \sum_t \log \pi_\theta(y_t\mid x,y_{<t}).$$
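The sum over tokens can be sketched as follows. This is an illustration under stated assumptions, not the page's implementation: the logits are random stand-ins for a model's output, row `t` of `logits` is assumed to already be aligned with `tokens[t]`, and the mask that separates prompt tokens from completion tokens is hypothetical.

```python
import numpy as np

def log_softmax(logits):
    # Numerically stable log-softmax over the last (vocabulary) axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def sequence_logprob(logits, tokens, completion_mask):
    """Sum log pi(y_t | x, y_<t) over completion positions only.

    logits:          (T, V) next-token logits, row t scores tokens[t]
    tokens:          (T,)   token ids that were actually produced
    completion_mask: (T,)   1.0 for completion tokens, 0.0 for prompt/padding
    """
    logp = log_softmax(logits)                          # (T, V)
    token_logp = logp[np.arange(len(tokens)), tokens]   # (T,)
    return float((token_logp * completion_mask).sum())

rng = np.random.default_rng(0)
T, V = 5, 7                                  # hypothetical length and vocabulary size
logits = rng.normal(size=(T, V))             # stand-in for model outputs
tokens = np.array([3, 1, 4, 1, 5])
mask = np.array([0.0, 0.0, 1.0, 1.0, 1.0])   # first two positions are prompt

seq_logp = sequence_logprob(logits, tokens, mask)
print("log pi(y | x) =", seq_logp)
```

Masking out prompt positions matters because DPO compares completions given the same prompt; only completion tokens contribute to $\log\pi_\theta(y\mid x)$.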

A preference datum is $(x,y_w,y_\ell)$. All compared completions must have positive probability under the reference policy for the log ratio to be finite.

Define the policy and reference log-odds:

$$a_\theta = \log\pi_\theta(y_w\mid x) - \log\pi_\theta(y_\ell\mid x),$$

and

$$a_{\mathrm{ref}} = \log\pi_{\mathrm{ref}}(y_w\mid x) - \log\pi_{\mathrm{ref}}(y_\ell\mid x).$$

The reference-relative margin is

$$m_\theta=a_\theta-a_{\mathrm{ref}}.$$

DPO predicts the preference probability as

$$P_\theta(y_w\succ y_\ell\mid x) = \sigma(\beta m_\theta).$$

For one hard winner/loser label, the DPO loss is

$$\mathcal L_{\mathrm{DPO}} = -\log\sigma(\beta m_\theta).$$
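A small sketch of this loss over a batch of pairs, assuming the four sequence log-probabilities per pair have already been computed. The function name `dpo_loss` and the sample numbers are hypothetical; the only numerical choice is to evaluate $-\log\sigma(z)$ as `logaddexp(0, -z)` to avoid overflow.

```python
import numpy as np

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta):
    """Mean hard-label DPO loss over a batch of preference pairs.

    Each array argument has shape (B,) and holds sequence log-probabilities.
    """
    a_theta = policy_logp_w - policy_logp_l   # policy log-odds
    a_ref = ref_logp_w - ref_logp_l           # reference log-odds
    margin = a_theta - a_ref                  # m_theta
    # -log sigmoid(z) = log(1 + exp(-z)), computed stably with logaddexp.
    return float(np.mean(np.logaddexp(0.0, -beta * margin)))

# Hypothetical log-probabilities for two preference pairs.
policy_w = np.array([-12.0, -20.0])
policy_l = np.array([-15.0, -19.0])
ref_w = np.array([-13.0, -20.0])
ref_l = np.array([-14.0, -20.0])

loss = dpo_loss(policy_w, policy_l, ref_w, ref_l, beta=0.5)
loss_at_ref = dpo_loss(ref_w, ref_l, ref_w, ref_l, beta=0.5)
print("loss:", loss)
print("loss when policy equals reference:", loss_at_ref)
```

When the policy equals the reference, every margin is zero and the loss is $\log 2$ regardless of how strongly the reference already prefers the winner.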

Let $s_\theta=\sigma(\beta m_\theta)$. The browser demo below uses a soft target $p^*$ so the finite target log-odds is visible:

$$\mathcal L_{\mathrm{soft}} = -p^*\log s_\theta -(1-p^*)\log(1-s_\theta).$$

Ordinary hard-label DPO is the edge case $p^*\to 1$, where the loss-minimizing margin $\operatorname{logit}(p^*)/\beta$ for an isolated pair of responses diverges to $+\infty$.

From KL-regularized RLHF to DPO

For a fixed prompt $x$, write $K_x(\pi)$ for the distribution-level KL from a candidate policy to the reference policy at that prompt. The direction is $\pi(\cdot\mid x)$ to $\pi_{\mathrm{ref}}(\cdot\mid x)$:

$$K_x(\pi) = \mathrm{KL}_x(\pi\|\pi_{\mathrm{ref}}).$$

Also write the expected reward as

$$R_x(\pi) = \mathbb E_{y\sim\pi(\cdot\mid x)}[r(x,y)].$$

The KL-regularized RLHF objective is to maximize

$$J_x(\pi) = R_x(\pi)-\beta K_x(\pi).$$

Its optimizer has the reweighted form

$$\pi_r(y\mid x) = \frac{\pi_{\mathrm{ref}}(y\mid x)\exp(r(x,y)/\beta)}{Z(x)}.$$
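One standard way to see why this form is optimal, sketched here as a supporting step rather than quoted from the page: fold the reward into the log ratio so the objective becomes a single KL term against $\pi_r$. The sum assumes a discrete set of completions with full support under $\pi_{\mathrm{ref}}$.

```latex
\begin{aligned}
J_x(\pi)
&= \sum_y \pi(y\mid x)\left[r(x,y) - \beta\log\frac{\pi(y\mid x)}{\pi_{\mathrm{ref}}(y\mid x)}\right] \\
&= -\beta\sum_y \pi(y\mid x)\log\frac{\pi(y\mid x)}{\pi_{\mathrm{ref}}(y\mid x)\exp(r(x,y)/\beta)} \\
&= -\beta\,\mathrm{KL}_x(\pi\|\pi_r) + \beta\log Z(x),
\qquad Z(x)=\sum_y \pi_{\mathrm{ref}}(y\mid x)\exp(r(x,y)/\beta).
\end{aligned}
```

Because $Z(x)$ does not depend on $\pi$ and the KL term is nonnegative with equality only at $\pi=\pi_r$, the maximizer is $\pi_r$ and the maximum value is $\beta\log Z(x)$.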

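Here $Z(x)=\sum_y \pi_{\mathrm{ref}}(y\mid x)\exp(r(x,y)/\beta)$ is the normalizing partition function. Taking logs of $\pi_r$ and solving for the reward: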

$$r(x,y) = \beta\log \frac{\pi_r(y\mid x)}{\pi_{\mathrm{ref}}(y\mid x)} + \beta\log Z(x).$$

The partition term $Z(x)$ depends on the prompt, not on which completion won. Bradley-Terry preferences use reward differences, so the prompt-only term cancels between $y_w$ and $y_\ell$.
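The rearrangement can be checked numerically on a toy problem. The sketch below assumes one prompt with four completions and an arbitrary made-up reward vector; it verifies that the residual $r - \beta\log(\pi_r/\pi_{\mathrm{ref}})$ is the same for every completion and that randomly drawn policies do not beat the closed form.

```python
import numpy as np

# Hypothetical toy problem: one prompt, four completions.
ref = np.array([0.4, 0.3, 0.2, 0.1])
reward = np.array([0.5, -1.0, 2.0, 0.0])
beta = 0.7

def objective(pi):
    # J_x(pi) = E_pi[r] - beta * KL(pi || ref)
    return float(pi @ reward - beta * np.sum(pi * (np.log(pi) - np.log(ref))))

# Closed-form optimum: pi_r proportional to ref * exp(r / beta).
unnormalized = ref * np.exp(reward / beta)
Z = unnormalized.sum()
pi_r = unnormalized / Z

# Rearranged identity: the residual equals beta * log Z for every completion.
residual = reward - beta * (np.log(pi_r) - np.log(ref))

# Compare against randomly drawn full-support policies.
rng = np.random.default_rng(0)
best_random = max(objective(rng.dirichlet(np.ones(4))) for _ in range(1000))

print("beta * log Z:           ", beta * np.log(Z))
print("residual per completion:", residual)
print("J(pi_r):", objective(pi_r), " best random J:", best_random)
```

A constant residual is exactly what lets the term cancel in any winner-minus-loser difference.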

Bradley-Terry likelihood

Bradley-Terry models a preference as

$$P(y_w\succ y_\ell\mid x) = \sigma(r(x,y_w)-r(x,y_\ell)).$$

Substituting the implicit reward gives the DPO probability. With the margin $m_\theta=a_\theta-a_{\mathrm{ref}}$ from above:

$$P_\theta(y_w\succ y_\ell\mid x) =\sigma(\beta m_\theta).$$

Maximum likelihood on preference labels gives the DPO loss. This is why DPO is a supervised objective even though it comes from a KL-regularized RLHF derivation.

What beta does

Here $\beta$ follows the DPO paper's convention: it is the coefficient on the KL penalty in the KL-regularized RLHF objective.

  • Larger $\beta$ means a stronger reference anchor in the underlying RLHF objective.
  • Smaller $\beta$ means a given preference probability requires a larger policy/reference log-ratio gap.
  • In the DPO loss, $\beta$ also scales the sigmoid margin, so it affects optimization dynamics. Do not read "larger $\beta$" as simply "trust preferences more."
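To make the optimization-dynamics point concrete, the sketch below evaluates the derivative of the hard-label loss with respect to the margin, $-\beta\,(1-\sigma(\beta m))$, at a few margins. The chosen margins and $\beta$ values are arbitrary illustrations.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hard_loss_grad_wrt_margin(margin, beta):
    # d/dm of -log(sigmoid(beta * m)) is -beta * (1 - sigmoid(beta * m)).
    return -beta * (1.0 - sigmoid(beta * margin))

margins = np.array([0.0, 1.0, 4.0])
for beta in (0.1, 0.5, 2.0):
    grads = hard_loss_grad_wrt_margin(margins, beta)
    print(f"beta={beta}: dL/dm at m=0, 1, 4 ->", np.round(grads, 4))
```

At the reference ($m=0$) the gradient magnitude is $\beta/2$, so a larger $\beta$ pushes harder at first. Once the margin is positive, a larger $\beta$ also saturates the sigmoid sooner, so the push fades at a smaller margin.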

For a soft preference probability $p^*$, matching $p^*$ requires

$$m^*=\frac{\operatorname{logit}(p^*)}{\beta}.$$
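A quick numerical check of this target, assuming illustrative values $p^*=0.8$ and $\beta=0.5$: minimize the soft-label cross-entropy over a fine grid of margins and compare the minimizer with $\operatorname{logit}(p^*)/\beta$. The grid range and resolution are arbitrary choices.

```python
import numpy as np

def soft_loss(margin, p_star, beta):
    # Binary cross-entropy between target p_star and sigmoid(beta * margin).
    z = beta * margin
    # -log sigmoid(z) = logaddexp(0, -z); -log(1 - sigmoid(z)) = logaddexp(0, z)
    return p_star * np.logaddexp(0.0, -z) + (1.0 - p_star) * np.logaddexp(0.0, z)

def target_margin(p_star, beta):
    # m* = logit(p*) / beta
    return np.log(p_star / (1.0 - p_star)) / beta

p_star, beta = 0.8, 0.5                    # hypothetical settings
grid = np.linspace(-10.0, 10.0, 200001)    # margin grid with spacing 1e-4
numeric_argmin = grid[np.argmin(soft_loss(grid, p_star, beta))]
closed_form = target_margin(p_star, beta)

print("grid minimizer:  ", numeric_argmin)
print("logit(p*) / beta:", closed_form)
```

Halving $\beta$ doubles the target margin, which is the quantitative version of "smaller $\beta$ requires a larger log-ratio gap."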

Limits of the mechanism

DPO removes the explicit reward-model training stage and the RL loop. It does not remove the assumptions behind preference learning.

It still relies on a pairwise preference model such as Bradley-Terry. It only identifies rewards up to prompt-dependent constants. It needs a usable reference policy. And in an isolated two-response hard-label setting, the logistic loss can keep increasing the winner-over-loser log-ratio rather than settling at a finite target.
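The last limitation is easy to see in isolation. The sketch below runs plain gradient descent directly on the margin $m$ for a single pair, once with a hard label ($p^*=1$) and once with a soft label ($p^*=0.9$). Treating the margin itself as the free parameter, and the step size and step counts used, are simplifying assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def run_gradient_descent(p_star, beta=0.5, lr=1.0, steps=(100, 1000, 10000)):
    """Track the margin m under gradient descent on the two-response loss."""
    margin, snapshots = 0.0, {}
    for step in range(1, max(steps) + 1):
        # d/dm of the cross-entropy is beta * (sigmoid(beta * m) - p_star).
        margin -= lr * beta * (sigmoid(beta * margin) - p_star)
        if step in steps:
            snapshots[step] = margin
    return snapshots

hard = run_gradient_descent(p_star=1.0)   # ordinary hard-label DPO
soft = run_gradient_descent(p_star=0.9)   # finite target logit(0.9) / beta
print("hard-label margins:", hard)
print("soft-label margins:", soft)
```

With the soft label the margin settles at $\operatorname{logit}(0.9)/\beta$. With the hard label the gradient never reaches zero, so the margin keeps growing, only more slowly as the sigmoid saturates.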

03

Code

Keep the implementation aligned with the notation so the algorithm is legible.

This finite-action witness verifies the derivation directly in a clean realizable toy setting. A synthetic reward defines the KL-regularized target policy. DPO sees full-support, exact soft Bradley-Terry preference probabilities and recovers the same unconstrained categorical policy through reference-relative log-ratio cross-entropy. Sampled hard labels, missing pairs, or a restricted neural parameterization would not make the equality exact in finite data.

import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def kl(p, q):
    return float(np.sum(p * (np.log(p) - np.log(q))))

# One prompt, three possible completions.
# Shapes: ref, reward, logits, pi are all (K,).
ref = np.array([0.15, 0.70, 0.15], dtype=float)
reward = np.array([1.0, 0.2, -0.5], dtype=float)
beta = 0.5
log_ref = np.log(ref)

# KL-regularized RLHF optimum:
# pi*(y|x) is proportional to pi_ref(y|x) exp(r(x,y)/beta).
pi_star = ref * np.exp(reward / beta)
pi_star = pi_star / pi_star.sum()

# DPO sees pairwise preferences. We use exact Bradley-Terry
# probabilities instead of sampled hard labels so the equality is visible.
pairs = [(0, 1), (0, 2), (1, 2)]

# Initialize policy at the reference.
logits = log_ref.copy()
lr = 0.2

for _ in range(3000):
    grad = np.zeros_like(logits)

    for i, j in pairs:
        rel_log_odds = (logits[i] - logits[j]) - (log_ref[i] - log_ref[j])
        pred = sigmoid(beta * rel_log_odds)

        target = sigmoid(reward[i] - reward[j])

        # Binary cross-entropy gradient for the soft preference target.
        g = beta * (pred - target)
        grad[i] += g
        grad[j] -= g

    logits -= lr * grad / len(pairs)
    logits -= logits.mean()  # logits are identifiable only up to a constant

pi = softmax(logits)

print("reference policy:      ", np.round(ref, 3))
print("KL-reg optimum:        ", np.round(pi_star, 3))
print("DPO learned policy:    ", np.round(pi, 3))
print("KL(pi || ref):         ", round(kl(pi, ref), 4))

assert np.allclose(pi, pi_star, atol=1e-6)

for i, j in pairs:
    implicit_gap = beta * (
        (np.log(pi[i]) - np.log(ref[i]))
        - (np.log(pi[j]) - np.log(ref[j]))
    )
    reward_gap = reward[i] - reward[j]
    assert abs(implicit_gap - reward_gap) < 1e-6

for i, j in pairs:
    dpo_pref = sigmoid(beta * (
        (np.log(pi[i]) - np.log(ref[i]))
        - (np.log(pi[j]) - np.log(ref[j]))
    ))
    bt_pref = sigmoid(reward[i] - reward[j])
    assert abs(dpo_pref - bt_pref) < 1e-6

04

Interactive Demo

Use direct manipulation to connect the explanation to a moving system.

Use the demo as a ratio machine. It collapses the world to two completions, $y_w$ and $y_\ell$, so its KL is the binary KL over this toy pair, not the full language-model KL over all completions. Change the reference winner probability, the soft target preference, and $\beta$. Watch how the required policy log-odds moves relative to the reference, how the DPO probability changes, and when the two-action KL from the reference spikes.
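For readers without the live demo, the same quantities can be computed offline. This is a sketch of what such a two-completion calculation might look like, not the demo's actual code: the function name, the control values, and the assumption that the policy exactly matches the soft target are all illustrative.

```python
import math

def logit(p):
    return math.log(p / (1.0 - p))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def two_action_state(ref_winner_prob, p_star, beta):
    """Policy that exactly matches a soft target in a two-completion world."""
    ref_log_odds = logit(ref_winner_prob)
    policy_log_odds = ref_log_odds + logit(p_star) / beta   # a_ref + m*
    pi_w = sigmoid(policy_log_odds)
    # Binary KL(pi || ref) over the pair {winner, loser}.
    kl = (pi_w * math.log(pi_w / ref_winner_prob)
          + (1.0 - pi_w) * math.log((1.0 - pi_w) / (1.0 - ref_winner_prob)))
    dpo_prob = sigmoid(beta * (policy_log_odds - ref_log_odds))
    return policy_log_odds, pi_w, dpo_prob, kl

results = {}
for beta in (2.0, 0.5, 0.1):
    results[beta] = two_action_state(ref_winner_prob=0.3, p_star=0.8, beta=beta)
    log_odds, pi_w, dpo_prob, kl = results[beta]
    print(f"beta={beta}: log-odds={log_odds:.3f} pi_w={pi_w:.4f} "
          f"P_dpo={dpo_prob:.3f} KL={kl:.4f}")
```

The DPO probability equals the target for every $\beta$, while the policy log-odds and the binary KL both grow as $\beta$ shrinks.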

Source Grounding

Canonical references for the mechanism on this page.

Rafailov et al. (2023). Direct Preference Optimization: Your Language Model is Secretly a Reward Model. Paper.

Grounds DPO as a closed-form preference objective derived from KL-regularized RLHF.

Claim Review

Status: 1 substantive review recorded. Sources: 1 reference (rafailov-2023-dpo). Witnesses: 4 local objects (equation, code, and demo).

Reviewed claim (source checked): DPO reparameterizes the KL-regularized RLHF reward as beta times the policy/reference log-ratio, up to a prompt-only constant, then fits preference pairs with binary cross-entropy on the beta-scaled winner-loser reference-relative log-odds margin.

Rafailov et al. derive the KL-regularized optimum pi_r proportional to pi_ref exp(r/beta), rearrange it as r=beta log(pi_r/pi_ref)+beta log Z(x), then substitute into Bradley-Terry pairwise preferences so Z(x) cancels, yielding the DPO hard-label logistic/BCE loss on beta times the policy/reference winner-loser log-ratio difference. The page's math, code, and demo instantiate this margin as teaching witnesses.

Source: Direct Preference Optimization: Your Language Model is Secretly a Reward Model.

Scope: the review checks the pairwise Bradley-Terry DPO derivation plus finite-action/two-completion intuition only. It does not cover finite-sample convergence, noisy hard labels, neural realizability, ranking variants, empirical superiority to PPO/RLHF, reward-hacking prevention, or broader alignment guarantees.

Reviewer: codex+oracle; reviewed 2026-05-07
