Alignment

Kahneman-Tversky Optimization

KTO turns binary desirable/undesirable labels into a reference-relative utility loss: push a labeled output's policy/reference log-ratio above or below a KL-derived baseline, with saturating gradients.

status: published · importance: important · difficulty: 4/5 · math: undergraduate · read: 18m · live demo


Editorial alignment illustration of pointwise desirable and undesirable feedback shaped by an asymmetric value curve.


Intuition

Build the mental picture first so the rest of the page has something to attach to.


Canonical source: Ethayarajh et al., "KTO: Model Alignment as Prospect Theoretic Optimization", ICML 2024.

The prospect-theory borrowing is specific: KTO uses a reference point and a saturating value function. It is not a claim that the Kahneman-Tversky utility curve is the true psychology of how humans judge text.

DPO teaches one powerful measurement: the policy/reference log-ratio $\log(\pi_\theta(y\mid x)/\pi_{\mathrm{ref}}(y\mid x))$.

That number says how much more the current policy likes a completion than the reference policy did. DPO uses it in pairs: the winner should beat the loser by more than the reference already made it beat the loser.

KTO asks a different question:

Can we use the same policy/reference ratio when the data is only thumbs-up or thumbs-down?

The answer is to compare one labeled output against a reference point. For a desirable output, KTO wants the output's policy/reference log-ratio to rise above the policy's average drift from the reference. For an undesirable output, KTO wants that log-ratio to fall below the same baseline.

The mechanism is:

KTO uses a label-dependent saturating utility of the margin $r_\theta(x,y)-z_0$, where $r_\theta$ is the policy/reference log-ratio and $z_0$ is a KL-derived reference point.

The saturation matters. Once a desirable example is already far above the baseline, or an undesirable example is already far below it, the gradient shrinks. KTO focuses updates near the boundary $r_\theta(x,y)\approx z_0$. That can protect against noisy labels, but it can also underfit examples that are hard yet important.
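To make the saturation concrete, here is a minimal sketch that evaluates the gradient scale $\beta\,s(1-s)$ with $s=\sigma(\beta\delta)$ at a few margins $\delta=r_\theta-z_0$. The value of `beta` and the margins are illustrative assumptions, and the class weight is taken to be 1.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

beta = 0.2  # illustrative saturation parameter (assumption)
delta = np.array([-20.0, -5.0, 0.0, 5.0, 20.0])  # margins r_theta - z0

s = sigmoid(beta * delta)
# Magnitude of d loss / d r_theta with unit class weight and z0 held fixed.
grad_scale = beta * s * (1.0 - s)

for d, g in zip(delta, grad_scale):
    print(f"delta={d:+6.1f}  |grad|={g:.4f}")
```

The magnitude peaks at $\delta=0$ and falls off symmetrically on both sides, which is why examples far from the baseline receive small updates regardless of their label.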

Math

Translate the story into symbols, assumptions, and a derivation you can inspect.


Setup

For a prompt $x$, completion $y$, trainable policy $\pi_\theta$, and reference policy $\pi_{\mathrm{ref}}$, define the KTO implied reward:

$$
\begin{aligned}
r_\theta(x,y) &= \log\frac{\pi_\theta(y\mid x)}{\pi_{\mathrm{ref}}(y\mid x)},\\
z_0 &= \mathrm{KL}(\pi_\theta(\cdot\mid x)\|\pi_{\mathrm{ref}}(\cdot\mid x)),\\
\delta &= r_\theta(x,y)-z_0,\\
v_D &= \lambda_D\,\sigma(\beta\delta),\\
v_U &= \lambda_U\,\sigma(-\beta\delta),\\
\mathcal L_y &= \lambda_y-v_y.
\end{aligned}
$$

With $z_0$ treated as fixed for the update and $s=\sigma(\beta\delta)$, the label-dependent derivative through $r_\theta$ is

$$
\begin{aligned}
\frac{\partial \mathcal L_D}{\partial r_\theta} &= -\lambda_D\beta s(1-s),\\
\frac{\partial \mathcal L_U}{\partial r_\theta} &= \lambda_U\beta s(1-s).
\end{aligned}
$$

For an autoregressive language model, both log-probabilities are sequence log-probabilities: sums over the completion tokens.
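As a rough illustration of "sums over the completion tokens", the sketch below adds per-token log-probabilities under a completion mask and forms the log-ratio. The helper name `sequence_logprob`, the token values, and the mask layout (prompt positions masked out) are assumptions made for this example, not part of the page's code witness.

```python
import numpy as np

def sequence_logprob(token_logps, completion_mask):
    """Sum per-token log-probabilities over completion positions only."""
    token_logps = np.asarray(token_logps, dtype=float)
    completion_mask = np.asarray(completion_mask, dtype=float)
    return float(np.sum(token_logps * completion_mask))

# Hypothetical per-token log-probs for one sequence of five positions.
# The first two positions are prompt tokens (mask 0); the last three
# are completion tokens (mask 1).
policy_token_logps = [-0.5, -1.0, -2.0, -0.7, -1.1]
ref_token_logps    = [-0.5, -1.0, -2.4, -0.9, -1.0]
mask               = [0, 0, 1, 1, 1]

logp_theta = sequence_logprob(policy_token_logps, mask)
logp_ref = sequence_logprob(ref_token_logps, mask)
r = logp_theta - logp_ref  # implied reward r_theta(x, y)

print("logp_theta:", round(logp_theta, 3))
print("logp_ref:  ", round(logp_ref, 3))
print("r_theta:   ", round(r, 3))
```

Because the sum runs over the whole completion, longer completions tend to produce larger-magnitude log-probabilities; the log-ratio partly cancels this, since both policies score the same tokens.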

Each training example has a binary label $\ell\in\{D,U\}$, where $D$ means desirable and $U$ means undesirable.

The KL reference point

KTO does not compare $y$ to one rejected completion. It compares $y$ to the policy's average reference-relative reward:

$$z_0 = \mathrm{KL}(\pi_\theta(\cdot\mid x)\|\pi_{\mathrm{ref}}(\cdot\mid x)).$$

Equivalently, this is the current policy's expected implied reward at prompt $x$:

$$z_0 = \mathbb E_{y'\sim\pi_\theta(\cdot\mid x)} [r_\theta(x,y')].$$
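The equivalence between the KL divergence and the expected implied reward can be checked numerically on a toy problem. The sketch below assumes a prompt with only three possible completions and made-up probabilities; it compares the exact sum with a Monte Carlo average over samples drawn from the policy.

```python
import numpy as np

# Toy distributions over three possible completions (assumed values).
pi_theta = np.array([0.6, 0.3, 0.1])
pi_ref   = np.array([0.4, 0.4, 0.2])

# Implied reward of each completion.
r = np.log(pi_theta / pi_ref)

# Exact KL(pi_theta || pi_ref) = sum_y pi_theta(y) * r(y).
z0_exact = float(np.sum(pi_theta * r))

# Monte Carlo estimate: average r over completions sampled from pi_theta.
rng = np.random.default_rng(0)
samples = rng.choice(3, size=200_000, p=pi_theta)
z0_mc = float(np.mean(r[samples]))

print("exact KL:        ", round(z0_exact, 4))
print("sampled estimate:", round(z0_mc, 4))
```

Individual completions can have negative implied reward, but the policy-weighted average is a KL divergence and therefore cannot be negative.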

The important scalar is

$$\delta = r_\theta(x,y)-z_0.$$

A desirable output should have positive $\delta$. An undesirable output should have negative $\delta$.

In practical KTO training, $z_0$ is usually estimated from mismatched outputs in the same microbatch, clamped to be nonnegative, and treated as a stop-gradient quantity. It controls where the loss saturates; it is not a quantity that KTO backpropagates through.

For a microbatch of size $m\ge 2$, define $\rho_i$ as the policy/reference log-ratio of a shifted output $y_{j(i)}$ under prompt $x_i$, where $j(i)\neq i$ is a shifted index such as $j(i)=(i+1)\bmod m$:

$$\rho_i = \log \frac{\pi_\theta(y_{j(i)}\mid x_i)}{\pi_{\mathrm{ref}}(y_{j(i)}\mid x_i)}.$$

A toy version of the paper's reference estimate is then

$$\hat z_0=\max(0,\operatorname{mean}_i \rho_i).$$

The mismatch is deliberate. Reusing the labeled pair $(x_i,y_i)$ would bias the estimate, because those completions were deliberately selected as good or bad and can have unrepresentative rewards.
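The shifted-index estimate can be sketched with a small matrix of log-ratios. Everything here is an assumption for illustration: the matrix `R` stands in for the forward passes a real trainer would run on mismatched pairs, and the shift-by-one pairing is one simple choice of $j(i)$.

```python
import numpy as np

# Hypothetical log-ratios: R[i, j] = log pi_theta(y_j | x_i) - log pi_ref(y_j | x_i).
# The diagonal holds the labeled pairs (x_i, y_i); off-diagonals are mismatched.
R = np.array([
    [ 1.2,  0.1, -0.2,  0.0],
    [ 0.3, -1.5,  0.2,  0.1],
    [-0.1,  0.4,  0.9,  0.3],
    [ 0.2,  0.0,  0.1, -0.8],
])

m = R.shape[0]
i = np.arange(m)
j = (i + 1) % m  # shifted index: pair prompt x_i with completion y_{i+1}

matched_r = R[i, i]  # rewards of the labeled pairs, used in the loss
rho = R[i, j]        # rewards of mismatched pairs, used only for z0

# Clamp at zero because a KL divergence cannot be negative.
z0_hat = max(0.0, float(rho.mean()))

print("matched r:", matched_r)
print("rho:      ", rho)
print("z0_hat:   ", round(z0_hat, 3))
```

In this toy batch the labeled pairs have extreme rewards in both directions, while the mismatched pairs cluster near a small positive value, which is the behavior the estimate relies on.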

Label-dependent value and loss

KTO uses a logistic value function. For a desirable example,

$$v_D = \lambda_D\,\sigma(\beta\delta).$$

For an undesirable example,

$$v_U = \lambda_U\,\sigma(-\beta\delta).$$

The default KTO loss subtracts this value from the label's class weight. Here the subscript $y$ stands for the example's label ($D$ or $U$), the same label written $\ell$ above:

$$\mathcal L_y = \lambda_y - v_y.$$

Over the dataset, this is:

$$\mathcal L_{\mathrm{KTO}} = \mathbb E_{(x,y,\ell)\sim D} [\lambda_\ell-v_\ell(x,y)].$$

So the two label cases are:

$$\mathcal L_D = \lambda_D(1-\sigma(\beta\delta)),$$

and

$$\mathcal L_U = \lambda_U(1-\sigma(-\beta\delta)).$$

Gradient descent pushes desirable examples upward in reference-relative reward and undesirable examples downward. If $z_0$ is treated as fixed for the update, let $s=\sigma(\beta\delta)$. Then

$$\frac{\partial \mathcal L_D}{\partial r_\theta} = -\lambda_D\beta s(1-s),$$

while

$$\frac{\partial \mathcal L_U}{\partial r_\theta} = \lambda_U\beta s(1-s).$$

Both gradients have their largest magnitude near $\delta=0$, then saturate.
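One way to sanity-check these derivatives is a central finite difference on the scalar loss. This is a sketch under the same assumptions as the text ($z_0$ held fixed); the function names and the sample values of `r` and `z0` are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def kto_loss(r, z0, desirable, beta=0.2, lam_D=1.0, lam_U=1.0):
    """Per-example loss lambda_y - v_y for a single scalar reward r."""
    delta = r - z0
    if desirable:
        return lam_D * (1.0 - sigmoid(beta * delta))
    return lam_U * (1.0 - sigmoid(-beta * delta))

def analytic_grad(r, z0, desirable, beta=0.2, lam_D=1.0, lam_U=1.0):
    """Closed-form d loss / d r with z0 treated as a constant."""
    s = sigmoid(beta * (r - z0))
    sign_and_weight = -lam_D if desirable else lam_U
    return sign_and_weight * beta * s * (1.0 - s)

r, z0, eps = 0.4, 0.275, 1e-6
for desirable in (True, False):
    numeric = (kto_loss(r + eps, z0, desirable)
               - kto_loss(r - eps, z0, desirable)) / (2.0 * eps)
    exact = analytic_grad(r, z0, desirable)
    print(f"desirable={desirable}: numeric={numeric:+.6f} analytic={exact:+.6f}")
```

The numeric and analytic values should agree to several decimal places, with a negative sign for the desirable case and a positive sign for the undesirable case.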

Here $\beta$ is KTO's value-function saturation parameter. It has a practical anchoring effect similar to DPO's $\beta$, but in KTO it is introduced directly to control how quickly utility saturates.

The weights $\lambda_D$ and $\lambda_U$ are class/utility weights, not a universal moral setting. They are often chosen to account for the ratio of desirable to undesirable examples. Increasing $\lambda_U$ emphasizes undesirable examples; increasing $\lambda_D$ emphasizes desirable examples.
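As one possible way to account for class imbalance, the sketch below picks weights so that the two classes carry equal total weight, $\lambda_D n_D=\lambda_U n_U$. This balancing rule and the helper name `balanced_weights` are assumptions of this example, not a recommendation taken from the page; other ratios may work better in practice.

```python
def balanced_weights(n_desirable: int, n_undesirable: int):
    """Return (lambda_D, lambda_U) with lambda_D * n_D == lambda_U * n_U.

    The majority class keeps weight 1.0 and the minority class is
    up-weighted. This is a simple heuristic, not the only valid choice.
    """
    if n_desirable <= 0 or n_undesirable <= 0:
        raise ValueError("both classes need at least one example")
    if n_desirable >= n_undesirable:
        return 1.0, n_desirable / n_undesirable
    return n_undesirable / n_desirable, 1.0

# Hypothetical dataset with nine times more thumbs-up than thumbs-down.
lam_D, lam_U = balanced_weights(900, 100)
print("lambda_D:", lam_D, "lambda_U:", lam_U)
print("total desirable weight:  ", lam_D * 900)
print("total undesirable weight:", lam_U * 100)
```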

How this differs from DPO

DPO uses a pairwise margin between a preferred completion $y_w$ and a rejected completion $y_l$:

$$r_\theta(x,y_w)-r_\theta(x,y_l).$$

KTO uses a pointwise margin:

$$r_\theta(x,y)-z_0.$$

That is the bridge. DPO asks whether the winner beats the loser by enough. KTO asks whether a single labeled output sits on the correct side of a KL-derived baseline.
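The two margins can be placed side by side on the same numbers. The sketch below uses the standard DPO loss $-\log\sigma(\beta(r_w-r_l))$ and the KTO losses from this page with unit class weights; the log-ratios and `z0` are made-up values.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

beta = 0.2
r_winner, r_loser = 1.5, -0.8  # hypothetical implied rewards for one pair
z0 = 0.3                       # hypothetical KL reference point

# DPO: one pairwise margin, one loss for the whole pair.
dpo_margin = r_winner - r_loser
dpo_loss = -np.log(sigmoid(beta * dpo_margin))

# KTO: one pointwise margin and one loss per labeled completion.
kto_margin_winner = r_winner - z0
kto_margin_loser = r_loser - z0
kto_loss_winner = 1.0 - sigmoid(beta * kto_margin_winner)  # label D
kto_loss_loser = 1.0 - sigmoid(-beta * kto_margin_loser)   # label U

print("DPO margin:", round(dpo_margin, 3), "loss:", round(float(dpo_loss), 4))
print("KTO margins:", round(kto_margin_winner, 3), round(kto_margin_loser, 3))
print("KTO losses: ", round(float(kto_loss_winner), 4), round(float(kto_loss_loser), 4))
```

Note that the KTO terms never reference each other: either completion could be dropped from the dataset and the other's loss would be unchanged, which is what makes unpaired feedback usable.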

Limits of the mechanism

KTO is natural when feedback is already binary or when desirable and undesirable examples are imbalanced. If preference data is clean, consistent, and pairwise, DPO may be the better fit. If labels are noisy or intransitive, KTO's saturation can be useful because extreme examples receive small updates.

When KTO is trained from preference pairs, a common conversion is to treat the preferred response as desirable and the rejected response as undesirable. That is a simplifying assumption, not a law of the method. Naturally binary feedback is the cleaner setting for KTO.
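The pair-to-binary conversion described above might look like the following. The field names (`prompt`, `chosen`, `rejected`, `completion`, `desirable`) are hypothetical and chosen only for this sketch; real datasets and libraries may use different keys.

```python
from typing import Dict, List

def pairs_to_binary(pairs: List[Dict[str, str]]) -> List[Dict[str, object]]:
    """Split each preference pair into two independently labeled examples."""
    examples: List[Dict[str, object]] = []
    for pair in pairs:
        # Preferred response becomes a desirable example.
        examples.append({"prompt": pair["prompt"],
                         "completion": pair["chosen"],
                         "desirable": True})
        # Rejected response becomes an undesirable example.
        examples.append({"prompt": pair["prompt"],
                         "completion": pair["rejected"],
                         "desirable": False})
    return examples

pairs = [
    {"prompt": "Summarize the memo.", "chosen": "Short summary.", "rejected": "Off-topic reply."},
    {"prompt": "Translate 'hello'.", "chosen": "bonjour", "rejected": "hello"},
]
binary = pairs_to_binary(pairs)
for example in binary:
    print(example)
```

The conversion discards the information that two completions were compared with each other, and it labels a rejected response as undesirable even if it was merely the weaker of two acceptable answers.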

The same saturation can also backfire. A hard desirable example with a very negative implied reward sits in the flat tail of the sigmoid, so it receives almost no gradient and may be ignored instead of learned. Lower $\beta$ and more epochs can reduce this underfitting risk, but they do not remove it.
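A small numeric sketch shows how $\beta$ interacts with a hard example. It assumes a desirable example with margin $\delta=-15$ and unit class weight, and evaluates the gradient magnitude $\lambda\beta s(1-s)$ at three illustrative values of $\beta$.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def grad_magnitude(delta, beta, lam=1.0):
    """|d loss / d r_theta| for one example, with z0 held fixed."""
    s = sigmoid(beta * delta)
    return lam * beta * s * (1.0 - s)

hard_delta = -15.0  # hypothetical hard desirable example, far below the baseline
for beta in (0.5, 0.1, 0.02):
    print(f"beta={beta:<5} |grad|={grad_magnitude(hard_delta, beta):.5f}")
```

In this toy case the gradient is much larger at $\beta=0.1$ than at $\beta=0.5$, but shrinking $\beta$ further to $0.02$ lowers it again, because $\beta$ also scales the whole gradient. That is consistent with lower $\beta$ reducing the underfitting risk without removing it.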

A natural next step is Process Reward Models: instead of labeling a whole completion as desirable or undesirable, ask what changes when feedback attaches to intermediate reasoning steps.

Code

Keep the implementation aligned with the notation so the algorithm is legible.


This small witness mirrors the math: compute the policy/reference log-ratio, estimate a toy KL baseline, apply the label-dependent KTO loss, and inspect the gradient direction.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Batch of four labeled prompt/completion examples.
# Shapes:
#   logp_theta, logp_ref, desirable are all (B,)
#   desirable=True means label D; False means label U.
logp_theta = np.array([-8.2, -5.1, -7.0, -6.4])
logp_ref   = np.array([-8.6, -5.0, -6.1, -6.9])
desirable  = np.array([ True, True, False, False])

beta = 0.2
lambda_D = 1.0
lambda_U = 1.0

# KTO implied reward: policy/reference sequence log-ratio.
r = logp_theta - logp_ref

# Toy microbatch estimate of the KL reference point.
# These are log-probabilities for mismatched pairs (x_i, y_j),
# not the original labeled pairs (x_i, y_i).
mismatch_logp_theta = np.array([-6.0, -7.2, -5.8, -8.0])
mismatch_logp_ref   = np.array([-6.3, -7.6, -6.0, -8.2])
mismatch_r = mismatch_logp_theta - mismatch_logp_ref
z0_hat = max(0.0, float(np.mean(mismatch_r)))

delta = r - z0_hat

value_D = lambda_D * sigmoid(beta * delta)
value_U = lambda_U * sigmoid(-beta * delta)

value = np.where(desirable, value_D, value_U)
lambda_y = np.where(desirable, lambda_D, lambda_U)
loss = lambda_y - value

# Stop-gradient z0: derivative is only through r.
s = sigmoid(beta * delta)
grad_D = -lambda_D * beta * s * (1.0 - s)
grad_U =  lambda_U * beta * s * (1.0 - s)
grad_r = np.where(desirable, grad_D, grad_U)

print("r_theta:", np.round(r, 3))
print("mismatched r for z0:", np.round(mismatch_r, 3))
print("z0_hat:", round(z0_hat, 3))
print("delta = r - z0:", np.round(delta, 3))
print("per-example KTO loss:", np.round(loss, 3))
print("d loss / d r:", np.round(grad_r, 3))
print("mean loss:", round(float(loss.mean()), 3))

assert np.all(grad_r[desirable] < 0)
assert np.all(grad_r[~desirable] > 0)

For desirable examples, the gradient is negative, so gradient descent increases $r_\theta$. For undesirable examples, the gradient is positive, so gradient descent decreases $r_\theta$. In both cases, the magnitude is largest near $r_\theta=z_0$.

Interactive Demo

Use direct manipulation to connect the explanation to a moving system.


Use the demo to inspect one labeled example. Change the label, the implied reward $r_\theta$, the KL baseline $z_0$, $\beta$, and the class weights. Watch the KTO loss and gradient flip direction between desirable and undesirable examples, while saturating away from the baseline.


Source Grounding

Canonical references for the mechanism on this page.

paper · 2024 · KTO: Model Alignment as Prospect Theoretic Optimization · Ethayarajh et al.

Grounds KTO as a HALO objective that learns from binary desirable/undesirable feedback instead of pairwise preferences.


Claim Review

KTO turns binary desirable/undesirable labels into a reference-relative utility loss: push a labeled output's policy/reference log-ratio above or below a KL-derived baseline, with saturating gradients.

Status: 1 substantive review recorded

Claims without a substantive review badge still need exact source-support review.

Sources: 1 reference (ethayarajh-2024-kto)

Witnesses: 4 local objects

Use equation, code, and demo objects to check whether the source support is operational.

Substantively reviewed: KTO trains from binary desirable/undesirable feedback using a label-dependent logistic value of the margin r_theta(x,y)-z0, where r_theta is a policy/reference log-ratio and z0 is KL-derived; desirable updates increase the margin, undesirable updates decrease it, and beta controls saturation.

Claim metadata: source checked

Ethayarajh et al. define KTO with binary desirable/undesirable labels, r_theta=log(pi_theta/pi_ref), z0 as a KL reference, and v_D/v_U logistic values containing beta. The page's first two exported math refs now cover that objective plus dL_D/dr_theta<0 and dL_U/dr_theta>0 under stop-gradient z0; code/demo show saturation and directions.

Sources: KTO: Model Alignment as Prospect Theoretic Optimization

Covers only the KTO objective/update mechanics. It does not review claims that KTO outperforms DPO, prospect theory is psychologically true for text, every library implements z0 this way, or that KTO prevents reward hacking or guarantees alignment.

A bounded review summary is present; still check caveats and exact source scope.

Ethayarajh et al. support KTO as binary desirable/undesirable optimization with r_theta=log(pi_theta/pi_ref), KL reference z0, logistic label values, and beta saturation. The first two exported equations now witness the objective and stop-gradient derivative signs; code/demo mirror the update directions.

Reviewer: codex+oracle; reviewed 2026-05-07

