Alignment

Kahneman-Tversky Optimization

KTO turns binary desirable/undesirable labels into a reference-relative utility loss: push a labeled output's policy/reference log-ratio above or below a KL-derived baseline, with saturating gradients.


Intuition

Canonical source: Ethayarajh et al., "KTO: Model Alignment as Prospect Theoretic Optimization", ICML 2024.

The prospect-theory borrowing is specific: KTO uses a reference point and a saturating value function. It is not a claim that the Kahneman-Tversky utility curve is the true psychology of how humans judge text.

DPO teaches one powerful measurement: the policy/reference log-ratio $\log(\pi_\theta(y\mid x)/\pi_{\mathrm{ref}}(y\mid x))$.

That number says how much more the current policy likes a completion than the reference policy does. DPO uses it in pairs: the winner's log-ratio should exceed the loser's, so the policy must favor the winner over the loser more strongly than the reference already does.

KTO asks a different question:

Can we use the same policy/reference ratio when the data is only thumbs-up or thumbs-down?

The answer is to compare one labeled output against a reference point. For a desirable output, KTO wants the output's policy/reference log-ratio to rise above the policy's average drift from the reference. For an undesirable output, KTO wants that log-ratio to fall below the same baseline.

The mechanism is:

KTO uses a label-dependent saturating utility of the margin $r_\theta(x,y)-z_0$, where $r_\theta$ is the policy/reference log-ratio and $z_0$ is a KL-derived reference point.

The saturation matters. Once a desirable example is already far above the baseline, or an undesirable example is already far below it, the gradient shrinks. KTO focuses updates near the boundary $r_\theta(x,y)\approx z_0$. That can protect against noisy labels, but it can also underfit examples that are hard yet important.

Math

Setup

For a prompt $x$, completion $y$, trainable policy $\pi_\theta$, and reference policy $\pi_{\mathrm{ref}}$, define the KTO implied reward:

$$
\begin{aligned}
r_\theta(x,y) &= \log\frac{\pi_\theta(y\mid x)}{\pi_{\mathrm{ref}}(y\mid x)},\\
z_0 &= \mathrm{KL}(\pi_\theta(\cdot\mid x)\|\pi_{\mathrm{ref}}(\cdot\mid x)),\\
\delta &= r_\theta(x,y)-z_0,\\
v_D &= \lambda_D\,\sigma(\beta\delta),\\
v_U &= \lambda_U\,\sigma(-\beta\delta),\\
\mathcal L_y &= \lambda_y-v_y.
\end{aligned}
$$

With $z_0$ treated as fixed for the update and $s=\sigma(\beta\delta)$, the label-dependent derivative through $r_\theta$ is

$$
\begin{aligned}
\frac{\partial \mathcal L_D}{\partial r_\theta} &= -\lambda_D\beta s(1-s),\\
\frac{\partial \mathcal L_U}{\partial r_\theta} &= \lambda_U\beta s(1-s).
\end{aligned}
$$

For an autoregressive language model, both log-probabilities are sequence log-probabilities: sums over the completion tokens.
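
A minimal sketch of that sum, assuming per-token log-probabilities are already available as arrays and that a 0/1 mask marks which positions belong to the completion. The helper name `sequence_logprob`, the mask convention, and all numbers are hypothetical and chosen only for illustration.

```python
import numpy as np

def sequence_logprob(token_logps, completion_mask):
    """Sum per-token log-probabilities over completion tokens only."""
    token_logps = np.asarray(token_logps, dtype=float)
    completion_mask = np.asarray(completion_mask, dtype=float)
    return float(np.sum(token_logps * completion_mask))

# Hypothetical per-token log-probs for one prompt+completion sequence.
# The first two positions are prompt tokens and are masked out.
policy_token_logps = [-0.9, -1.1, -0.5, -1.5, -0.4]
ref_token_logps    = [-0.9, -1.1, -0.7, -1.6, -0.6]
mask               = [0, 0, 1, 1, 1]

logp_theta = sequence_logprob(policy_token_logps, mask)
logp_ref = sequence_logprob(ref_token_logps, mask)
r_theta = logp_theta - logp_ref  # implied reward for this (x, y)
print(logp_theta, logp_ref, r_theta)
```

Because the implied reward is a difference of sums, longer completions can produce larger-magnitude log-ratios simply by having more tokens.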

Each training example has a binary label $\ell\in\{D,U\}$, where $D$ means desirable and $U$ means undesirable.

The KL reference point

KTO does not compare $y$ to one rejected completion. It compares $y$ to the policy's average reference-relative reward:

$$
z_0 = \mathrm{KL}(\pi_\theta(\cdot\mid x)\|\pi_{\mathrm{ref}}(\cdot\mid x)).
$$

Equivalently, this is the current policy's expected implied reward at prompt $x$:

$$
z_0 = \mathbb E_{y'\sim\pi_\theta(\cdot\mid x)}[r_\theta(x,y')].
$$
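
The equivalence can be checked numerically on a toy problem. The sketch below assumes a prompt with only four possible completions, so both distributions can be written out in full; the probabilities are hypothetical. It compares the exact KL with a Monte Carlo average of the implied reward over samples drawn from the policy.

```python
import numpy as np

# Hypothetical policy and reference distributions over 4 completions
# for a single prompt.
pi_theta = np.array([0.50, 0.25, 0.15, 0.10])
pi_ref   = np.array([0.40, 0.30, 0.20, 0.10])

# Implied reward of every possible completion y'.
r = np.log(pi_theta) - np.log(pi_ref)

# Exact KL(pi_theta || pi_ref) is the expectation of r under pi_theta.
kl_exact = float(np.sum(pi_theta * r))

# Monte Carlo estimate: sample y' ~ pi_theta and average r(y').
rng = np.random.default_rng(0)
samples = rng.choice(len(pi_theta), size=200_000, p=pi_theta)
kl_mc = float(r[samples].mean())

print("exact:", round(kl_exact, 4), "monte carlo:", round(kl_mc, 4))
```

For a real language model the completion space is far too large to enumerate, which is why a sampled or batch-based estimate is needed in practice.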

The important scalar is

$$
\delta = r_\theta(x,y)-z_0.
$$

A desirable output should have positive $\delta$. An undesirable output should have negative $\delta$.

In practical KTO training, $z_0$ is usually estimated from mismatched outputs in the same microbatch, clamped to be nonnegative, and treated as a stop-gradient quantity. It controls loss saturation; gradients are not backpropagated through it.

For a microbatch of size $m\ge 2$, define $\rho_i$ as the policy/reference log-ratio of a shifted output $y_{j(i)}$ under prompt $x_i$:

$$
\rho_i = \log\frac{\pi_\theta(y_{j(i)}\mid x_i)}{\pi_{\mathrm{ref}}(y_{j(i)}\mid x_i)}.
$$

A toy version of the paper's reference estimate is then

$$
\hat z_0=\max(0,\operatorname{mean}_i \rho_i).
$$

The point is not to reuse the labeled pair $(x_i,y_i)$, because those completions were deliberately selected as good or bad and can have unrepresentative rewards.
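
A small sketch of the mismatched-pair estimate, under two assumptions made only for illustration: the log-ratios for every (prompt, completion) combination in the microbatch are stored in a hypothetical matrix, and the shift rule is j(i) = (i + 1) mod m. A real trainer would score only the shifted pairs rather than fill a full matrix.

```python
import numpy as np

def estimate_z0(logratio_matrix):
    """Toy z0 estimate from mismatched (x_i, y_j) pairs in one microbatch.

    logratio_matrix[i, j] is assumed to hold
    log pi_theta(y_j | x_i) - log pi_ref(y_j | x_i).
    """
    m = logratio_matrix.shape[0]
    if m < 2:
        raise ValueError("need at least two examples to form mismatched pairs")
    i = np.arange(m)
    j = (i + 1) % m          # shift by one so that j(i) != i
    rho = logratio_matrix[i, j]
    return max(0.0, float(rho.mean())), rho

# Hypothetical 3x3 matrix; the diagonal (matched pairs) is never read.
L = np.array([
    [ 2.0,  0.3, -0.1],
    [ 0.2, -1.5,  0.4],
    [-0.2,  0.1,  1.0],
])
z0_hat, rho = estimate_z0(L)
print(rho, z0_hat)
```

The large diagonal entries, which belong to the deliberately labeled pairs, do not influence the estimate, and a negative mean would be clamped to zero.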

Label-dependent value and loss

KTO uses a logistic value function. For a desirable example,

$$
v_D = \lambda_D\,\sigma(\beta\delta).
$$

For an undesirable example,

$$
v_U = \lambda_U\,\sigma(-\beta\delta).
$$

The default KTO loss subtracts this value from the label's class weight, where the subscript $y$ selects $D$ or $U$ according to the label of $y$:

$$
\mathcal L_y = \lambda_y - v_y.
$$

Over the dataset $\mathcal D$, with the label written explicitly as $\ell$, this is:

$$
\mathcal L_{\mathrm{KTO}} = \mathbb E_{(x,y,\ell)\sim \mathcal D}[\lambda_\ell-v_\ell(x,y)].
$$

So the two label cases are:

$$
\mathcal L_D = \lambda_D(1-\sigma(\beta\delta)),
$$

and

$$
\mathcal L_U = \lambda_U(1-\sigma(-\beta\delta)).
$$

Gradient descent pushes desirable examples upward in reference-relative reward and undesirable examples downward. If $z_0$ is treated as fixed for the update, let $s=\sigma(\beta\delta)$. Then

$$
\frac{\partial \mathcal L_D}{\partial r_\theta} = -\lambda_D\beta s(1-s),
$$

while

$$
\frac{\partial \mathcal L_U}{\partial r_\theta} = \lambda_U\beta s(1-s).
$$

Both gradients have their largest magnitude near $\delta=0$, then saturate.
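
One way to gain confidence in these closed forms is a finite-difference check. The sketch below holds the baseline fixed, compares the analytic derivative with a central difference of the per-example loss, and prints the gradient at a few margins. The function names and the chosen values of the reward, baseline, and beta are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def kto_loss(r, z0, desirable, beta=0.2, lam_D=1.0, lam_U=1.0):
    """Per-example loss lambda_y - v_y with z0 held fixed."""
    delta = r - z0
    if desirable:
        return lam_D * (1.0 - sigmoid(beta * delta))
    return lam_U * (1.0 - sigmoid(-beta * delta))

def kto_grad(r, z0, desirable, beta=0.2, lam_D=1.0, lam_U=1.0):
    """Closed-form d loss / d r under a stop-gradient z0."""
    s = sigmoid(beta * (r - z0))
    g = beta * s * (1.0 - s)
    return -lam_D * g if desirable else lam_U * g

z0, eps = 0.5, 1e-6
for desirable in (True, False):
    for r in (-10.0, 0.5, 10.0):
        # Central difference approximation of d loss / d r.
        numeric = (kto_loss(r + eps, z0, desirable)
                   - kto_loss(r - eps, z0, desirable)) / (2 * eps)
        analytic = kto_grad(r, z0, desirable)
        assert abs(numeric - analytic) < 1e-8
        print(desirable, r, round(analytic, 5))
```

The printed magnitudes peak at the margin-zero point and are smaller at both extremes, for either label.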

Here $\beta$ is KTO's value-function saturation parameter. It has a practical anchoring effect similar to DPO's $\beta$, but in KTO it is introduced directly to control how quickly utility saturates.

The weights $\lambda_D$ and $\lambda_U$ are class/utility weights, not a universal moral setting. They are often chosen to account for the ratio of desirable to undesirable examples. Increasing $\lambda_U$ emphasizes undesirable examples; increasing $\lambda_D$ emphasizes desirable examples.
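
As one illustration of accounting for class imbalance, the sketch below picks the undesirable weight so that both classes carry equal total weight. This exact-balance rule and the helper name `balanced_lambda_U` are assumptions made for the example, not a setting prescribed by this page or by the paper.

```python
def balanced_lambda_U(n_desirable, n_undesirable, lambda_D=1.0):
    """Pick lambda_U so both classes carry equal total weight.

    Solves lambda_D * n_desirable == lambda_U * n_undesirable.
    This balancing rule is an illustrative assumption, not a prescription.
    """
    if n_desirable <= 0 or n_undesirable <= 0:
        raise ValueError("both classes need at least one example")
    return lambda_D * n_desirable / n_undesirable

# Hypothetical dataset: 9,000 thumbs-up and 1,000 thumbs-down examples.
lambda_D = 1.0
lambda_U = balanced_lambda_U(9_000, 1_000, lambda_D)
print(lambda_D, lambda_U)  # the rarer undesirable class is upweighted
```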

How this differs from DPO

DPO uses a pairwise margin:

$$
r_\theta(x,y_w)-r_\theta(x,y_\ell).
$$

KTO uses a pointwise margin:

$$
r_\theta(x,y)-z_0.
$$

That is the bridge. DPO asks whether the winner beats the loser by enough. KTO asks whether a single labeled output sits on the correct side of a KL-derived baseline.
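
The contrast can be made concrete with a few numbers. This sketch assumes hypothetical implied rewards for one winner and one loser, a fixed hypothetical baseline, and class weights of 1. The DPO line uses the standard logistic form of the DPO loss, which this page does not write out.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

beta = 0.2

# Hypothetical implied rewards (policy/reference log-ratios).
r_winner, r_loser = 1.2, -0.4
z0 = 0.3  # hypothetical KL-derived baseline, held fixed

# DPO: one loss per (winner, loser) pair, driven by the pairwise margin.
dpo_margin = r_winner - r_loser
dpo_loss = -np.log(sigmoid(beta * dpo_margin))

# KTO: one loss per labeled output, driven by its own margin against z0.
kto_loss_desirable = 1.0 - sigmoid(beta * (r_winner - z0))
kto_loss_undesirable = 1.0 - sigmoid(-beta * (r_loser - z0))

print("DPO margin:", dpo_margin, "loss:", round(float(dpo_loss), 4))
print("KTO margins:", r_winner - z0, r_loser - z0)
print("KTO losses:", round(float(kto_loss_desirable), 4),
      round(float(kto_loss_undesirable), 4))
```

The DPO loss cannot be computed without both completions, whereas each KTO loss needs only one labeled completion and the baseline.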

Limits of the mechanism

KTO is natural when feedback is already binary or when desirable and undesirable examples are imbalanced. If preference data is clean, consistent, and pairwise, DPO may be the better fit. If labels are noisy or intransitive, KTO's saturation can be useful because extreme examples receive small updates.

When KTO is trained from preference pairs, a common conversion is to treat the preferred response as desirable and the rejected response as undesirable. That is a simplifying assumption, not a law of the method. Naturally binary feedback is the cleaner setting for KTO.
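
A minimal sketch of that conversion, assuming each preference pair is a dict with the hypothetical keys `prompt`, `chosen`, and `rejected`. The output schema is also an assumption made for illustration and does not follow any particular library.

```python
def pairs_to_kto_examples(pairs):
    """Split each preference pair into two binary-labeled examples.

    `pairs` is assumed to be an iterable of dicts with the hypothetical
    keys "prompt", "chosen", and "rejected".
    """
    examples = []
    for pair in pairs:
        examples.append({"prompt": pair["prompt"],
                         "completion": pair["chosen"],
                         "desirable": True})
        examples.append({"prompt": pair["prompt"],
                         "completion": pair["rejected"],
                         "desirable": False})
    return examples

pairs = [{"prompt": "2 + 2 = ?", "chosen": "4", "rejected": "5"}]
kto_examples = pairs_to_kto_examples(pairs)
for ex in kto_examples:
    print(ex)
```

The conversion discards the information that the two completions were compared against each other, and it labels a rejected response as undesirable even when it was merely the weaker of two acceptable answers.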

The same property can fail. A hard desirable example with very negative implied reward may be ignored instead of learned. Lower $\beta$ and more epochs can reduce this underfitting risk, but they do not remove it.
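
To see the underfitting risk in numbers, the sketch below evaluates the gradient magnitude on one hard desirable example at several values of beta, with the baseline held fixed. The margin of -20 and the beta grid are hypothetical choices.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def desirable_grad_magnitude(delta, beta, lam_D=1.0):
    """|d loss / d r| for a desirable example with z0 held fixed."""
    s = sigmoid(beta * delta)
    return lam_D * beta * s * (1.0 - s)

hard_delta = -20.0  # hypothetical: policy strongly disfavors a good output
for beta in (0.5, 0.1, 0.05):
    print(beta, desirable_grad_magnitude(hard_delta, beta))
```

In this toy setting the gradient at beta = 0.1 is several hundred times larger than at beta = 0.5. Lowering beta further to 0.05 gives a slightly smaller gradient again, because the overall beta factor shrinks along with the saturation.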

A natural next step is Process Reward Models: instead of labeling a whole completion as desirable or undesirable, ask what changes when feedback attaches to intermediate reasoning steps.

Code

This small witness mirrors the math: compute the policy/reference log-ratio, estimate a toy KL baseline, apply the label-dependent KTO loss, and inspect the gradient direction.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Batch of four labeled prompt/completion examples.
# Shapes:
#   logp_theta, logp_ref, desirable are all (B,)
#   desirable=True means label D; False means label U.
logp_theta = np.array([-8.2, -5.1, -7.0, -6.4])
logp_ref   = np.array([-8.6, -5.0, -6.1, -6.9])
desirable  = np.array([ True, True, False, False])

beta = 0.2
lambda_D = 1.0
lambda_U = 1.0

# KTO implied reward: policy/reference sequence log-ratio.
r = logp_theta - logp_ref

# Toy microbatch estimate of the KL reference point.
# These are log-probabilities for mismatched pairs (x_i, y_j),
# not the original labeled pairs (x_i, y_i).
mismatch_logp_theta = np.array([-6.0, -7.2, -5.8, -8.0])
mismatch_logp_ref   = np.array([-6.3, -7.6, -6.0, -8.2])
mismatch_r = mismatch_logp_theta - mismatch_logp_ref
z0_hat = max(0.0, float(np.mean(mismatch_r)))

delta = r - z0_hat

value_D = lambda_D * sigmoid(beta * delta)
value_U = lambda_U * sigmoid(-beta * delta)

value = np.where(desirable, value_D, value_U)
lambda_y = np.where(desirable, lambda_D, lambda_U)
loss = lambda_y - value

# Stop-gradient z0: derivative is only through r.
s = sigmoid(beta * delta)
grad_D = -lambda_D * beta * s * (1.0 - s)
grad_U =  lambda_U * beta * s * (1.0 - s)
grad_r = np.where(desirable, grad_D, grad_U)

print("r_theta:", np.round(r, 3))
print("mismatched r for z0:", np.round(mismatch_r, 3))
print("z0_hat:", round(z0_hat, 3))
print("delta = r - z0:", np.round(delta, 3))
print("per-example KTO loss:", np.round(loss, 3))
print("d loss / d r:", np.round(grad_r, 3))
print("mean loss:", round(float(loss.mean()), 3))

assert np.all(grad_r[desirable] < 0)
assert np.all(grad_r[~desirable] > 0)

For desirable examples, the gradient is negative, so gradient descent increases $r_\theta$. For undesirable examples, the gradient is positive, so gradient descent decreases $r_\theta$. In both cases, the magnitude is largest near $r_\theta=z_0$.

Interactive Demo

Use the demo to inspect one labeled example. Change the label, the implied reward $r_\theta$, the KL baseline $z_0$, $\beta$, and the class weights. Watch the KTO loss and gradient flip direction between desirable and undesirable examples, while saturating away from the baseline.


Source Grounding

Canonical references for the mechanism on this page.

Ethayarajh et al., "KTO: Model Alignment as Prospect Theoretic Optimization", ICML 2024 (paper).

Grounds KTO as a HALO (human-aware loss) objective that learns from binary desirable/undesirable feedback instead of pairwise preferences.

Claim Review

Substantively reviewed: KTO trains from binary desirable/undesirable feedback using a label-dependent logistic value of the margin r_theta(x,y)-z0, where r_theta is a policy/reference log-ratio and z0 is KL-derived; desirable updates increase the margin, undesirable updates decrease it, and beta controls saturation. Claim metadata: source checked.

Ethayarajh et al. define KTO with binary desirable/undesirable labels, r_theta=log(pi_theta/pi_ref), z0 as a KL reference, and v_D/v_U logistic values containing beta. The page's first two exported math refs now cover that objective plus dL_D/dr_theta<0 and dL_U/dr_theta>0 under stop-gradient z0; code/demo show saturation and directions.

Sources: KTO: Model Alignment as Prospect Theoretic Optimization.

Covers only the KTO objective/update mechanics. It does not review claims that KTO outperforms DPO, that prospect theory is psychologically true for text, that every library implements z0 this way, or that KTO prevents reward hacking or guarantees alignment.

A bounded review summary is present; still check caveats and exact source scope.

Ethayarajh et al. support KTO as binary desirable/undesirable optimization with r_theta=log(pi_theta/pi_ref), KL reference z0, logistic label values, and beta saturation. The first two exported equations now witness the objective and stop-gradient derivative signs; code/demo mirror the update directions.

Reviewer: codex+oracle; reviewed 2026-05-07
