Bring the mental model from Direct Preference Optimization; this page will reuse it instead of restarting from zero.
Kahneman-Tversky Optimization
KTO turns binary desirable/undesirable labels into a reference-relative utility loss: push a labeled output's policy/reference log-ratio above or below a KL-derived baseline, with saturating gradients.

01
Intuition
Build the mental picture first so the rest of the page has something to attach to.
Canonical source: Ethayarajh et al., "KTO: Model Alignment as Prospect Theoretic Optimization", ICML 2024.
The prospect-theory borrowing is specific: KTO uses a reference point and a saturating value function. It is not a claim that the Kahneman-Tversky utility curve is the true psychology of how humans judge text.
DPO teaches one powerful measurement: the policy/reference log-ratio $\log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}$.
That number says how much more the current policy likes a completion than the reference policy did. DPO uses it in pairs: the winner should beat the loser by more than the reference already made it beat the loser.
KTO asks a different question:
Can we use the same policy/reference ratio when the data is only thumbs-up or thumbs-down?
The answer is to compare one labeled output against a reference point. For a desirable output, KTO wants the output's policy/reference log-ratio to rise above the policy's average drift from the reference. For an undesirable output, KTO wants that log-ratio to fall below the same baseline.
The mechanism is:
KTO uses a label-dependent saturating utility of the margin $r_\theta(x, y) - z_0$, where $r_\theta(x, y)$ is the policy/reference log-ratio and $z_0$ is a KL-derived reference point.
The saturation matters. Once a desirable example is already far above the baseline, or an undesirable example is already far below it, the gradient shrinks. KTO focuses updates near the boundary $r_\theta(x, y) = z_0$. That can protect against noisy labels, but it can also underfit examples that are hard yet important.
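A minimal numerical sketch of that saturation, assuming unit class weight and an illustrative `beta = 0.2` (neither is a recommended setting). Here `delta` stands for the output's log-ratio minus the baseline, and the printed quantity is the gradient magnitude of the logistic utility:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

beta = 0.2  # illustrative value only
delta = np.array([-30.0, -10.0, 0.0, 10.0, 30.0])  # log-ratio minus baseline

s = sigmoid(beta * delta)
grad_magnitude = beta * s * (1.0 - s)  # |d loss / d log-ratio| with unit weight

for d, g in zip(delta, grad_magnitude):
    print(f"delta={d:+6.1f}  |grad|={g:.5f}")
```

The magnitude peaks at `delta = 0` and falls off symmetrically on both sides, which is the sense in which updates concentrate near the boundary.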
02
Math
Translate the story into symbols, assumptions, and a derivation you can inspect.
Setup
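For a prompt $x$, completion $y$, trainable policy $\pi_\theta$, and reference policy $\pi_{\mathrm{ref}}$, define the KTO implied reward:
$$r_\theta(x, y) = \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}$$
With $z_0$ treated as fixed for the update and $s = \sigma\big(\beta (r_\theta(x, y) - z_0)\big)$, the label-dependent derivative through $r_\theta$ is
$$\frac{\partial L_D}{\partial r_\theta} = -\lambda_D\, \beta\, s (1 - s) < 0, \qquad \frac{\partial L_U}{\partial r_\theta} = \lambda_U\, \beta\, s (1 - s) > 0$$
The reference point $z_0$ is introduced under "The KL reference point"; the losses $L_D$, $L_U$ and the parameters $\beta$, $\lambda_D$, $\lambda_U$ are introduced under "Label-dependent value and loss".
For an autoregressive language model, both log-probabilities are sequence log-probabilities: sums over the completion tokens.
Each training example has a binary label $\ell \in \{D, U\}$, where $D$ means desirable and $U$ means undesirable.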
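As a sketch of what "sequence log-probability" means in code, the following sums per-token log-probabilities over completion positions only. The helper names, the toy shapes, and the assumption that row `t` of `logits` already scores token `t` (real models usually need a one-position shift) are illustrative choices, not part of the source:

```python
import numpy as np

def log_softmax(logits):
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def sequence_logp(logits, token_ids, completion_mask):
    """Sum token log-probs where completion_mask is 1.

    logits: (T, V), token_ids: (T,), completion_mask: (T,) of 0/1.
    """
    token_logp = log_softmax(logits)[np.arange(len(token_ids)), token_ids]
    return float((token_logp * completion_mask).sum())

rng = np.random.default_rng(0)
T, V = 6, 10                                      # toy sequence length and vocab size
token_ids = rng.integers(0, V, size=T)
completion_mask = np.array([0, 0, 1, 1, 1, 1])    # first two positions are prompt
policy_logits = rng.normal(size=(T, V))           # stand-in for the policy
ref_logits = rng.normal(size=(T, V))              # stand-in for the reference

# Implied reward: difference of the two sequence log-probabilities.
r_theta = (sequence_logp(policy_logits, token_ids, completion_mask)
           - sequence_logp(ref_logits, token_ids, completion_mask))
print("r_theta:", round(r_theta, 3))
```

Masking out prompt positions matters: the log-ratio is over the completion given the prompt, so prompt tokens should not contribute.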
The KL reference point
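KTO does not compare a completion to one rejected alternative. It compares the completion's implied reward to the policy's average reference-relative reward:
$$z_0 = \mathrm{KL}\big(\pi_\theta(y' \mid x) \,\|\, \pi_{\mathrm{ref}}(y' \mid x)\big)$$
Equivalently, this is the current policy's expected implied reward at prompt $x$:
$$z_0 = \mathbb{E}_{y' \sim \pi_\theta(\cdot \mid x)}\big[r_\theta(x, y')\big]$$
The important scalar is
$$\delta = r_\theta(x, y) - z_0$$
A desirable output should have positive $\delta$. An undesirable output should have negative $\delta$.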
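A small worked example may help connect the two forms of the reference point. It assumes a prompt with only four possible completions and made-up probabilities, so the expectation can be computed exactly instead of estimated:

```python
import numpy as np

# Hypothetical distributions over four completions for one prompt.
pi_theta = np.array([0.50, 0.30, 0.15, 0.05])
pi_ref = np.array([0.40, 0.30, 0.20, 0.10])

r = np.log(pi_theta) - np.log(pi_ref)   # implied reward of each completion
z0 = float(np.sum(pi_theta * r))        # expected reward under pi_theta = KL(pi_theta || pi_ref)
delta = r - z0                          # margin of each completion against the baseline

print("r:", np.round(r, 4))
print("z0:", round(z0, 4))
print("delta:", np.round(delta, 4))
```

Note that the second completion has a log-ratio of exactly zero yet lands below the baseline, because the baseline itself is positive whenever the policy has drifted from the reference.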
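In practical KTO training, $z_0$ is usually estimated from mismatched outputs in the same microbatch, clamped to be nonnegative, and treated as a stop-gradient quantity. It controls where the loss saturates; gradients are not propagated through it.
For a microbatch of size $B$ with examples indexed $i = 0, \dots, B-1$, define $\tilde{r}_i$ as the policy/reference log-ratio of a shifted output $y_{j(i)}$, with $j(i) = (i + 1) \bmod B$, under prompt $x_i$:
$$\tilde{r}_i = \log \frac{\pi_\theta(y_{j(i)} \mid x_i)}{\pi_{\mathrm{ref}}(y_{j(i)} \mid x_i)}$$
A toy version of the paper's reference estimate is then
$$\hat{z}_0 = \max\!\Big(0,\ \frac{1}{B} \sum_{i=0}^{B-1} \tilde{r}_i\Big)$$
The point is not to reuse the labeled pair $(x_i, y_i)$, because those completions were deliberately selected as good or bad and can have unrepresentative rewards.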
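The shifted pairing can be sketched with plain index arithmetic. The function name and the matrix of log-ratios below are hypothetical; the full matrix is only a teaching device, since a real trainer would score just the shifted pairs:

```python
import numpy as np

def shifted_z0_estimate(logratio):
    """Estimate z0 from mismatched pairs (x_i, y_j) with j = (i + 1) mod B.

    logratio[i, j] holds log pi_theta(y_j | x_i) - log pi_ref(y_j | x_i).
    """
    B = logratio.shape[0]
    i = np.arange(B)
    j = (i + 1) % B
    mismatched = logratio[i, j]
    # Clamp at zero because a KL divergence cannot be negative.
    z0_hat = max(0.0, float(mismatched.mean()))
    return z0_hat, mismatched

# Hypothetical log-ratios; the diagonal holds the labeled pairs (x_i, y_i).
logratio = np.array([
    [ 0.4,  0.3, -0.1,  0.0],
    [ 0.1, -0.1,  0.4,  0.2],
    [ 0.0,  0.1, -0.9,  0.2],
    [ 0.2,  0.0,  0.1,  0.5],
])

z0_hat, mismatched = shifted_z0_estimate(logratio)
print("mismatched log-ratios:", mismatched)
print("z0_hat:", round(z0_hat, 3))
```

The diagonal entries never enter the estimate, which is the point of shifting: the baseline is built from pairings that were not hand-labeled.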
Label-dependent value and loss
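KTO uses a logistic value function. For a desirable example,
$$v_D(x, y) = \lambda_D\, \sigma\big(\beta (r_\theta(x, y) - z_0)\big)$$
For an undesirable example,
$$v_U(x, y) = \lambda_U\, \sigma\big(\beta (z_0 - r_\theta(x, y))\big)$$
The default KTO loss subtracts this value from the label's class weight, where $\ell \in \{D, U\}$ is the example's label:
$$L(x, y, \ell) = \lambda_\ell - v_\ell(x, y)$$
Over the dataset $\mathcal{D}$, this is:
$$L_{\mathrm{KTO}}(\pi_\theta, \pi_{\mathrm{ref}}) = \mathbb{E}_{(x, y, \ell) \sim \mathcal{D}}\big[\lambda_\ell - v_\ell(x, y)\big]$$
So the two label cases are:
$$L_D = \lambda_D \Big(1 - \sigma\big(\beta (r_\theta(x, y) - z_0)\big)\Big)$$
and
$$L_U = \lambda_U \Big(1 - \sigma\big(\beta (z_0 - r_\theta(x, y))\big)\Big)$$
Gradient descent pushes desirable examples upward in reference-relative reward and undesirable examples downward. If $z_0$ is treated as fixed for the update, let $s = \sigma\big(\beta (r_\theta(x, y) - z_0)\big)$. Then
$$\frac{\partial L_D}{\partial r_\theta} = -\lambda_D\, \beta\, s (1 - s)$$
while
$$\frac{\partial L_U}{\partial r_\theta} = \lambda_U\, \beta\, s (1 - s)$$
Both gradients have their largest magnitude near $r_\theta(x, y) = z_0$, then saturate.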
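One way to sanity-check the derivative signs and magnitudes is a finite-difference comparison. The function names and default parameter values here are illustrative assumptions; the loss is the logistic form described above with the baseline held fixed:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def kto_loss(r, z0, desirable, beta=0.2, lambda_D=1.0, lambda_U=1.0):
    # Per-example loss: class weight minus the label-dependent logistic value.
    delta = r - z0
    if desirable:
        return lambda_D * (1.0 - sigmoid(beta * delta))
    return lambda_U * (1.0 - sigmoid(-beta * delta))

def kto_grad(r, z0, desirable, beta=0.2, lambda_D=1.0, lambda_U=1.0):
    # Analytic derivative with respect to r, treating z0 as a constant.
    s = sigmoid(beta * (r - z0))
    g = beta * s * (1.0 - s)
    return -lambda_D * g if desirable else lambda_U * g

def numeric_grad(r, z0, desirable, eps=1e-6):
    # Central finite difference in r.
    return (kto_loss(r + eps, z0, desirable)
            - kto_loss(r - eps, z0, desirable)) / (2.0 * eps)

z0 = 0.275  # hypothetical fixed baseline
for r in (-3.0, 0.3, 2.0):
    for desirable in (True, False):
        print(r, desirable,
              round(kto_grad(r, z0, desirable), 6),
              round(numeric_grad(r, z0, desirable), 6))
```

If the two columns disagree beyond rounding, either the loss or the derivative has a sign or factor error.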
Here $\beta$ is KTO's value-function saturation parameter. It has a practical anchoring effect similar to DPO's $\beta$, but in KTO it is introduced directly to control how quickly utility saturates.
The weights $\lambda_D$ and $\lambda_U$ are class/utility weights, not a universal moral setting. They are often chosen to account for the ratio of desirable to undesirable examples. Increasing $\lambda_U$ emphasizes undesirable examples; increasing $\lambda_D$ emphasizes desirable examples.
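One way to make the imbalance adjustment concrete is to pick the desirable weight so that the weighted class counts hit a chosen ratio. The helper names and the example counts are hypothetical, and the target ratio is a knob you set; the KTO paper's own recommended range should be checked at the source rather than taken from this sketch:

```python
def weighted_ratio(n_D, n_U, lambda_D, lambda_U):
    # Ratio of weighted desirable mass to weighted undesirable mass.
    return (lambda_D * n_D) / (lambda_U * n_U)

def lambda_D_for_target(n_D, n_U, lambda_U=1.0, target_ratio=1.0):
    # Solve (lambda_D * n_D) / (lambda_U * n_U) = target_ratio for lambda_D.
    return target_ratio * lambda_U * n_U / n_D

# Hypothetical dataset: four times as many undesirable examples.
n_D, n_U = 1000, 4000
lambda_U = 1.0
lambda_D = lambda_D_for_target(n_D, n_U, lambda_U, target_ratio=1.0)

print("lambda_D:", lambda_D)
print("weighted ratio:", weighted_ratio(n_D, n_U, lambda_D, lambda_U))
```

With equal weights the minority class would contribute a quarter of the weighted mass here; raising its weight restores balance without discarding data.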
How this differs from DPO
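DPO uses a pairwise margin between a preferred completion $y_w$ and a rejected completion $y_l$:
$$\beta \big[r_\theta(x, y_w) - r_\theta(x, y_l)\big]$$
KTO uses a pointwise margin:
$$\beta \big[r_\theta(x, y) - z_0\big]$$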
That is the bridge. DPO asks whether the winner beats the loser by enough. KTO asks whether a single labeled output sits on the correct side of a KL-derived baseline.
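To see the two margins side by side, the sketch below evaluates both on the same made-up log-ratios. All numbers are hypothetical; the DPO loss is written in its usual negative log-sigmoid form, and the KTO losses use unit class weights:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

beta = 0.2
r_w, r_l = 0.5, -0.9   # hypothetical log-ratios of a preferred and a rejected completion
z0 = 0.275             # hypothetical KL-derived baseline

# DPO: one margin per pair.
dpo_margin = beta * (r_w - r_l)
dpo_loss = -np.log(sigmoid(dpo_margin))

# KTO: one margin per labeled output, each measured against z0.
kto_margin_w = beta * (r_w - z0)
kto_margin_l = beta * (r_l - z0)
kto_loss_w = 1.0 - sigmoid(kto_margin_w)    # treated as desirable
kto_loss_l = 1.0 - sigmoid(-kto_margin_l)   # treated as undesirable

print("DPO margin:", round(dpo_margin, 3), "loss:", round(float(dpo_loss), 3))
print("KTO margins:", round(kto_margin_w, 3), round(kto_margin_l, 3))
print("KTO losses:", round(float(kto_loss_w), 3), round(float(kto_loss_l), 3))
```

The difference of the two KTO margins equals the DPO margin because the baseline cancels; what KTO adds is an absolute position for each output, which is why it can work without a paired partner.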
Limits of the mechanism
KTO is natural when feedback is already binary or when desirable and undesirable examples are imbalanced. If preference data is clean, consistent, and pairwise, DPO may be the better fit. If labels are noisy or intransitive, KTO's saturation can be useful because extreme examples receive small updates.
When KTO is trained from preference pairs, a common conversion is to treat the preferred response as desirable and the rejected response as undesirable. That is a simplifying assumption, not a law of the method. Naturally binary feedback is the cleaner setting for KTO.
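A minimal sketch of that conversion, assuming preference records stored as dictionaries. The field names `prompt`, `chosen`, `rejected`, `completion`, and `desirable` are hypothetical and do not refer to any particular library's schema:

```python
def pairs_to_binary(pairs):
    """Split each preference pair into two independently labeled examples."""
    examples = []
    for p in pairs:
        examples.append(
            {"prompt": p["prompt"], "completion": p["chosen"], "desirable": True}
        )
        examples.append(
            {"prompt": p["prompt"], "completion": p["rejected"], "desirable": False}
        )
    return examples

# Hypothetical preference data.
pairs = [
    {"prompt": "Explain KL divergence.", "chosen": "answer A", "rejected": "answer B"},
    {"prompt": "Summarize the memo.", "chosen": "summary C", "rejected": "summary D"},
]

binary = pairs_to_binary(pairs)
for ex in binary:
    print(ex)
```

The conversion discards the pairing itself: a rejected response may still be acceptable in absolute terms, which is one reason this is a simplifying assumption.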
The same property can backfire. A hard desirable example with very negative implied reward sits in the flat part of the logistic curve, so it may be ignored instead of learned. Lower $\beta$ and more epochs can reduce this underfitting risk, but they do not remove it.
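The effect of the saturation parameter on a hard example can be checked numerically. The margin of -20 and the two parameter values are arbitrary illustrations, and the class weight is set to one:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def grad_magnitude(delta, beta, lam=1.0):
    # Size of the loss gradient with respect to the log-ratio, baseline fixed.
    s = sigmoid(beta * delta)
    return lam * beta * s * (1.0 - s)

delta_hard = -20.0  # desirable example far below the baseline (hypothetical)
for beta in (0.5, 0.1):
    print(f"beta={beta}: hard |grad|={grad_magnitude(delta_hard, beta):.6f}, "
          f"boundary |grad|={grad_magnitude(0.0, beta):.6f}")
```

A smaller saturation parameter widens the region that still receives a usable gradient, at the cost of a weaker signal right at the boundary. That trade-off is why it reduces the underfitting risk without eliminating it.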
A natural next step is Process Reward Models: instead of labeling a whole completion as desirable or undesirable, ask what changes when feedback attaches to intermediate reasoning steps.
03
Code
Keep the implementation aligned with the notation so the algorithm is legible.
This small witness mirrors the math: compute the policy/reference log-ratio, estimate a toy KL baseline, apply the label-dependent KTO loss, and inspect the gradient direction.
```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Batch of four labeled prompt/completion examples.
# Shapes:
# logp_theta, logp_ref, desirable are all (B,)
# desirable=True means label D; False means label U.
logp_theta = np.array([-8.2, -5.1, -7.0, -6.4])
logp_ref = np.array([-8.6, -5.0, -6.1, -6.9])
desirable = np.array([True, True, False, False])
beta = 0.2
lambda_D = 1.0
lambda_U = 1.0

# KTO implied reward: policy/reference sequence log-ratio.
r = logp_theta - logp_ref

# Toy microbatch estimate of the KL reference point.
# These are log-probabilities for mismatched pairs (x_i, y_j),
# not the original labeled pairs (x_i, y_i).
mismatch_logp_theta = np.array([-6.0, -7.2, -5.8, -8.0])
mismatch_logp_ref = np.array([-6.3, -7.6, -6.0, -8.2])
mismatch_r = mismatch_logp_theta - mismatch_logp_ref
z0_hat = max(0.0, float(np.mean(mismatch_r)))

delta = r - z0_hat
value_D = lambda_D * sigmoid(beta * delta)
value_U = lambda_U * sigmoid(-beta * delta)
value = np.where(desirable, value_D, value_U)
lambda_y = np.where(desirable, lambda_D, lambda_U)
loss = lambda_y - value

# Stop-gradient z0: derivative is only through r.
s = sigmoid(beta * delta)
grad_D = -lambda_D * beta * s * (1.0 - s)
grad_U = lambda_U * beta * s * (1.0 - s)
grad_r = np.where(desirable, grad_D, grad_U)

print("r_theta:", np.round(r, 3))
print("mismatched r for z0:", np.round(mismatch_r, 3))
print("z0_hat:", round(z0_hat, 3))
print("delta = r - z0:", np.round(delta, 3))
print("per-example KTO loss:", np.round(loss, 3))
print("d loss / d r:", np.round(grad_r, 3))
print("mean loss:", round(float(loss.mean()), 3))

assert np.all(grad_r[desirable] < 0)
assert np.all(grad_r[~desirable] > 0)
```
For desirable examples, the gradient is negative, so gradient descent increases $r_\theta$. For undesirable examples, the gradient is positive, so gradient descent decreases $r_\theta$. In both cases, the magnitude is largest near $r_\theta = z_0$.
04
Interactive Demo
Use direct manipulation to connect the explanation to a moving system.
Use the demo to inspect one labeled example. Change the label, the implied reward $r_\theta$, the KL baseline $z_0$, the saturation parameter $\beta$, and the class weights. Watch the KTO loss and gradient flip direction between desirable and undesirable examples, while saturating away from the baseline.
Source Grounding
Canonical references for the mechanism on this page.
Source: Ethayarajh et al., "KTO: Model Alignment as Prospect Theoretic Optimization" (source id: ethayarajh-2024-kto). Grounds KTO as a HALO objective that learns from binary desirable/undesirable feedback instead of pairwise preferences.
Claim Review
Claim: KTO turns binary desirable/undesirable labels into a reference-relative utility loss: push a labeled output's policy/reference log-ratio above or below a KL-derived baseline, with saturating gradients.
Support: Ethayarajh et al. define KTO with binary desirable/undesirable labels, $r_\theta = \log(\pi_\theta / \pi_{\mathrm{ref}})$, $z_0$ as a KL reference, and logistic values $v_D$, $v_U$ containing $\beta$. The page's first two equations cover that objective plus $\partial L_D / \partial r_\theta < 0$ and $\partial L_U / \partial r_\theta > 0$ under stop-gradient $z_0$; the code and demo show saturation and update directions.
Scope: Covers only the KTO objective/update mechanics. It does not review claims that KTO outperforms DPO, that prospect theory is psychologically true for text, that every library implements $z_0$ this way, or that KTO prevents reward hacking or guarantees alignment.
Reviewer: codex+oracle; reviewed 2026-05-07