25 · Scaling & Alignment

KTO: Alignment from Binary Feedback via Human-Aware Losses

Canonical Papers

KTO: Model Alignment as Prospect Theoretic Optimization

Ethayarajh et al., 2024, ICML

Binary Classifier Optimization for Large Language Model Alignment

Jung et al., 2025, ACL

Noise Contrastive Alignment of Language Models with Explicit Rewards

Chen et al., 2024, arXiv

Core Mathematics

KTO aligns models using binary feedback (desirable/undesirable) instead of pairwise comparisons, framing the objective through a prospect-theoretic, human-utility lens (the paper's family of Human-Aware Losses, HALOs).

Implied reward (policy vs reference):

r_\theta(x,y) = \log \frac{\pi_\theta(y|x)}{\pi_{\text{ref}}(y|x)}
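
A minimal PyTorch sketch of this implied reward, assuming per-sequence log-probabilities are summed over completion tokens; the helper names (`sequence_logps`, `implied_reward`) and the mask convention are illustrative, not from the paper.

```python
import torch
import torch.nn.functional as F

def sequence_logps(logits: torch.Tensor, labels: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Sum log p(token) over the completion positions of each sequence.

    logits: (batch, seq_len, vocab); labels: (batch, seq_len), assumed already
    shifted to align with next-token logits; mask: 1 on completion tokens,
    0 on prompt/padding.
    """
    logps = F.log_softmax(logits, dim=-1)
    token_logps = torch.gather(logps, dim=-1, index=labels.unsqueeze(-1)).squeeze(-1)
    return (token_logps * mask).sum(dim=-1)

def implied_reward(policy_logps: torch.Tensor, ref_logps: torch.Tensor) -> torch.Tensor:
    """r_theta(x, y) = log pi_theta(y|x) - log pi_ref(y|x), per sequence."""
    return policy_logps - ref_logps
```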

Reference point (baseline) as KL:

z_0 = \text{KL}\!\left(\pi_\theta(\cdot|x) \| \pi_{\text{ref}}(\cdot|x)\right)
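
The reference point is not differentiated through in training. A common implementation choice, sketched here rather than taken from the paper's exact prescription, is to estimate z_0 per batch from mismatched prompt/completion pairs, clamp it at zero, and detach it.

```python
import torch

def estimate_z0(policy_kl_logps: torch.Tensor, ref_kl_logps: torch.Tensor) -> torch.Tensor:
    """Batch estimate of z_0 = KL(pi_theta(.|x) || pi_ref(.|x)).

    The log-probs here are computed on mismatched (prompt, completion) pairs
    within the batch, so the baseline is not tied to the labeled example itself.
    Clamped at zero (KL is non-negative) and detached: z_0 acts as a baseline,
    not a training signal.
    """
    z0 = (policy_kl_logps - ref_kl_logps).mean()
    return z0.clamp(min=0).detach()
```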

KTO loss with logistic value function:

\mathcal{L}_{\text{KTO}}(\pi_\theta,\pi_{\text{ref}}) = \mathbb{E}_{(x,y)\sim D}\big[\lambda_y - v(x,y)\big]

where

v(x,y)=\begin{cases} \lambda_D\sigma\!\big(\beta(r_\theta(x,y)-z_0)\big) & y \in \text{desirable}\\ \lambda_U\sigma\!\big(\beta(z_0-r_\theta(x,y))\big) & y \in \text{undesirable} \end{cases}
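
Putting the pieces together, a hedged PyTorch sketch of the loss above; the argument names, the boolean `desirable` mask, and the default β, λ values are illustrative rather than canonical.

```python
import torch

def kto_loss(
    policy_logps: torch.Tensor,   # (batch,) summed log pi_theta(y|x)
    ref_logps: torch.Tensor,      # (batch,) summed log pi_ref(y|x)
    desirable: torch.Tensor,      # (batch,) bool: True if y is labeled desirable
    z0: torch.Tensor,             # scalar detached KL baseline
    beta: float = 0.1,
    lambda_d: float = 1.0,
    lambda_u: float = 1.0,
) -> torch.Tensor:
    r = policy_logps - ref_logps                    # implied reward r_theta(x, y)
    v_desirable = torch.sigmoid(beta * (r - z0))    # reward above baseline -> value near 1
    v_undesirable = torch.sigmoid(beta * (z0 - r))  # reward below baseline -> value near 1
    v = torch.where(desirable, lambda_d * v_desirable, lambda_u * v_undesirable)
    lam = torch.where(desirable, torch.full_like(r, lambda_d), torch.full_like(r, lambda_u))
    # L_KTO = E[lambda_y - v(x, y)]: minimizing drives v toward its ceiling lambda_y,
    # i.e. pushes r above z0 for desirable y and below z0 for undesirable y.
    return (lam - v).mean()

# Toy call with random tensors (shapes only, not meaningful data).
loss = kto_loss(torch.randn(8), torch.randn(8), torch.rand(8) > 0.5, z0=torch.tensor(0.2))
```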

The gradient carries a $\sigma(\beta z)(1-\sigma(\beta z))$ factor, where $z = r_\theta(x,y) - z_0$, so it naturally saturates for extreme $z$: KTO focuses learning on borderline examples.
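
A tiny numeric check of that saturation, with arbitrary illustrative values of z:

```python
import torch

beta = 1.0
z = torch.tensor([-10.0, -2.0, 0.0, 2.0, 10.0])  # implied reward minus baseline
s = torch.sigmoid(beta * z)
grad_factor = s * (1 - s)  # the sigma(beta z)(1 - sigma(beta z)) factor in the gradient
print(grad_factor)  # approx 4.5e-05, 0.105, 0.25, 0.105, 4.5e-05: peaks at z = 0, dies off at the extremes
```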

Key Equation
v(x,y) = \lambda_D\sigma\!\big(\beta(r_\theta(x,y)-z_0)\big)


Why It Matters for Modern Models

  • Production feedback is binary (like/dislike, thumbs up/down), not pairwise comparisons—KTO matches real data collection at scale
  • KTO handles severe class imbalance: the λ_D/λ_U weights compensate when positives are rare, and the paper reports strong performance even under extreme imbalance (see the weighting sketch after this list)
  • Loss design is inductive bias—KTO teaches that alignment performance swings massively based on objective, not just data
  • Saturation is robustness—gradients die off for extreme examples, implicitly ignoring too-easy/too-hard/potentially mislabeled feedback
  • After DPO (#24), KTO shows you don't need pairwise preferences—binary signals + right utility shaping are sufficient
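
A minimal sketch of one way to set the λ weights under imbalance, assuming the heuristic reported in the KTO paper that λ_D·n_D / (λ_U·n_U) should sit roughly in [1, 4/3]; the helper name and the convention of fixing λ_D = 1 are illustrative choices, not the paper's recipe.

```python
def imbalance_weights(n_desirable: int, n_undesirable: int, target_ratio: float = 1.0):
    """Choose lambda_D, lambda_U so that (lambda_D * n_D) / (lambda_U * n_U) == target_ratio.

    Fixing lambda_D = 1 and solving for lambda_U is an arbitrary convention;
    only the ratio matters for balancing the two classes' total weight.
    """
    lambda_d = 1.0
    lambda_u = (lambda_d * n_desirable) / (target_ratio * n_undesirable)
    return lambda_d, lambda_u

# Example: 10k thumbs-up vs 90k thumbs-down interactions.
print(imbalance_weights(10_000, 90_000))  # (1.0, ~0.111): down-weight the abundant undesirable class
```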

Missing Intuition

What is still poorly explained in textbooks and papers:

  • Feedback form ≠ training objective—maximizing pairwise preference likelihood (DPO) is not the same as maximizing human utility (see the side-by-side sketch after this list)
  • Reference point does conceptual work—KL baseline makes "just crank up likelihood" ineffective, forcing discriminative learning
  • Saturation is feature not bug—because gradients die off when implied reward is extreme, KTO implicitly ignores noisy labels
  • Alignment is loss engineering—KTO + BCO show that objective design matters as much as data quality
  • Binary feedback is the realistic primitive—like/dislike is cheap and abundant, making KTO the practical alignment method at scale
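
To make the first point concrete, a hedged side-by-side sketch of the two objectives: DPO needs a (chosen, rejected) pair for the same prompt, while KTO scores each completion alone against the KL reference point. Here r denotes the implied log-ratio reward from the Core Mathematics section; names are illustrative.

```python
import torch
import torch.nn.functional as F

def dpo_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor, beta: float = 0.1) -> torch.Tensor:
    """Pairwise: maximize the log-sigmoid of the reward margin between a chosen
    and a rejected completion for the *same* prompt."""
    return -F.logsigmoid(beta * (r_chosen - r_rejected)).mean()

def kto_value(r: torch.Tensor, desirable: torch.Tensor, z0: torch.Tensor, beta: float = 0.1) -> torch.Tensor:
    """Pointwise: each completion is valued on its own against the KL baseline z0."""
    return torch.where(
        desirable,
        torch.sigmoid(beta * (r - z0)),   # desirable: push r above z0
        torch.sigmoid(beta * (z0 - r)),   # undesirable: push r below z0
    )
```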
