25 · Scaling & Alignment

KTO: Alignment from Binary Feedback via Human-Aware Losses

Canonical Papers

KTO: Model Alignment as Prospect Theoretic Optimization

Ethayarajh et al., 2024, ICML

Binary Classifier Optimization for Large Language Model Alignment

Jung et al., 2025, ACL

Noise Contrastive Alignment of Language Models with Explicit Rewards

Chen et al., 2024, arXiv

Core Mathematics

KTO aligns models using binary feedback (desirable/undesirable) instead of pairwise comparisons, framing the objective through a prospect-theoretic, human-utility lens (the paper's family of Human-Aware Losses, HALOs).

Implied reward (policy vs reference):

r_\theta(x,y) = \log \frac{\pi_\theta(y|x)}{\pi_{\text{ref}}(y|x)}
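
A minimal PyTorch sketch of this implied reward, assuming per-sequence log-probabilities are summed over completion tokens; the helper names (`sequence_logps`, `implied_reward`) and the mask convention are illustrative, not from the paper.

```python
import torch
import torch.nn.functional as F

def sequence_logps(logits: torch.Tensor, labels: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Sum log p(token) over the completion positions of each sequence.

    logits: (batch, seq_len, vocab); labels: (batch, seq_len), assumed already
    shifted to align with next-token logits; mask: 1 on completion tokens,
    0 on prompt/padding.
    """
    logps = F.log_softmax(logits, dim=-1)
    token_logps = torch.gather(logps, dim=-1, index=labels.unsqueeze(-1)).squeeze(-1)
    return (token_logps * mask).sum(dim=-1)

def implied_reward(policy_logps: torch.Tensor, ref_logps: torch.Tensor) -> torch.Tensor:
    """r_theta(x, y) = log pi_theta(y|x) - log pi_ref(y|x), per sequence."""
    return policy_logps - ref_logps
```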

Reference point (baseline) as KL:

z_0 = \text{KL}\!\left(\pi_\theta(\cdot|x) \| \pi_{\text{ref}}(\cdot|x)\right)
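
The reference point is not differentiated through in training. A common implementation choice, sketched here rather than taken from the paper's exact prescription, is to estimate z_0 per batch from mismatched prompt/completion pairs, clamp it at zero, and detach it.

```python
import torch

def estimate_z0(policy_kl_logps: torch.Tensor, ref_kl_logps: torch.Tensor) -> torch.Tensor:
    """Batch estimate of z_0 = KL(pi_theta(.|x) || pi_ref(.|x)).

    The log-probs here are computed on mismatched (prompt, completion) pairs
    within the batch, so the baseline is not tied to the labeled example itself.
    Clamped at zero (KL is non-negative) and detached: z_0 acts as a baseline,
    not a training signal.
    """
    z0 = (policy_kl_logps - ref_kl_logps).mean()
    return z0.clamp(min=0).detach()
```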

KTO loss with logistic value function:

\mathcal{L}_{\text{KTO}}(\pi_\theta,\pi_{\text{ref}}) = \mathbb{E}_{(x,y)\sim D}\big[\lambda_y - v(x,y)\big]

where

v(x,y)=\begin{cases} \lambda_D\sigma\!\big(\beta(r_\theta(x,y)-z_0)\big) & y \in \text{desirable}\\ \lambda_U\sigma\!\big(\beta(z_0-r_\theta(x,y))\big) & y \in \text{undesirable} \end{cases}
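
Putting the pieces together, a hedged PyTorch sketch of the loss above; the argument names, the boolean `desirable` mask, and the default β, λ values are illustrative rather than canonical.

```python
import torch

def kto_loss(
    policy_logps: torch.Tensor,   # (batch,) summed log pi_theta(y|x)
    ref_logps: torch.Tensor,      # (batch,) summed log pi_ref(y|x)
    desirable: torch.Tensor,      # (batch,) bool: True if y is labeled desirable
    z0: torch.Tensor,             # scalar detached KL baseline
    beta: float = 0.1,
    lambda_d: float = 1.0,
    lambda_u: float = 1.0,
) -> torch.Tensor:
    r = policy_logps - ref_logps                    # implied reward r_theta(x, y)
    v_desirable = torch.sigmoid(beta * (r - z0))    # reward above baseline -> value near 1
    v_undesirable = torch.sigmoid(beta * (z0 - r))  # reward below baseline -> value near 1
    v = torch.where(desirable, lambda_d * v_desirable, lambda_u * v_undesirable)
    lam = torch.where(desirable, torch.full_like(r, lambda_d), torch.full_like(r, lambda_u))
    # L_KTO = E[lambda_y - v(x, y)]: minimizing drives v toward its ceiling lambda_y,
    # i.e. pushes r above z0 for desirable y and below z0 for undesirable y.
    return (lam - v).mean()

# Toy call with random tensors (shapes only, not meaningful data).
loss = kto_loss(torch.randn(8), torch.randn(8), torch.rand(8) > 0.5, z0=torch.tensor(0.2))
```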

The gradient carries a $\sigma(\beta z)(1-\sigma(\beta z))$ factor, where $z = r_\theta(x,y) - z_0$, so it naturally saturates for extreme $z$: KTO focuses learning on borderline examples.
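
A tiny numeric check of that saturation, with arbitrary illustrative values of z:

```python
import torch

beta = 1.0
z = torch.tensor([-10.0, -2.0, 0.0, 2.0, 10.0])  # implied reward minus baseline
s = torch.sigmoid(beta * z)
grad_factor = s * (1 - s)  # the sigma(beta z)(1 - sigma(beta z)) factor in the gradient
print(grad_factor)  # approx 4.5e-05, 0.105, 0.25, 0.105, 4.5e-05: peaks at z = 0, dies off at the extremes
```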

Key Equation
v(x,y) = \lambda_D\sigma\!\big(\beta(r_\theta(x,y)-z_0)\big)


Why It Matters for Modern Models

  • Production feedback is binary (like/dislike, thumbs up/down), not pairwise comparisons—KTO matches real data collection at scale
  • KTO handles severe class imbalance: the λ_D/λ_U weights compensate when positives are rare, and the paper reports strong performance even under extreme imbalance (see the weighting sketch after this list)
  • Loss design is inductive bias—KTO teaches that alignment performance swings massively based on objective, not just data
  • Saturation is robustness—gradients die off for extreme examples, implicitly ignoring too-easy/too-hard/potentially mislabeled feedback
  • After DPO (#24), KTO shows you don't need pairwise preferences—binary signals + right utility shaping are sufficient
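
A minimal sketch of one way to set the λ weights under imbalance, assuming the heuristic reported in the KTO paper that λ_D·n_D / (λ_U·n_U) should sit roughly in [1, 4/3]; the helper name and the convention of fixing λ_D = 1 are illustrative choices, not the paper's recipe.

```python
def imbalance_weights(n_desirable: int, n_undesirable: int, target_ratio: float = 1.0):
    """Choose lambda_D, lambda_U so that (lambda_D * n_D) / (lambda_U * n_U) == target_ratio.

    Fixing lambda_D = 1 and solving for lambda_U is an arbitrary convention;
    only the ratio matters for balancing the two classes' total weight.
    """
    lambda_d = 1.0
    lambda_u = (lambda_d * n_desirable) / (target_ratio * n_undesirable)
    return lambda_d, lambda_u

# Example: 10k thumbs-up vs 90k thumbs-down interactions.
print(imbalance_weights(10_000, 90_000))  # (1.0, ~0.111): down-weight the abundant undesirable class
```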

Missing Intuition

What is still poorly explained in textbooks and papers:

  • Feedback form ≠ training objective—maximizing pairwise preference likelihood (DPO) is not the same as maximizing human utility (see the side-by-side sketch after this list)
  • Reference point does conceptual work—KL baseline makes "just crank up likelihood" ineffective, forcing discriminative learning
  • Saturation is feature not bug—because gradients die off when implied reward is extreme, KTO implicitly ignores noisy labels
  • Alignment is loss engineering—KTO + BCO show that objective design matters as much as data quality
  • Binary feedback is the realistic primitive—like/dislike is cheap and abundant, making KTO the practical alignment method at scale
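
To make the first point concrete, a hedged side-by-side sketch of the two objectives: DPO needs a (chosen, rejected) pair for the same prompt, while KTO scores each completion alone against the KL reference point. Here r denotes the implied log-ratio reward from the Core Mathematics section; names are illustrative.

```python
import torch
import torch.nn.functional as F

def dpo_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor, beta: float = 0.1) -> torch.Tensor:
    """Pairwise: maximize the log-sigmoid of the reward margin between a chosen
    and a rejected completion for the *same* prompt."""
    return -F.logsigmoid(beta * (r_chosen - r_rejected)).mean()

def kto_value(r: torch.Tensor, desirable: torch.Tensor, z0: torch.Tensor, beta: float = 0.1) -> torch.Tensor:
    """Pointwise: each completion is valued on its own against the KL baseline z0."""
    return torch.where(
        desirable,
        torch.sigmoid(beta * (r - z0)),   # desirable: push r above z0
        torch.sigmoid(beta * (z0 - r)),   # undesirable: push r below z0
    )
```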
