Domain Neighborhood

Alignment

How we shape model behavior: preference learning, reward modeling, KL-regularized fine-tuning, and the failure modes that appear when you optimize the wrong thing.

5 concepts · 5 published · 5 demos

Start with Kahneman-Tversky Optimization

Recommended Route

Start here, then follow the prerequisites forward.

This sequence is ordered for learning rather than inventory: lower difficulty, fewer prerequisites, and more central concepts come first.

01
Kahneman-Tversky Optimization
KTO turns binary desirable/undesirable labels into a reference-relative utility loss: push a labeled output's policy/reference log-ratio above or below a KL-derived baseline, with saturating gradients.
18 min · code · demo · after Direct Preference Optimization
Check Direct Preference Optimization first if the symbols feel slippery.
02
Reward Hacking: Overoptimizing Preference Proxies
When an imperfect preference proxy is optimized past its validation regime, policy mass shifts toward reward-model errors; KL penalties, reward-model ensembles, lower confidence bounds (LCBs), and monitoring slow this down but do not make the proxy true.
22 min · code · demo · after RLHF: Reward Modeling + KL-Regularized Policy Optimization
Why this follows: both pages keep the alignment thread active.
03
Direct Preference Optimization
DPO turns pairwise preferences into binary cross-entropy on reference-relative log odds, using the KL-regularized RLHF optimum to make the policy itself an implicit reward model.
18 min · code · demo · after Cross-Entropy; KL Divergence (Relative Entropy); RLHF: Reward Modeling + KL-Regularized Policy Optimization
Why this follows: both pages keep the alignment thread active.
04
RLHF: Reward Modeling + KL-Regularized Policy Optimization
RLHF trains a reward model from pairwise preferences, then reweights a reference policy toward high learned reward while a KL penalty limits distribution shift.
20 min · code · demo · after Maximum Likelihood; Cross-Entropy; KL Divergence (Relative Entropy)
Why this follows: both pages keep the alignment / preferences thread active.
05
Process Reward Models: Step-Level Verifiers for Reasoning
A process reward model scores intermediate reasoning steps instead of only terminal answers, giving denser verifier feedback for reranking and search while remaining a learned proxy.
24 min · code · demo · after RLHF: Reward Modeling + KL-Regularized Policy Optimization; Reward Hacking: Overoptimizing Preference Proxies; Cross-Entropy
Why this follows: this page builds directly on RLHF: Reward Modeling + KL-Regularized Policy Optimization.

All Published Notebooks

Browse the territory.

Kahneman-Tversky Optimization

KTO turns binary desirable/undesirable labels into a reference-relative utility loss: push a labeled output's policy/reference log-ratio above or below a KL-derived baseline, with saturating gradients.

Level 4 · 18 min · demo
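
To make the description above concrete, here is a minimal sketch of a per-example KTO-style loss in plain Python. The names `kto_loss`, `kl_baseline`, `lambda_d`, and `lambda_u`, and the default `beta=0.1`, are assumptions chosen for illustration, not taken from the notebook. `log_ratio` stands for log pi(y|x) - log pi_ref(y|x), and `kl_baseline` stands for a batch-level KL estimate that would be treated as a constant during training.

```python
import math

def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def kto_loss(log_ratio, desirable, kl_baseline, beta=0.1, lambda_d=1.0, lambda_u=1.0):
    """Per-example KTO-style loss (hypothetical helper, illustration only)."""
    # The baseline is a KL estimate, so it is clamped to be non-negative.
    z0 = max(kl_baseline, 0.0)
    if desirable:
        # Desirable outputs: push the log-ratio above the baseline.
        return lambda_d * (1.0 - sigmoid(beta * (log_ratio - z0)))
    # Undesirable outputs: push the log-ratio below the baseline.
    return lambda_u * (1.0 - sigmoid(beta * (z0 - log_ratio)))
```

Because the loss passes through a sigmoid, examples whose log-ratio is already far on the correct side of the baseline contribute almost no gradient, which is the saturating behavior the summary mentions.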

Reward Hacking: Overoptimizing Preference Proxies

When an imperfect preference proxy is optimized past its validation regime, policy mass shifts toward reward-model errors; KL penalties, reward-model ensembles, lower confidence bounds (LCBs), and monitoring slow this down but do not make the proxy true.

Level 4 · 22 min · demo
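
One of the mitigations named above, a lower confidence bound over an ensemble, can be sketched in a few lines. The function name `ensemble_lcb`, the penalty weight `k`, and the example scores are all made up for illustration; real ensembles would supply the per-member reward scores.

```python
import statistics

def ensemble_lcb(scores, k=1.0):
    """Mean ensemble reward minus k standard deviations (hypothetical helper)."""
    # Disagreement between ensemble members lowers the conservative score.
    return statistics.fmean(scores) - k * statistics.pstdev(scores)

# Made-up scores from three reward models for two candidate responses.
agreed = [0.8, 0.8, 0.8]      # members agree
disputed = [1.5, 0.2, 1.0]    # higher mean, but members disagree

# The plain mean prefers the disputed candidate; the LCB prefers the agreed one.
prefers_disputed_by_mean = statistics.fmean(disputed) > statistics.fmean(agreed)
prefers_agreed_by_lcb = ensemble_lcb(agreed) > ensemble_lcb(disputed)
```

The penalty only discounts responses where the ensemble disagrees. If every member shares the same error, the bound does not catch it, which is why such measures slow overoptimization rather than remove it.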

Direct Preference Optimization

DPO turns pairwise preferences into binary cross-entropy on reference-relative log odds, using the KL-regularized RLHF optimum to make the policy itself an implicit reward model.

Level 4 · 18 min · demo
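
As a rough illustration of the summary above, the per-pair DPO loss can be written in plain Python. The function name `dpo_loss`, the argument names, and the default `beta=0.1` are assumptions for this sketch, not taken from the notebook. Each argument is a summed sequence log-probability under either the policy or the frozen reference model.

```python
import math

def dpo_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Per-pair DPO loss from sequence log-probabilities (hypothetical helper)."""
    # Reference-relative log-ratios play the role of implicit rewards.
    chosen_ratio = pol_chosen - ref_chosen
    rejected_ratio = pol_rejected - ref_rejected
    logit = beta * (chosen_ratio - rejected_ratio)
    # Binary cross-entropy with target "chosen wins": -log(sigmoid(logit)),
    # computed as a numerically stable softplus(-logit).
    return max(-logit, 0.0) + math.log1p(math.exp(-abs(logit)))
```

When the policy equals the reference, both log-ratios are zero and the loss is log 2; it falls as the policy raises the chosen response's log-ratio relative to the rejected one.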

RLHF: Reward Modeling + KL-Regularized Policy Optimization

RLHF trains a reward model from pairwise preferences, then reweights a reference policy toward high learned reward while a KL penalty limits distribution shift.

Level 4 · 20 min · demo
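
The "reweights a reference policy" phrasing can be illustrated on a small discrete set of responses, where the KL-regularized optimum has the closed form pi*(y) proportional to pi_ref(y) * exp(r(y) / beta). The function name `kl_regularized_optimum` and the toy numbers are assumptions for this sketch; in practice the response space is far too large to enumerate and the optimum is only approximated by policy optimization.

```python
import math

def kl_regularized_optimum(ref_probs, rewards, beta):
    """Closed-form optimum over an enumerable response set (illustration only).

    Assumes every reference probability is strictly positive.
    """
    # Work in log space and subtract the max for numerical stability.
    logits = [math.log(p) + r / beta for p, r in zip(ref_probs, rewards)]
    m = max(logits)
    weights = [math.exp(l - m) for l in logits]
    total = sum(weights)
    return [w / total for w in weights]

# Toy reference policy over three responses, with made-up learned rewards.
ref = [0.5, 0.3, 0.2]
rewards = [0.0, 1.0, 2.0]
tilted = kl_regularized_optimum(ref, rewards, beta=1.0)
nearly_ref = kl_regularized_optimum(ref, rewards, beta=1e6)
```

A small `beta` lets the learned reward dominate, while a large `beta` keeps the result close to the reference distribution, which is the sense in which the KL penalty limits distribution shift.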

Process Reward Models: Step-Level Verifiers for Reasoning

A process reward model scores intermediate reasoning steps instead of only terminal answers, giving denser verifier feedback for reranking and search while remaining a learned proxy.

Level 4 · 24 min · demo
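
A minimal sketch of the reranking use mentioned above: each candidate solution carries per-step scores, which are collapsed to one number and used to sort. The helper names `solution_score` and `rerank`, the two aggregation choices, and the scores themselves are assumptions standing in for real process-reward-model outputs.

```python
import math

def solution_score(step_scores, how="min"):
    """Collapse per-step scores into one solution-level score (hypothetical helper)."""
    if how == "min":
        # A chain of reasoning is judged by its weakest step.
        return min(step_scores)
    if how == "prod":
        # Treat step scores as independent correctness probabilities.
        return math.prod(step_scores)
    raise ValueError(f"unknown aggregation: {how}")

def rerank(candidates, how="min"):
    """Sort (answer, step_scores) pairs from best to worst."""
    return sorted(candidates, key=lambda c: solution_score(c[1], how), reverse=True)

# Made-up step scores for two candidate solutions.
candidates = [
    ("answer A", [0.9, 0.2, 0.9]),  # one weak intermediate step
    ("answer B", [0.7, 0.7, 0.7]),  # uniformly moderate steps
]
best = rerank(candidates)[0][0]
```

Both aggregations penalize the single weak step in the first candidate, which an outcome-only score on the final answer would not see. The step scores are still learned estimates, so the ranking inherits their errors.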

Advanced Bridges

Use these after the core path.

Kahneman-Tversky Optimization
Reward Hacking: Overoptimizing Preference Proxies
Direct Preference Optimization
RLHF: Reward Modeling + KL-Regularized Policy Optimization
Process Reward Models: Step-Level Verifiers for Reasoning