Domain Neighborhood
Alignment
How we shape model behavior: preference learning, reward modeling, KL-regularized fine-tuning, and the failure modes that appear when you optimize the wrong thing.
Recommended Route
Start here, then follow the prerequisites forward.
This sequence is ordered for learning rather than inventory: lower difficulty, fewer prerequisites, and more central concepts come first.
- 01 Kahneman-Tversky Optimization
KTO turns binary desirable/undesirable labels into a reference-relative utility loss: push a labeled output's policy/reference log-ratio above or below a KL-derived baseline, with saturating gradients.
18 min · code · demo · after Direct Preference Optimization. Check Direct Preference Optimization first if the symbols feel slippery.
- 02 Reward Hacking: Overoptimizing Preference Proxies
When an imperfect preference proxy is optimized past its validation regime, policy mass shifts toward reward-model errors; KL penalties, reward-model ensembles, lower confidence bounds (LCBs), and monitoring slow this down but do not make the proxy true.
22 min · code · demo · after RLHF: Reward Modeling + KL-Regularized Policy Optimization. Why this follows: both pages keep the alignment thread active.
- 03 Direct Preference Optimization
DPO turns pairwise preferences into binary cross-entropy on reference-relative log odds, using the KL-regularized RLHF optimum to make the policy itself an implicit reward model.
18 min · code · demo · after Cross-Entropy, KL Divergence (Relative Entropy), RLHF: Reward Modeling + KL-Regularized Policy Optimization. Why this follows: both pages keep the alignment thread active.
- 04 RLHF: Reward Modeling + KL-Regularized Policy Optimization
RLHF trains a reward model from pairwise preferences, then reweights a reference policy toward high learned reward while a KL penalty limits distribution shift.
20 min · code · demo · after Maximum Likelihood, Cross-Entropy, KL Divergence (Relative Entropy). Why this follows: both pages keep the alignment / preferences thread active.
- 05 Process Reward Models: Step-Level Verifiers for Reasoning
A process reward model scores intermediate reasoning steps instead of only terminal answers, giving denser verifier feedback for reranking and search while remaining a learned proxy.
24 min · code · demo · after RLHF: Reward Modeling + KL-Regularized Policy Optimization, Reward Hacking: Overoptimizing Preference Proxies, Cross-Entropy. Why this follows: Process Reward Models: Step-Level Verifiers for Reasoning uses RLHF: Reward Modeling + KL-Regularized Policy Optimization directly.
All Published Notebooks
Browse the territory.
Kahneman-Tversky Optimization
KTO turns binary desirable/undesirable labels into a reference-relative utility loss: push a labeled output's policy/reference log-ratio above or below a KL-derived baseline, with saturating gradients.
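As a rough illustration of the summary above, here is a minimal per-example sketch of a KTO-style loss. The function name, the default `beta`, and the weights `lambda_d` / `lambda_u` are illustrative assumptions, and the KL-derived baseline is passed in as a plain number rather than estimated from a batch.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def kto_loss(log_ratio, kl_baseline, desirable,
             beta=0.1, lambda_d=1.0, lambda_u=1.0):
    """Per-example KTO-style loss (illustrative sketch).

    log_ratio   : log pi(y|x) - log pi_ref(y|x) for the labeled output
    kl_baseline : KL-derived reference point (assumed given here)
    desirable   : True for a desirable label, False for undesirable
    """
    if desirable:
        # Reward pushing the log-ratio above the baseline.
        return lambda_d * (1.0 - sigmoid(beta * (log_ratio - kl_baseline)))
    # Reward pushing the log-ratio below the baseline.
    return lambda_u * (1.0 - sigmoid(beta * (kl_baseline - log_ratio)))
```

Because the sigmoid flattens far from the baseline, the gradient with respect to `log_ratio` shrinks for examples that are already well above (or well below) it, which is the saturating behavior the summary mentions.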
Reward Hacking: Overoptimizing Preference Proxies
When an imperfect preference proxy is optimized past its validation regime, policy mass shifts toward reward-model errors; KL penalties, reward-model ensembles, lower confidence bounds (LCBs), and monitoring slow this down but do not make the proxy true.
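One of the mitigations named above, a lower confidence bound over an ensemble of reward models, might be sketched as follows. The function name and the width parameter `k` are assumptions for illustration, and using mean minus `k` standard deviations is only one possible choice of bound.

```python
import statistics

def ensemble_lcb(scores, k=1.0):
    """Pessimistic reward: ensemble mean minus k population std devs.

    scores : rewards assigned to one output by each ensemble member
    k      : how strongly disagreement is penalized (assumed value)
    """
    return statistics.mean(scores) - k * statistics.pstdev(scores)
```

An output on which the ensemble members disagree gets a lower score than one they agree on, even when the mean reward is the same. This discourages the policy from chasing a single model's errors, but it does not help when all members share the same error.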
Direct Preference Optimization
DPO turns pairwise preferences into binary cross-entropy on reference-relative log odds, using the KL-regularized RLHF optimum to make the policy itself an implicit reward model.
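The summary above can be made concrete with a small sketch of the DPO loss for a single preference pair. The argument names and the default `beta` are illustrative assumptions; the inputs are sequence log-probabilities of the chosen (`w`) and rejected (`l`) responses under the policy and under the frozen reference.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Binary cross-entropy on reference-relative log odds (sketch)."""
    # Implicit reward margin: beta times the difference of log-ratios.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)) written as a numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

When the policy equals the reference, the margin is zero and the loss is log 2; it falls as the policy raises the chosen response's log-ratio relative to the rejected one.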
RLHF: Reward Modeling + KL-Regularized Policy Optimization
RLHF trains a reward model from pairwise preferences, then reweights a reference policy toward high learned reward while a KL penalty limits distribution shift.
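The reweighting described above has a closed form on a small discrete set of outputs: the KL-regularized optimum is proportional to the reference probability times exp(reward / beta). The sketch below assumes a finite candidate list and a given reward for each candidate; the function name is hypothetical.

```python
import math

def kl_regularized_optimum(ref_probs, rewards, beta):
    """Return pi*(y) proportional to pi_ref(y) * exp(r(y) / beta)."""
    # Subtract the max reward before exponentiating for numerical stability;
    # the shift cancels in the normalization.
    r_max = max(rewards)
    weights = [p * math.exp((r - r_max) / beta)
               for p, r in zip(ref_probs, rewards)]
    z = sum(weights)
    return [w / z for w in weights]
```

A large `beta` keeps the result close to the reference distribution, while a small `beta` concentrates mass on the highest-reward outputs, which is where errors in the learned reward matter most.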
Process Reward Models: Step-Level Verifiers for Reasoning
A process reward model scores intermediate reasoning steps instead of only terminal answers, giving denser verifier feedback for reranking and search while remaining a learned proxy.
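Reranking with step-level scores might look like the sketch below. The scorer `toy_score_step` is a hypothetical stand-in for a learned process reward model, and aggregating with `min` (a chain is only as good as its weakest step) is one assumed choice among several; each candidate is assumed to have at least one step.

```python
def rerank_by_process_reward(candidates, score_step, aggregate=min):
    """Sort (steps, answer) candidates by aggregated step scores, best first."""
    scored = []
    for steps, answer in candidates:
        step_scores = [score_step(step) for step in steps]
        scored.append((aggregate(step_scores), answer))
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

def toy_score_step(step):
    # Hypothetical scorer: penalize steps that admit to guessing.
    return 0.1 if "guess" in step else 0.9

candidates = [
    (["add 2 and 3", "guess the rest"], "7"),
    (["add 2 and 3", "multiply by 2"], "10"),
]
ranked = rerank_by_process_reward(candidates, toy_score_step)
```

The candidate whose weakest step scores highest is ranked first. Since the scorer is itself a learned proxy, the ranking is only as reliable as that scorer.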
Advanced Bridges