Domain Neighborhood

Alignment

How we shape model behavior: preference learning, reward modeling, KL-regularized fine-tuning, and the failure modes that appear when you optimize the wrong thing.

5 concepts · 5 published · 5 demos

Recommended Route

This sequence is ordered for learning rather than inventory: lower difficulty, fewer prerequisites, and more central concepts come first.

  1. Kahneman-Tversky Optimization

    KTO turns binary desirable/undesirable labels into a reference-relative utility loss: push a labeled output's policy/reference log-ratio above or below a KL-derived baseline, with saturating gradients.

    18 min · code · demo · after Direct Preference Optimization

    Check Direct Preference Optimization first if the symbols feel slippery.

  2. Reward Hacking: Overoptimizing Preference Proxies

    When an imperfect preference proxy is optimized past its validation regime, policy mass shifts toward reward-model errors; KL, ensembles, LCBs, and monitoring slow this down but do not make the proxy true.

    22 min · code · demo · after RLHF: Reward Modeling + KL-Regularized Policy Optimization

    Why this follows: both pages keep the alignment thread active.

  3. Direct Preference Optimization

    DPO turns pairwise preferences into binary cross-entropy on reference-relative log odds, using the KL-regularized RLHF optimum to make the policy itself an implicit reward model.

    18 min · code · demo · after Cross-Entropy, KL Divergence (Relative Entropy), RLHF: Reward Modeling + KL-Regularized Policy Optimization

    Why this follows: both pages keep the alignment thread active.

  4. RLHF: Reward Modeling + KL-Regularized Policy Optimization

    RLHF trains a reward model from pairwise preferences, then reweights a reference policy toward high learned reward while a KL penalty limits distribution shift.

    20 min · code · demo · after Maximum Likelihood, Cross-Entropy, KL Divergence (Relative Entropy)

    Why this follows: both pages keep the alignment / preferences thread active.

  5. Process Reward Models: Step-Level Verifiers for Reasoning

    A process reward model scores intermediate reasoning steps instead of only terminal answers, giving denser verifier feedback for reranking and search while remaining a learned proxy.

    24 min · code · demo · after RLHF: Reward Modeling + KL-Regularized Policy Optimization, Reward Hacking: Overoptimizing Preference Proxies, Cross-Entropy

    Why this follows: this page builds directly on "RLHF: Reward Modeling + KL-Regularized Policy Optimization".
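To make the route's central formula concrete, here is a minimal sketch of the per-pair loss that the Direct Preference Optimization entry describes: binary cross-entropy on reference-relative log odds. It is an illustration, not the notebook's implementation. The function names `dpo_loss` and `softplus`, the default `beta=0.1`, and the assumption that each input is a summed sequence log-probability are all hypothetical choices made for this example.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-pair DPO loss (sketch).

    Inputs are assumed to be sequence-level log-probabilities of the
    chosen and rejected responses under the policy and the frozen
    reference model. `beta` is an assumed temperature on the log-ratio.
    """
    # Implicit reward of each response: beta * log(pi / pi_ref).
    reward_chosen = beta * (policy_logp_chosen - ref_logp_chosen)
    reward_rejected = beta * (policy_logp_rejected - ref_logp_rejected)
    # Reference-relative log odds that the chosen response is preferred.
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(margin)) equals softplus(-margin).
    return softplus(-margin)


# Policy identical to the reference: margin is 0, so the loss is log(2).
loss_at_reference = dpo_loss(-5.0, -7.0, -5.0, -7.0)

# Policy shifted toward the chosen response: the loss drops below log(2).
loss_after_shift = dpo_loss(-4.0, -8.0, -5.0, -7.0)
```

When the policy matches the reference, the loss sits at log 2 regardless of the raw log-probabilities; only movement relative to the reference changes it, which is the sense in which the policy acts as an implicit reward model.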

All Published Notebooks

Browse the territory.

Advanced Bridges

Use these after the core path.