Preference-Based Alignment: RLHF, Reward Modeling, Constitutional AI
Canonical Papers
- Deep Reinforcement Learning from Human Preferences
- Training Language Models to Follow Instructions with Human Feedback
- Constitutional AI: Harmlessness from AI Feedback
Core Mathematics
Reward modeling from preferences: Given human comparisons between outputs $y_1, y_2$ for a prompt $x$, learn a reward model $r_\theta(x, y)$ via the Bradley–Terry model:

$$P(y_1 \succ y_2 \mid x) = \sigma\big(r_\theta(x, y_1) - r_\theta(x, y_2)\big), \qquad \mathcal{L}(\theta) = -\mathbb{E}_{(x,\, y_w,\, y_l)}\big[\log \sigma\big(r_\theta(x, y_w) - r_\theta(x, y_l)\big)\big],$$

where $y_w$ is the preferred output and $y_l$ the rejected one.
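A minimal sketch of this pairwise loss, assuming a generic `reward_model` callable that maps a (prompt, response) pair to a scalar score; the names and shapes here are illustrative, not taken from the papers above:

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(reward_model, prompts, chosen, rejected):
    """Pairwise reward-model loss: -log sigma(r(x, y_w) - r(x, y_l)).

    `reward_model`, `prompts`, `chosen`, `rejected` are placeholders:
    any callable returning one scalar reward per (prompt, response) pair works.
    """
    r_chosen = reward_model(prompts, chosen)      # shape: (batch,)
    r_rejected = reward_model(prompts, rejected)  # shape: (batch,)
    # -log sigma(r_w - r_l), averaged over the batch of comparisons
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```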
RLHF objective: Fine-tune the policy $\pi_\theta$ to maximize reward while staying close to the reference model $\pi_{\mathrm{ref}}$:

$$\max_{\pi_\theta}\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\big[r_\theta(x, y)\big] \;-\; \beta\, D_{\mathrm{KL}}\big(\pi_\theta(\cdot \mid x)\,\big\|\,\pi_{\mathrm{ref}}(\cdot \mid x)\big).$$
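One common way to implement the KL term is to fold it into the reward signal that the RL algorithm (e.g., PPO) optimizes. A hedged sketch, assuming per-token log-probabilities from the policy and the frozen reference model are already computed; all names are illustrative:

```python
import torch

def kl_shaped_reward(reward, policy_logprobs, ref_logprobs, beta=0.1):
    """Combine the reward-model score with a per-sample KL penalty.

    reward:          (batch,) scalar reward-model score per sampled response
    policy_logprobs: (batch, seq_len) log pi_theta(y_t | x, y_<t)
    ref_logprobs:    (batch, seq_len) log pi_ref(y_t | x, y_<t)
    beta:            KL coefficient (strength of the stay-close constraint)
    """
    # Sample-based estimate of KL(pi_theta || pi_ref), summed over response tokens
    kl = (policy_logprobs - ref_logprobs).sum(dim=-1)   # (batch,)
    # The policy is then trained to maximize this shaped reward
    return reward - beta * kl
```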
Constitutional AI: the "labeler" is itself a model guided by a constitution (a set of natural-language principles), so preference labels come from AI feedback rather than from human annotators.
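A rough sketch of the AI-feedback labeling step, with a hypothetical `generate(prompt) -> str` helper standing in for whatever LLM call is used; the full Constitutional AI pipeline also includes a critique-and-revision supervised phase not shown here:

```python
CONSTITUTION_PRINCIPLE = (
    "Choose the response that is more helpful, honest, and harmless."
)

def ai_preference_label(generate, prompt, response_a, response_b):
    """Ask a feedback model, guided by a constitutional principle,
    which of two responses is better; returns 'A' or 'B'.

    `generate` is a hypothetical text-in/text-out LLM call.
    """
    judge_prompt = (
        f"Consider the following conversation:\n{prompt}\n\n"
        f"Response A: {response_a}\n"
        f"Response B: {response_b}\n\n"
        f"{CONSTITUTION_PRINCIPLE}\n"
        "Answer with a single letter, A or B."
    )
    verdict = generate(judge_prompt).strip().upper()
    return "A" if verdict.startswith("A") else "B"
```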
Why It Matters for Modern Models
- GPT-4, Claude 3, and Gemini rely on RLHF-style procedures to be helpful, honest, and harmless
- Constitutional AI ideas are central to the training of Anthropic's Claude models
Missing Intuition
What is still poorly explained in textbooks and papers:
- A conceptual explanation of RLHF as a KL-regularized Bayesian update on behavior (see the closed-form sketch after this list)
- How over-optimization of learned reward leads to reward hacking and distribution shift
- Interactive visualizations of policy distributions before/after RLHF
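On the first point, the KL-regularized objective above has a well-known closed-form optimum that makes the "Bayesian update" reading explicit: the reference model acts as a prior, and the exponentiated reward reweights it like a likelihood.

```latex
% Optimal policy of the KL-regularized RLHF objective:
% the reference model is the prior, exp(reward / beta) the update factor.
\pi^{*}(y \mid x)
  = \frac{1}{Z(x)}\,\pi_{\mathrm{ref}}(y \mid x)\,
    \exp\!\left(\frac{r(x, y)}{\beta}\right),
\qquad
Z(x) = \sum_{y} \pi_{\mathrm{ref}}(y \mid x)\,
       \exp\!\left(\frac{r(x, y)}{\beta}\right).
```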