Legacy Concept Lab
PPO: Proximal Policy Optimization
PPO is the policy-optimization algorithm used in the standard RLHF pipeline: understanding it explains how preference data becomes model behavior
Key equation
L^{CLIP}(\theta) = \mathbb{E}\left[ \min\left( r_t \hat{A}_t, \text{clip}(r_t, 1-\epsilon, 1+\epsilon) \hat{A}_t \right) \right]
Phase 7: Alignment & RLHF, Concept 40 of 100
Why It Matters for Modern Models
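- In the standard RLHF pipeline, PPO is the step that turns reward-model scores (learned from preference data) into updates to the model's behavior
- The clipped probability ratio acts as a practical trust region: it limits how far a single update can move the policy away from the one that collected the data, which guards against destructive updates while still allowing learning
- GAE trades off bias against variance in advantage estimation; its parameter λ (lambda) is a key hyperparameter for stable RLHF training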
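The trust-region behavior of clipping is easier to see with numbers. The following is a minimal sketch in plain Python, not a training-ready implementation; the function name `clipped_surrogate` and the default `eps=0.2` are assumptions chosen for illustration.

```python
def clipped_surrogate(ratios, advantages, eps=0.2):
    """Mean clipped surrogate objective over a batch (a quantity to maximize).

    ratios:     probability ratios pi_new(a|s) / pi_old(a|s), one per sample
    advantages: advantage estimates, one per sample
    eps:        clip range (illustrative default, not a recommendation)
    """
    total = 0.0
    for r, adv in zip(ratios, advantages):
        # Clamp the ratio into [1 - eps, 1 + eps].
        clipped_r = max(1.0 - eps, min(r, 1.0 + eps))
        # Take the more pessimistic of the unclipped and clipped terms.
        total += min(r * adv, clipped_r * adv)
    return total / len(ratios)


# Positive advantage: pushing the ratio past 1 + eps earns nothing extra.
print(clipped_surrogate([1.1], [2.0]))  # ratio inside the clip range
print(clipped_surrogate([1.5], [2.0]))  # capped at (1 + eps) * advantage
print(clipped_surrogate([3.0], [2.0]))  # same value as the line above

# Negative advantage: pushing the ratio below 1 - eps earns nothing extra,
# but a ratio that moved the wrong way (above 1) is penalized in full.
print(clipped_surrogate([0.5], [-1.0]))
print(clipped_surrogate([1.5], [-1.0]))
```

Because the objective is flat once the ratio leaves the clip range in the favorable direction, the gradient there is zero, which is what removes the incentive for large policy moves.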
What Tutorials Skip
What is still poorly explained in textbooks and papers:
- Why clipping rather than a KL penalty: the clipped objective is simpler to implement and tune than TRPO's KL-constrained update, and empirically about as effective
- The "probability ratio" view: you are reweighting old experience by how much more or less likely each action is under the current policy
- PPO failures in RLHF often trace back to advantage estimation: noise in the reward model propagates into the advantages and amplifies errors
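The "probability ratio" view above can be made concrete with a small sketch. It assumes log-probabilities of each sampled action are stored when the data is collected, which is a common convention but not something this page specifies; the function name and sample values are hypothetical.

```python
import math


def probability_ratio(logp_new, logp_old):
    """Ratio pi_new(a|s) / pi_old(a|s), computed from log-probabilities.

    Working in log space avoids dividing two very small probabilities.
    """
    return math.exp(logp_new - logp_old)


# Hypothetical logged samples: (old log-prob, new log-prob, advantage).
samples = [
    (math.log(0.20), math.log(0.30), 1.0),   # action became more likely
    (math.log(0.50), math.log(0.25), 1.0),   # action became less likely
    (math.log(0.10), math.log(0.10), -0.5),  # policy unchanged for this action
]

for logp_old, logp_new, adv in samples:
    r = probability_ratio(logp_new, logp_old)
    # Each old sample's advantage is reweighted by the ratio.
    print(f"ratio={r:.2f}  reweighted advantage={r * adv:+.2f}")
```

A ratio above 1 means the current policy favors the action more than the data-collecting policy did; a ratio below 1 means it favors it less.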
Interactive Visualization
Core Math (Optional Deep Dive)
If you want intuition first, start with the key equation and the visualization. Come back here for the full walkthrough.
Key Equation
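PPO optimizes the policy by maximizing a clipped surrogate objective:
L^{CLIP}(\theta) = \mathbb{E}\left[ \min\left( r_t \hat{A}_t, \text{clip}(r_t, 1-\epsilon, 1+\epsilon) \hat{A}_t \right) \right]
where the probability ratio is:
r_t = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}
Advantage estimation (GAE):
\hat{A}_t = \sum_{l=0}^{\infty} (\gamma\lambda)^l \, \delta_{t+l}, \qquad \delta_t = R_t + \gamma V(s_{t+1}) - V(s_t)
Here R_t is the reward at step t (written with a capital R to avoid a clash with the ratio r_t), V is the value function, \gamma is the discount factor, and \lambda sets the bias/variance trade-off.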
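Clipping removes the incentive to push the probability ratio outside the interval [1-\epsilon, 1+\epsilon], which prevents the overly large policy updates that destabilize training.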
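The GAE step named above can be sketched as a single backward pass over a trajectory. This is an illustrative sketch under stated assumptions: the function name `gae_advantages`, the default values of `gamma` and `lam`, and the convention that `values` carries one extra bootstrap entry are all choices made for this example, and the toy trajectory is made up.

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation for one trajectory.

    rewards: per-step rewards, length T
    values:  value estimates V(s_0) .. V(s_T), length T + 1
             (the last entry is the bootstrap value; use 0.0 at episode end)
    gamma:   discount factor (illustrative default)
    lam:     GAE lambda; 0 gives the one-step TD error, 1 gives the
             Monte Carlo return minus the value baseline
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    # Walk backward so each step reuses the discounted sum from the next one.
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD error
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


# Toy trajectory with a single reward at the final step, loosely mirroring
# how a sequence-level score might arrive at the end (an assumption here).
rewards = [0.0, 0.0, 1.0]
values = [0.5, 0.5, 0.5, 0.0]

print(gae_advantages(rewards, values, gamma=1.0, lam=0.0))  # one-step TD errors
print(gae_advantages(rewards, values, gamma=1.0, lam=1.0))  # return minus baseline
```

Lower values of `lam` lean on the value function (more bias, less variance), while higher values lean on observed rewards (less bias, more variance).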