Legacy Concept Lab

PPO: Proximal Policy Optimization

PPO is THE algorithm behind RLHF—understanding it explains how preference data becomes model behavior

Concept 40 of 100 · Scaling & Alignment · Phase 7: Alignment & RLHF

key equation

L^{CLIP}(\theta) = \mathbb{E}\left[ \min\left( r_t \hat{A}_t, \text{clip}(r_t, 1-\epsilon, 1+\epsilon) \hat{A}_t \right) \right]

Why It Matters for Modern Models

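PPO is the algorithm most commonly used for the RL step of RLHF: the policy is updated to maximize a reward model trained on preference data, so understanding PPO explains how preference data becomes model behavior
Clipping the probability ratio acts as a practical trust region: it prevents destructively large policy updates while still allowing learning
GAE trades off bias against variance in advantage estimation; its λ parameter is a key hyperparameter for stable RLHF training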

What is still poorly explained in textbooks and papers:

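Why clipping rather than a KL penalty or constraint: the clipped objective was simpler to implement and tune than TRPO's KL-constrained update, and empirically about as effective
The "probability ratio" view: you are reweighting old experience by how much more or less likely each action is under the current policy than under the policy that collected it (importance sampling)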
PPO failures in RLHF often trace to advantage estimation issues—reward model noise amplifies errors

If you want intuition first, start with the key equation and the visualization. Come back here for the full walkthrough.

Key Equation

PPO optimizes the policy by maximizing a clipped surrogate objective:

L^{CLIP}(\theta) = \mathbb{E}_t\left[ \min\left( r_t(\theta) \hat{A}_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) \hat{A}_t \right) \right]

where the probability ratio is:

r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{\text{old}}}(a_t|s_t)}
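The objective and ratio above can be sketched numerically. This is a minimal NumPy illustration, not a training implementation: the function name `ppo_clip_objective`, the default `eps=0.2`, and the toy batch are assumptions made for the example. The ratio is computed from log-probabilities, which is how it is usually done in practice for numerical stability.

```python
import numpy as np

def ppo_clip_objective(logp_new, logp_old, advantages, eps=0.2):
    """Clipped surrogate L^CLIP, averaged over a batch of timesteps.

    eps=0.2 is an assumed default for illustration only.
    """
    logp_new = np.asarray(logp_new, dtype=float)
    logp_old = np.asarray(logp_old, dtype=float)
    advantages = np.asarray(advantages, dtype=float)

    # r_t(theta) = pi_theta(a_t|s_t) / pi_theta_old(a_t|s_t), via log space
    ratio = np.exp(logp_new - logp_old)

    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages

    # Elementwise min, then the empirical expectation over timesteps
    return float(np.mean(np.minimum(unclipped, clipped)))

# Hypothetical toy batch: ratios are 1.5, 1.0 and 0.5
logp_old = np.log([0.5, 0.5, 0.5])
logp_new = np.log([0.75, 0.5, 0.25])
adv = [1.0, 1.0, -1.0]
value = ppo_clip_objective(logp_new, logp_old, adv)
```

In the toy batch, the first sample (ratio 1.5, positive advantage) is capped at 1.2 times its advantage, and the third (ratio 0.5, negative advantage) is evaluated at 0.8 times its advantage. In a real training loop this quantity would be negated and minimized with an autodiff framework; NumPy is used here only to show the arithmetic.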

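Advantage estimation (GAE), with discount factor \gamma and GAE parameter \lambda (here r_t inside \delta_t is the reward at step t, not the probability ratio r_t(\theta)):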

\hat{A}_t = \sum_{l=0}^{\infty} (\gamma\lambda)^l \delta_{t+l}, \quad \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)
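The GAE sum can be computed with a single backward pass using the recursion \hat{A}_t = \delta_t + \gamma\lambda \hat{A}_{t+1}. The sketch below assumes one finite trajectory, so the infinite sum is truncated at the end of the trajectory. The function name `compute_gae`, the defaults `gamma=0.99` and `lam=0.95`, and the convention that `values` carries one extra bootstrap entry are all assumptions for illustration.

```python
import numpy as np

def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one finite trajectory.

    rewards: length T, reward at each step.
    values:  length T + 1, V(s_0) ... V(s_T); the last entry is the
             bootstrap value (use 0.0 if s_T is terminal).
    """
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    T = len(rewards)
    advantages = np.zeros(T)

    running = 0.0  # holds A_{t+1} while iterating backwards
    for t in reversed(range(T)):
        # TD residual: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # A_t = delta_t + (gamma * lambda) * A_{t+1}
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

# Hypothetical 3-step trajectory with a zero value function
adv_mc = compute_gae([1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 0.0], gamma=1.0, lam=1.0)
adv_td = compute_gae([1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 0.0], gamma=1.0, lam=0.0)
```

With `lam=1.0` the estimate reduces to the full discounted return minus the value baseline (low bias, high variance); with `lam=0.0` it reduces to the one-step TD residual (higher bias, low variance).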

Clipping removes the incentive to push r_t(\theta) outside [1-\epsilon, 1+\epsilon], which prevents overly large policy updates that destabilize training.
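A short case analysis makes the effect of the clip concrete. It follows directly from the definition of L^{CLIP}, written here for the per-timestep term and split on the sign of the advantage:

\min\left( r_t(\theta) \hat{A}_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) \hat{A}_t \right) = \begin{cases} \min(r_t(\theta), 1+\epsilon) \, \hat{A}_t & \text{if } \hat{A}_t \geq 0 \\ \max(r_t(\theta), 1-\epsilon) \, \hat{A}_t & \text{if } \hat{A}_t < 0 \end{cases}

Once r_t(\theta) exceeds 1+\epsilon for a positive advantage, or falls below 1-\epsilon for a negative one, the term is constant in \theta and contributes zero gradient. Moves in the opposite direction, which make the objective worse, are not clipped, so the clipped objective is a lower bound on the unclipped one.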

Schulman et al., 2017, arXiv
