Legacy Concept Lab

PPO: Proximal Policy Optimization

PPO is the algorithm at the core of classic RLHF pipelines; understanding it explains how preference data becomes model behavior

Concept 40 of 100 · Phase 7: Alignment & RLHF · Scaling & Alignment
Key equation: $$L^{CLIP}(\theta) = \mathbb{E}\left[ \min\left( r_t \hat{A}_t, \text{clip}(r_t, 1-\epsilon, 1+\epsilon) \hat{A}_t \right) \right]$$

Why It Matters for Modern Models

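  • PPO is the optimizer in the classic RLHF pipeline: it is the step that turns reward-model scores, learned from preference data, into changes in model behavior
  • Clipping the probability ratio acts as a practical trust region: it limits how far each update can move the policy from the one that generated the data, which guards against catastrophic forgetting while still allowing learning
  • GAE trades off bias and variance in advantage estimation; its λ parameter is a key hyperparameter for stable RLHF training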

What Tutorials Skip

What is still poorly explained in textbooks and papers:

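  • Why clipping rather than a KL constraint: TRPO enforces a KL trust region with a second-order constrained update, while PPO's clipped objective needs only first-order optimization, is simpler to tune, and was empirically as effective
  • The "probability ratio" view: you are reweighting experience collected under the old policy by how much more or less likely each action is under the current policy (importance sampling)
  • PPO failures in RLHF often trace to advantage estimation: noise in the reward model feeds directly into the advantages and amplifies errors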
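
The "probability ratio" view in the list above can be written out as a short identity. This is a sketch under two assumptions not stated in the text: actions are discrete, and the old policy assigns nonzero probability to every action the new policy can take. The function f is a hypothetical stand-in for any per-action quantity, such as an advantage.

```latex
\mathbb{E}_{a \sim \pi_{\theta_{\text{old}}}(\cdot|s)}\left[ \frac{\pi_\theta(a|s)}{\pi_{\theta_{\text{old}}}(a|s)} \, f(a) \right]
= \sum_{a} \pi_{\theta_{\text{old}}}(a|s) \, \frac{\pi_\theta(a|s)}{\pi_{\theta_{\text{old}}}(a|s)} \, f(a)
= \sum_{a} \pi_\theta(a|s) \, f(a)
= \mathbb{E}_{a \sim \pi_\theta(\cdot|s)}\left[ f(a) \right]
```

Read this way, multiplying by the ratio lets samples drawn under the old policy stand in for samples from the current one. The estimate gets noisier as the two policies drift apart, which is one motivation for keeping the ratio close to 1.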

Interactive Visualization

Core Math (Optional Deep Dive)

If you want intuition first, start with the key equation and the visualization. Come back here for the full walkthrough.

Key Equation
$$L^{CLIP}(\theta) = \mathbb{E}\left[ \min\left( r_t \hat{A}_t, \text{clip}(r_t, 1-\epsilon, 1+\epsilon) \hat{A}_t \right) \right]$$

PPO updates the policy by maximizing a clipped surrogate objective:

$$L^{CLIP}(\theta) = \mathbb{E}_t\left[ \min\left( r_t(\theta) \hat{A}_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) \hat{A}_t \right) \right]$$
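
To make the objective concrete, here is a minimal pure-Python sketch that evaluates it on a batch of precomputed ratios and advantages. The function name `clipped_surrogate` and the default `epsilon=0.2` are illustrative assumptions. A real implementation would compute this on tensors and differentiate through the ratio.

```python
def clipped_surrogate(ratios, advantages, epsilon=0.2):
    """Batch mean of min(r * A, clip(r, 1 - eps, 1 + eps) * A).

    ratios:     probability ratios r_t, one per sample
    advantages: advantage estimates A_t, one per sample
    epsilon:    clip range (0.2 is an assumed default, not taken from the text)
    """
    if len(ratios) != len(advantages):
        raise ValueError("ratios and advantages must have the same length")
    if not ratios:
        raise ValueError("need at least one sample")

    total = 0.0
    for r, adv in zip(ratios, advantages):
        # clip(r, 1 - eps, 1 + eps)
        clipped_r = max(1.0 - epsilon, min(r, 1.0 + epsilon))
        # Take the more pessimistic of the unclipped and clipped terms.
        total += min(r * adv, clipped_r * adv)
    return total / len(ratios)
```

For a single sample with ratio 1.5 and advantage 1.0, the clipped term 1.2 wins. With advantage -1.0 the unclipped term -1.5 wins, because the min always keeps the lower value.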

where the probability ratio is:

$$r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{\text{old}}}(a_t|s_t)}$$
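
In practice policies usually expose log-probabilities, so the ratio is typically formed as the exponential of a difference of log-probs. The sketch below assumes that setup. The name `probability_ratio` and its arguments are hypothetical.

```python
import math


def probability_ratio(logp_new, logp_old):
    """Return pi_theta(a|s) / pi_theta_old(a|s) from the two log-probabilities.

    Working in log space avoids dividing two very small probabilities,
    which matters when the action is a long token sequence.
    """
    return math.exp(logp_new - logp_old)
```

A ratio above 1 means the current policy makes the sampled action more likely than the old policy did, and a ratio below 1 means less likely.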

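Advantage estimation uses Generalized Advantage Estimation (GAE):

$$\hat{A}_t = \sum_{l=0}^{\infty} (\gamma\lambda)^l \delta_{t+l}, \quad \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$$

Here r_t inside δ_t is the reward at step t, not the probability ratio r_t(θ) defined above; the two share a symbol by convention.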
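
The GAE sum is usually computed with a backward recursion over a finite trajectory, using A_t = δ_t + γλ A_{t+1}. The sketch below assumes a single trajectory with no mid-trajectory episode boundaries and a `values` list carrying one extra bootstrap entry for the state after the last step. The function name and the default `gamma` and `lam` values are illustrative assumptions.

```python
def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one finite trajectory.

    rewards: rewards for steps 0..T-1
    values:  value estimates V(s_0)..V(s_T); the last entry is the bootstrap
             value for the state after the final step (0.0 if terminal)
    """
    if len(values) != len(rewards) + 1:
        raise ValueError("values must have exactly one more entry than rewards")

    advantages = [0.0] * len(rewards)
    running = 0.0  # holds A_{t+1} while walking backward
    for t in reversed(range(len(rewards))):
        # TD residual: delta_t = reward_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # A_t = delta_t + (gamma * lambda) * A_{t+1}
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages
```

Setting `lam=0` reduces each advantage to the one-step TD residual (lower variance, more bias from the value function). Setting `lam=1` sums all discounted residuals to the end of the trajectory (less bias, more variance).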

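Clipping removes the incentive for overly large policy updates, which destabilize training: once r_t(θ) leaves [1 - ε, 1 + ε] in the direction that would increase the objective, the clipped term is constant and that sample contributes no gradient.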
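
The effect of clipping can be checked numerically by estimating how one sample's surrogate term changes as the ratio changes. This sketch uses a central finite difference in place of automatic differentiation. The helper names, `epsilon=0.2`, and the step size `h` are assumptions chosen for illustration.

```python
def surrogate_term(ratio, advantage, epsilon=0.2):
    """Per-sample clipped surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    return min(ratio * advantage, clipped * advantage)


def slope_wrt_ratio(ratio, advantage, epsilon=0.2, h=1e-6):
    """Central finite-difference estimate of d(term)/d(ratio)."""
    up = surrogate_term(ratio + h, advantage, epsilon)
    down = surrogate_term(ratio - h, advantage, epsilon)
    return (up - down) / (2.0 * h)
```

With a positive advantage the slope is nonzero at ratio 1.0 and zero at ratio 1.5, so there is no reward for pushing the action's probability further up. With a negative advantage at ratio 1.5 the slope is nonzero: the objective still pulls the ratio back down, because the min keeps the unclipped term whenever it is the lower one.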

Canonical Papers

Schulman et al., 2017. Proximal Policy Optimization Algorithms. arXiv.
