Legacy Concept Lab

Process Reward Models

Key to o1-style reasoning: verify each step, not just the answer


Why It Matters for Modern Models

  • Key to o1-style reasoning: verify each step, not just the answer
  • Better for math/code: catches errors before they compound
  • Enables reliable search over reasoning paths

What Tutorials Skip

What is still poorly explained in textbooks and papers:

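  • Outcome rewards are sparse (one signal per solution); process rewards are dense (one signal per step), which gives better credit assignment
  • PRMs need step-level labels, which are expensive to collect but informative
  • A wrong final answer could come from 1 wrong step or 10; a PRM tells you which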
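
To make the last point concrete, here is a minimal sketch of how per-step scores can localize an error. The function name `first_suspect_step`, the 0.5 threshold, and the example probabilities are all illustrative assumptions, not part of any particular PRM implementation.

```python
from typing import Optional, Sequence


def first_suspect_step(
    step_probs: Sequence[float], threshold: float = 0.5
) -> Optional[int]:
    """Return the 0-based index of the first step scored below `threshold`.

    Returns None when every step clears the threshold.
    """
    for index, prob in enumerate(step_probs):
        if prob < threshold:
            return index
    return None


# Hypothetical per-step scores for a 4-step solution with a wrong final answer.
# An outcome reward would only say "wrong"; the step scores point at step 2.
example_probs = [0.97, 0.94, 0.12, 0.40]
print(first_suspect_step(example_probs))
```

Stopping at the first low-scoring step is a design choice: later steps often score poorly only because they build on the earlier mistake.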

Interactive Visualization

Core Math (Optional Deep Dive)

If you want intuition first, start with the key equation and the visualization. Come back here for the full walkthrough.

Key Equation
$$R_{PRM} = \prod_k P(\text{step } k \text{ correct})$$

Outcome RM: Score final answer: $R_{ORM}(y) = P(\text{correct})$

Process RM: Score each step:

$$R_{PRM}(s_1, ..., s_K) = \prod_{k=1}^K P(\text{step } s_k \text{ correct})$$
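
The product above can be sketched in a few lines. This is an illustration under assumptions: `prm_score` is a hypothetical helper name, and the per-step probabilities are taken as given rather than produced by a real model. Summing logs instead of multiplying directly is a common way to avoid underflow on long solutions.

```python
import math
from typing import Sequence


def prm_score(step_probs: Sequence[float]) -> float:
    """Product of per-step correctness probabilities, computed in log space."""
    if any(p <= 0.0 for p in step_probs):
        # A step with zero probability of being correct zeroes the product.
        return 0.0
    return math.exp(sum(math.log(p) for p in step_probs))


# Hypothetical scores: one weak step dominates the product.
print(prm_score([0.9, 0.8, 0.5]))

# Ten steps at 0.95 each already drop the product to roughly 0.6,
# so this score penalizes longer solutions.
print(prm_score([0.95] * 10))
```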

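Best-of-N with PRM: Among N candidate solutions $y = (s_1, ..., s_K)$, pick the one whose lowest prefix score is highest, so that a single low-scoring step penalizes the whole solution:

$$y^* = \arg\max_{y} \min_{k} R_{PRM}(s_1, ..., s_k)$$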
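
A minimal sketch of this selection rule, assuming each candidate is already represented by its list of per-step probabilities (the names `min_prefix_score` and `best_of_n` and the example numbers are hypothetical). Because every factor is at most 1, the prefix product never increases with k, so the minimum over prefixes equals the score of the full solution.

```python
from itertools import accumulate
from operator import mul
from typing import Sequence


def min_prefix_score(step_probs: Sequence[float]) -> float:
    """Minimum over k of the prefix product R_PRM(s_1, ..., s_k)."""
    prefixes = list(accumulate(step_probs, mul))
    # An empty solution has no steps to fault; score it as 1.0 by convention.
    return min(prefixes) if prefixes else 1.0


def best_of_n(candidates: Sequence[Sequence[float]]) -> int:
    """Index of the candidate with the highest min-prefix score."""
    scores = [min_prefix_score(c) for c in candidates]
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical step scores for three sampled solutions.
candidates = [
    [0.9, 0.9, 0.9],     # uniformly decent steps
    [0.99, 0.2, 0.99],   # one likely-wrong step sinks this candidate
    [0.95, 0.95],        # short and confident
]
print(best_of_n(candidates))
```

On ties, `max` returns the first candidate with the top score.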

Canonical Papers

Let's Verify Step by Step

Lightman et al., 2023, arXiv
