Legacy Concept Lab

Process Reward Models

Key to o1-style reasoning: verify each step, not just the answer

Concept 81 of 100 | Scaling & Alignment | Phase 11



Phase 11: Frontier research & scaling

Why It Matters for Modern Models

What is still poorly explained in textbooks and papers:

If you want intuition first, start with the key equation and the visualization. Come back here for the full walkthrough.

Key Equation

R_{PRM} = \prod_k P(\text{step } k \text{ correct})

Outcome RM: score only the final answer $y$: $R_{ORM}(y) = P(y \text{ correct})$

Process RM: Score each step:

R_{PRM}(s_1, ..., s_K) = \prod_{k=1}^K P(\text{step } s_k \text{ correct})
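
As a minimal sketch of the product formula above, the snippet below multiplies per-step correctness probabilities into a single solution score. The function names `prm_score` and `prm_log_score` are hypothetical, and the step probabilities are made-up numbers standing in for the output of a trained PRM, not results from any real model.

```python
import math
from typing import Sequence


def prm_score(step_probs: Sequence[float]) -> float:
    """Product of per-step correctness probabilities (R_PRM above)."""
    if not step_probs:
        raise ValueError("a solution needs at least one step")
    if any(not 0.0 <= p <= 1.0 for p in step_probs):
        raise ValueError("each step probability must lie in [0, 1]")
    return math.prod(step_probs)


def prm_log_score(step_probs: Sequence[float]) -> float:
    """Sum of log-probabilities: same ranking as prm_score,
    but avoids underflow when a solution has many steps."""
    return sum(math.log(p) if p > 0 else -math.inf for p in step_probs)


# Illustrative (made-up) step probabilities for two solutions.
steady = [0.9, 0.8, 0.95]        # every step looks reasonable
one_bad_step = [0.99, 0.99, 0.1]  # two confident steps, one likely error

print(prm_score(steady))
print(prm_score(one_bad_step))
```

Because the score is a product, a single low-probability step pulls the whole solution down even when every other step is scored confidently, which is the behaviour an outcome-only score cannot express.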

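Best-of-N with PRM: sample $N$ candidate solutions and keep the one whose weakest step scores highest, so a single likely-incorrect step sinks the whole solution:

y^* = \arg\max_{y \in \{y_1, ..., y_N\}} \min_{k} P(\text{step } s_k \text{ of } y \text{ correct})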
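
The selection rule can be sketched as follows, reading it as "rank each candidate by its weakest step". The helper `best_of_n`, the candidate labels, and the step scores are all hypothetical and chosen only for illustration. The aggregator is left as a parameter so the product score from the key equation can be swapped in for comparison.

```python
import math
from typing import Callable, Dict, Sequence


def best_of_n(
    candidates: Dict[str, Sequence[float]],
    aggregate: Callable[[Sequence[float]], float] = min,
) -> str:
    """Return the label of the candidate with the highest aggregated step score.

    `candidates` maps a solution label to its per-step correctness
    probabilities. `aggregate` collapses those into one number:
    `min` (weakest step) by default, or `math.prod` for the product score.
    """
    if not candidates:
        raise ValueError("need at least one candidate solution")
    return max(candidates, key=lambda label: aggregate(candidates[label]))


# Made-up step scores for N = 3 sampled solutions.
candidates = {
    "A": [0.95, 0.9, 0.3],   # strong start, one very doubtful step
    "B": [0.8, 0.8, 0.8],    # uniformly decent
    "C": [0.99, 0.6, 0.9],   # mostly strong, one middling step
}

print(best_of_n(candidates))                       # weakest-step ranking
print(best_of_n(candidates, aggregate=math.prod))  # product ranking
```

With these numbers the two aggregators disagree: the weakest-step rule prefers B (minimum 0.8), while the product prefers C (0.5346 versus 0.512 for B). Which aggregator works better in practice is an empirical question; the sketch only shows that the choice can change the selected solution.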

Lightman et al., 2023, arXiv
