Legacy Concept Lab
Process Reward Models
Key to o1-style reasoning: verify each step, not just the answer
#81 · PRMs · Scaling & Alignment
Key equation
R_{PRM} = \prod_k P(\text{step } k \text{ correct})
Phase 11: Frontier research & scaling · Concept 81 of 100
Why It Matters for Modern Models
- Key to o1-style reasoning: verify each step, not just the answer
- Better for math/code: catches errors before they compound
- Enables reliable search over reasoning paths
What Tutorials Skip
What is still poorly explained in textbooks and papers:
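- Outcome feedback is sparse (one signal per solution); process feedback is dense (one signal per step), which gives better credit assignment
- PRMs need step-level labels, which are expensive to collect but informative
- A wrong final answer could mean one wrong step or ten; a PRM tells you which steps went wrong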
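To make the credit-assignment point concrete, here is a minimal sketch of how per-step scores localize an error that an outcome label cannot. The scores are made-up illustrative numbers, not outputs of a real PRM, and the function name `first_failing_step` and the 0.5 threshold are assumptions chosen for this example.

```python
from typing import List, Optional


def first_failing_step(step_probs: List[float], threshold: float = 0.5) -> Optional[int]:
    """Return the index of the first step scored below `threshold`, or None if all pass.

    `step_probs[k]` stands in for P(step k correct) as a PRM might report it.
    The threshold is an illustrative assumption, not a recommended value.
    """
    for k, p in enumerate(step_probs):
        if p < threshold:
            return k
    return None


# Hypothetical scores for a five-step solution whose final answer is wrong.
step_probs = [0.98, 0.95, 0.12, 0.40, 0.35]

# An outcome reward only says "wrong" for the whole solution.
outcome_label = 0

# A process reward points at where the reasoning first broke down.
print(first_failing_step(step_probs))  # 2
```

Steps after the first failure also score low here because they build on the bad step, which is why the first low-scoring index is usually the more useful signal than the lowest one.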
Interactive Visualization
Core Math (Optional Deep Dive)
If you want intuition first, start with the key equation and the visualization. Come back here for the full walkthrough.
Key Equation
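Outcome RM: score the final answer only:
R_{ORM} = P(\text{final answer correct})
Process RM: score each step and multiply the per-step probabilities:
R_{PRM} = \prod_k P(\text{step } k \text{ correct})
Best-of-N with PRM: sample N candidate solutions y_1, \dots, y_N, reject any with an incorrect step, and keep the highest-scoring one:
\hat{y} = \arg\max_{i \in \{1, \dots, N\}} R_{PRM}(y_i)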
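The product form and the best-of-N selection can be sketched in a few lines. This is an illustration under stated assumptions rather than a reference implementation: the per-step probabilities are invented, the helper names `prm_score` and `best_of_n` are hypothetical, and "any incorrect step" is approximated by a threshold `min_step_prob` that you would need to tune.

```python
import math
from typing import List, Optional, Sequence


def prm_score(step_probs: Sequence[float]) -> float:
    """R_PRM as the product of per-step correctness probabilities."""
    return math.prod(step_probs)


def best_of_n(candidates: List[Sequence[float]], min_step_prob: float = 0.5) -> Optional[int]:
    """Return the index of the best surviving candidate, or None if all are rejected.

    A candidate is rejected if any step falls below `min_step_prob`
    (an assumed stand-in for "contains an incorrect step").
    Survivors are ranked by their product score.
    """
    survivors = [
        i for i, probs in enumerate(candidates)
        if all(p >= min_step_prob for p in probs)
    ]
    if not survivors:
        return None
    return max(survivors, key=lambda i: prm_score(candidates[i]))


# Hypothetical step scores for three sampled solutions.
candidates = [
    [0.99, 0.97, 0.20, 0.99],  # one clearly bad step: rejected
    [0.90, 0.92, 0.88, 0.91],  # uniformly solid
    [0.95, 0.60, 0.93, 0.94],  # passes the threshold but has a shaky step
]

print(best_of_n(candidates))             # 1
print(round(prm_score([0.95] * 10), 3))  # 0.599
```

The last line shows why the product penalizes long chains: ten steps that are each 95% likely to be correct give a solution score of only about 0.6. For long solutions, summing log-probabilities instead of multiplying avoids numerical underflow and yields the same ranking.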