Legacy Concept Lab
Model-Graded Evaluations
Enables scalable safety testing without human bottleneck
#92 | LLM-as-Judge | Scaling & Alignment
key equation
\hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} E(x_i, y_i; R)
Phase 12: Advanced alignment & safety research
Concept 92 of 100
Why It Matters for Modern Models
- Enables scalable safety testing without human bottleneck
- Fast iteration loops for alignment research
- Powers modern benchmarks such as AlpacaEval and MT-Bench; Chatbot Arena's human votes are a common reference for validating model judges
What Tutorials Skip
What is still poorly explained in textbooks and papers:
- Rubric defines what "good" means: helpfulness, truthfulness, safety
- Calibration: does the model-graded score match human judgment?
- Position bias: judge models often favor whichever option appears first, so randomize the order
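To make the position-bias point concrete, here is a minimal sketch of one common mitigation. It goes a step beyond plain randomization by judging both presentation orders and only accepting a preference when the two verdicts agree. The `Judge` signature, `debiased_preference`, and the two toy judges are hypothetical names invented for this illustration; a real judge would call an evaluator model.

```python
import random
from typing import Callable

# Hypothetical judge signature (an assumption for this sketch):
# (prompt, first_answer, second_answer) -> "first" or "second".
Judge = Callable[[str, str, str], str]

def debiased_preference(judge: Judge, prompt: str, a: str, b: str,
                        rng: random.Random) -> str:
    """Return "a", "b", or "tie" after judging both presentation orders."""
    orders = [(a, b, ("a", "b")), (b, a, ("b", "a"))]
    rng.shuffle(orders)  # randomize which order is evaluated first
    votes = []
    for first, second, labels in orders:
        verdict = judge(prompt, first, second)
        votes.append(labels[0] if verdict == "first" else labels[1])
    # Agreement across both orders counts as a preference;
    # disagreement means the verdict followed position, not content.
    return votes[0] if votes[0] == votes[1] else "tie"

# Toy judges standing in for an evaluator model.
def always_first(prompt: str, first: str, second: str) -> str:
    return "first"  # pure position bias

def longer_wins(prompt: str, first: str, second: str) -> str:
    return "first" if len(first) >= len(second) else "second"

rng = random.Random(0)
biased_result = debiased_preference(always_first, "q", "short", "a longer answer", rng)
content_result = debiased_preference(longer_wins, "q", "short", "a longer answer", rng)
print(biased_result, content_result)  # tie b
```

The purely position-driven judge collapses to a tie, while the content-driven judge gives the same answer in both orders. The cost is two judge calls per comparison instead of one.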
Interactive Visualization
Core Math (Optional Deep Dive)
If you want intuition first, start with the key equation and the visualization. Come back here for the full walkthrough.
Key Equation
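Evaluator model $E$ scores each output $y_i$ (for input $x_i$) against rubric $R$: $E(x_i, y_i; R)$
Aggregate over $n$ examples:
\hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} E(x_i, y_i; R)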
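A minimal sketch of this aggregation, assuming a toy keyword-matching function in place of a real evaluator model. `toy_evaluator`, the rubric layout, and the sample data are all hypothetical and exist only to make the arithmetic concrete. The standard error is an addition of this sketch, not part of the key equation, included because a mean over a small sample is hard to interpret without it.

```python
from math import sqrt
from statistics import mean, stdev

def toy_evaluator(prompt: str, output: str, rubric: dict) -> float:
    """Hypothetical stand-in for E(x, y; R): fraction of rubric keywords found in the output."""
    keywords = rubric["required_keywords"]
    hits = sum(1 for k in keywords if k.lower() in output.lower())
    return hits / len(keywords)

# Made-up rubric and (input, output) pairs for illustration only.
rubric = {"required_keywords": ["source", "uncertain"]}
samples = [
    ("Is X safe?", "I am uncertain; see this source."),
    ("Is Y safe?", "Yes."),
    ("Is Z safe?", "One source says so."),
]

scores = [toy_evaluator(x, y, rubric) for x, y in samples]
mu_hat = mean(scores)                          # (1/n) * sum_i E(x_i, y_i; R)
std_err = stdev(scores) / sqrt(len(scores))    # uncertainty of the mean

print(scores, mu_hat)  # [1.0, 0.0, 0.5] 0.5
```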
Validate by correlating with human ratings. Track regressions across model versions.
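One way to sketch both checks in plain Python: a Spearman rank correlation between judge scores and human ratings, and a simple regression flag across model versions. All numbers below are made up for illustration, and the `tolerance` threshold and the helper names are assumptions of this sketch rather than anything prescribed here.

```python
from statistics import mean

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result

def pearson(xs, ys):
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def spearman(xs, ys):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))

# Illustrative data: judge scores and human ratings for the same five outputs.
judge_scores = [0.9, 0.4, 0.7, 0.2, 0.6]
human_scores = [5, 3, 4, 1, 2]
rho = spearman(judge_scores, human_scores)

# Illustrative aggregate scores per model version, in release order.
version_means = {"v1": 0.62, "v2": 0.66, "v3": 0.58}
tolerance = 0.02  # assumed noise margin before a drop counts as a regression
versions = list(version_means)
regressions = [
    (prev, cur)
    for prev, cur in zip(versions, versions[1:])
    if version_means[cur] < version_means[prev] - tolerance
]

print(round(rho, 3), regressions)  # 0.9 [('v2', 'v3')]
```

Rank correlation is used here because judge scores and human ratings often sit on different scales; it only asks whether the two orderings agree.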