AI Safety via Debate
#86 | Debate | Scaling & Alignment
key equation
$$\max_{\pi_A} \min_{\pi_B} \mathbb{E}[J(\tau)]$$
Phase 12: Advanced alignment & safety research
Concept 86 of 100
Why It Matters for Modern Models
- Targets evaluation difficulty: we can judge arguments even when we can't judge answers
- Scalable oversight: a judge weaker than the debaters can still pick the true answer
- Adversarial structure surfaces hidden flaws in reasoning
What Tutorials Skip
What is still poorly explained in textbooks and papers:
- Think Socratic dialogue meets adversarial training
- Claims + evidence + counterexample structure
- Judge accuracy improves with debate length
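The claim, evidence, and counterexample structure above can be made concrete with a small data model. What follows is a minimal sketch, not a reference implementation: the names `Turn`, `Transcript`, and `toy_judge` are hypothetical, and the judging rule (count claims that nobody rebutted, ties go to B) is an assumption chosen only to keep the example short.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Turn:
    speaker: str                          # "A" or "B"
    claim: str                            # the statement being asserted
    evidence: Optional[str] = None        # checkable support for the claim
    counterexample_to: Optional[int] = None  # index of the turn this rebuts


@dataclass
class Transcript:
    question: str
    turns: List[Turn] = field(default_factory=list)

    def add(self, turn: Turn) -> int:
        """Append a turn and return its index in the transcript."""
        self.turns.append(turn)
        return len(self.turns) - 1

    def unrebutted(self, speaker: str) -> List[int]:
        """Indices of this speaker's turns that no later turn rebuts."""
        rebutted = {
            t.counterexample_to
            for t in self.turns
            if t.counterexample_to is not None
        }
        return [
            i for i, t in enumerate(self.turns)
            if t.speaker == speaker and i not in rebutted
        ]


def toy_judge(transcript: Transcript) -> int:
    """Hypothetical J(tau): 1 if A has more unrebutted turns than B, else 0."""
    a = len(transcript.unrebutted("A"))
    b = len(transcript.unrebutted("B"))
    return 1 if a > b else 0


debate = Transcript(question="Is 91 prime?")
first = debate.add(Turn("A", "91 is prime", evidence="not divisible by 2, 3, or 5"))
debate.add(Turn("B", "91 is not prime", evidence="91 = 7 * 13", counterexample_to=first))
print(toy_judge(debate))  # prints 0: B's counterexample leaves A with no standing claim
```

The judge here never decides primality itself; it only checks which claims survived rebuttal, which is the sense in which judging arguments can be easier than judging answers.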
Interactive Visualization
Core Math (Optional Deep Dive)
If you want intuition first, start with the key equation and the visualization. Come back here for the full walkthrough.
Key Equation
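Two agents, A and B, debate, and a judge picks the winner. This is a zero-sum game:
$$\max_{\pi_A} \min_{\pi_B} \mathbb{E}[J(\tau)]$$
where $\pi_A$ and $\pi_B$ are the debaters' policies, $\tau$ is the debate transcript, and $J(\tau)$ indicates that A wins (1 if the judge picks A, 0 otherwise).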
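To see what the max-min objective computes, it can help to evaluate it on a tiny table. The sketch below assumes each debater chooses among a few pure strategies and that the expected judge verdict for every pairing is already known. The numbers in `PAYOFF` are made up for illustration, and the function name `maximin` is hypothetical. Real debate policies are stochastic and the game is far larger, so this only illustrates the order of the max and the min.

```python
# Hypothetical payoff table: rows are A's strategies, columns are B's.
# Each entry is an assumed E[J(tau)], the probability the judge picks A.
PAYOFF = [
    [0.9, 0.6, 0.7],  # A argues honestly
    [0.8, 0.2, 0.4],  # A argues a subtle falsehood
    [0.5, 0.1, 0.3],  # A argues an obvious falsehood
]


def maximin(payoff):
    """Return (best_row, value) for max over rows of min over columns."""
    # For each of A's strategies, B picks the reply that hurts A most.
    worst_cases = [min(row) for row in payoff]
    # A then picks the strategy whose worst case is best.
    best_row = max(range(len(payoff)), key=lambda i: worst_cases[i])
    return best_row, worst_cases[best_row]


best_row, value = maximin(PAYOFF)
print(best_row, value)  # prints: 0 0.6
```

With these assumed numbers, the honest row has the best worst case, so it is the max-min choice even though B can still push A's win probability down to 0.6.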
Key insight: self-play pushes agents toward truthful, checkable arguments.
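In self-play each debater maximizes its own chance of winning. Writing out both objectives shows why this is the same max-min problem. This short derivation assumes $J(\tau) \in \{0, 1\}$ and that exactly one debater wins, so B's payoff is $1 - J(\tau)$:

```latex
\begin{aligned}
\text{A's objective:} \quad & \max_{\pi_A} \mathbb{E}[J(\tau)] \\
\text{B's objective:} \quad & \max_{\pi_B} \mathbb{E}[1 - J(\tau)]
  = 1 - \min_{\pi_B} \mathbb{E}[J(\tau)] \\
\text{Combined:} \quad & \max_{\pi_A} \min_{\pi_B} \mathbb{E}[J(\tau)]
\end{aligned}
```

The two payoffs sum to 1 for every transcript, so any gain for A is an equal loss for B, which is what makes the game zero-sum.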