Legacy Concept Lab

AI Safety via Debate

Targets evaluation difficulty: we can judge arguments even when we can't judge answers

Concept 86 of 100Scaling & AlignmentPhase 12

#86DebateScaling & Alignment

key equation\max_{\pi_A} \min_{\pi_B} \mathbb{E}[J(\tau)]

Phase 12: Advanced alignment & safety researchConcept 86 of 100

Why It Matters for Modern Models

Targets evaluation difficulty: we can judge arguments even when we can't judge answers
Scalable oversight: judge weaker than debaters can still pick truth
Adversarial structure surfaces hidden flaws in reasoning

What is still poorly explained in textbooks and papers:

If you want intuition first, start with the key equation and the visualization. Come back here for the full walkthrough.

Key Equation

\max_{\pi_A} \min_{\pi_B} \mathbb{E}[J(\tau)]

Two agents debate; judge picks winner. Zero-sum game:

\max_{\pi_A} \min_{\pi_B} \mathbb{E}[J(\tau)]

where $\tau$ is the debate transcript and $J(\tau) \in \{0,1\}$ indicates A wins.

Key insight: self-play pushes agents toward truthful, checkable arguments.

Irving et al.2018arXiv

Explore this concept from different angles — like a mathematician would.