Legacy Concept Lab

AI Safety via Debate

Targets evaluation difficulty: we can judge arguments even when we can't judge answers

Concept 86 of 100 · Scaling & Alignment · Phase 12: Advanced alignment & safety research

Why It Matters for Modern Models

  • Targets evaluation difficulty: we can judge arguments even when we can't judge answers
  • Scalable oversight: a judge weaker than the debaters may still be able to identify the true answer
  • Adversarial structure surfaces hidden flaws in reasoning
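To make "judging arguments rather than answers" concrete, here is a minimal toy sketch. It is not taken from the paper. It assumes a hypothetical setup where two debaters disagree about the sum of a list and the judge can only afford to add two numbers per round and look up a single element at the end. The names `run_debate`, `honest`, and `liar` are invented for illustration.

```python
# Toy sketch (hypothetical setup, not from the paper): two debaters disagree
# about the sum of a list. The judge never sums the list; it only checks that
# each debater's sub-claims add up, then inspects ONE element at the end.
from typing import Callable, List

Claimer = Callable[[int, int], int]  # claimed sum of values[lo:hi]


def run_debate(values: List[int], claim_a: Claimer, claim_b: Claimer) -> str:
    """Return "A", "B", or "tie" using a bisection-style debate."""
    lo, hi = 0, len(values)
    if claim_a(lo, hi) == claim_b(lo, hi):
        return "tie"  # nothing to debate
    while hi - lo > 1:
        mid = (lo + hi) // 2
        # Cheap consistency check: a debater whose two halves do not add up
        # to their own claim for the whole span loses immediately.
        for name, claim in (("A", claim_a), ("B", claim_b)):
            if claim(lo, mid) + claim(mid, hi) != claim(lo, hi):
                return "B" if name == "A" else "A"
        # Follow the disagreement: it must live in one of the two halves.
        if claim_a(lo, mid) != claim_b(lo, mid):
            hi = mid
        else:
            lo = mid
    # Only now does the judge look at ground truth, and only one element.
    truth = values[lo]
    if claim_a(lo, hi) == truth:
        return "A"
    if claim_b(lo, hi) == truth:
        return "B"
    return "tie"  # both debaters are wrong about this element


values = [3, 1, 4, 1, 5, 9, 2, 6]


def honest(lo: int, hi: int) -> int:
    return sum(values[lo:hi])


def liar(lo: int, hi: int) -> int:
    # Consistently inflates any span that contains index 5.
    return sum(values[lo:hi]) + (10 if lo <= 5 < hi else 0)


print(run_debate(values, honest, liar))  # prints "A"
print(run_debate(values, liar, honest))  # prints "B"
```

In this toy the judge does a logarithmic number of cheap checks instead of summing the whole list, which is the sense in which a weak judge can still adjudicate between stronger debaters. Real debates over natural-language claims have no such clean decomposition, so treat this only as an intuition pump.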

What Tutorials Skip

What is still poorly explained in textbooks and papers:

  • Think Socratic dialogue meets adversarial training
  • Arguments follow a claim, evidence, counterexample structure
  • Judge accuracy can improve with debate length
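The claim, evidence, counterexample structure can be sketched as a small data model. This is an illustrative assumption, not a format defined in the paper: the `Turn` and `Transcript` classes and the `toy_judge` heuristic (count unrebutted claims) are hypothetical placeholders for what would in practice be a human or learned judge.

```python
# Hypothetical data model for a debate transcript; all names are illustrative.
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Turn:
    speaker: str                    # "A" or "B"
    claim: str
    evidence: Optional[str] = None
    rebuts: Optional[int] = None    # index of the earlier turn this one counters


@dataclass
class Transcript:
    question: str
    turns: List[Turn] = field(default_factory=list)

    def add(self, turn: Turn) -> int:
        """Append a turn and return its index so later turns can rebut it."""
        self.turns.append(turn)
        return len(self.turns) - 1

    def unrebutted(self, speaker: str) -> List[Turn]:
        """Turns by `speaker` that no later turn has countered."""
        rebutted = {t.rebuts for t in self.turns if t.rebuts is not None}
        return [
            t for i, t in enumerate(self.turns)
            if t.speaker == speaker and i not in rebutted
        ]


def toy_judge(tau: Transcript) -> int:
    """Placeholder J(tau): 1 if A has more unrebutted claims than B, else 0."""
    return int(len(tau.unrebutted("A")) > len(tau.unrebutted("B")))


tau = Transcript(question="Is 91 prime?")
first = tau.add(Turn("A", "91 is composite", evidence="7 * 13 = 91"))
second = tau.add(Turn("B", "91 is prime; 7 * 13 is 81", rebuts=first))
tau.add(Turn("A", "7 * 13 = 70 + 21 = 91", rebuts=second))

print(toy_judge(tau))  # prints 1: A's last claim stands unrebutted
```

Counting unrebutted claims is a deliberately crude stand-in. It rewards whoever speaks last, which is one reason real protocols need to think carefully about turn order and debate length.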

Interactive Visualization

Core Math (Optional Deep Dive)

If you want intuition first, start with the key equation and the visualization. Come back here for the full walkthrough.

Key Equation
\max_{\pi_A} \min_{\pi_B} \mathbb{E}[J(\tau)]

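Two agents, A and B, debate, and a judge picks the winner. The game is zero-sum, so A maximizes its win probability while B minimizes it:

\max_{\pi_A} \min_{\pi_B} \mathbb{E}[J(\tau)]

where \pi_A and \pi_B are the debaters' policies, \tau is the debate transcript they generate, and J(\tau) \in \{0,1\} is the judge's verdict, equal to 1 when A wins.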

Key insight: if a lie is harder to defend than to refute, self-play pushes both agents toward truthful, checkable arguments.
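The max-min objective can be evaluated by hand on a tiny game. The sketch below assumes each debater chooses one of two hypothetical pure strategies and that the expected judge verdict for each pairing is known. The strategy names and probabilities are made up for illustration, and restricting to pure strategies is a simplification (the full objective ranges over policies, which may be stochastic).

```python
# Toy illustration only: hypothetical strategies and made-up judge probabilities.
from typing import Dict, Tuple

# WIN_PROB[(a, b)] stands in for E[J(tau)] when A plays a and B plays b.
WIN_PROB: Dict[Tuple[str, str], float] = {
    ("honest", "honest_rebuttal"): 0.50,
    ("honest", "misleading_rebuttal"): 0.80,
    ("misleading", "honest_rebuttal"): 0.20,
    ("misleading", "misleading_rebuttal"): 0.50,
}


def maximin(win_prob: Dict[Tuple[str, str], float]) -> Tuple[str, float]:
    """Return A's pure strategy with the best worst-case win probability."""
    a_strategies = sorted({a for a, _ in win_prob})
    b_strategies = sorted({b for _, b in win_prob})
    best_a, best_value = "", float("-inf")
    for a in a_strategies:
        # Inner min: B picks the reply that hurts A the most.
        worst_case = min(win_prob[(a, b)] for b in b_strategies)
        # Outer max: A keeps the strategy with the best worst case.
        if worst_case > best_value:
            best_a, best_value = a, worst_case
    return best_a, best_value


best_strategy, value = maximin(WIN_PROB)
print(best_strategy, value)  # prints "honest 0.5"
```

With these invented numbers, the honest strategy has the better worst case (0.5 versus 0.2), so it is A's maximin choice. Whether real judges produce payoffs with this shape is exactly the empirical question debate research has to answer.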

Canonical Papers

AI safety via debate

Irving et al. · 2018 · arXiv
