Legacy Concept Lab

Sandwiching Evaluations

Makes scalable oversight empirically testable today

Concept 94 of 100 | Scaling & Alignment | Phase 12: Advanced alignment & safety research

Why It Matters for Modern Models

  • Makes scalable oversight empirically testable today
  • Relies on tasks where experts can judge answers reliably but non-experts struggle
  • Serves as a proxy for the future problem of overseeing models that outperform their human overseers
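
The task-selection criterion above can be sketched as a simple filter on the gap between expert and non-expert accuracy. This is a minimal illustration, not a procedure from the paper: the `TaskStats` type, the task names, the accuracy figures, and the `min_gap` threshold of 0.15 are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TaskStats:
    name: str
    expert_acc: float      # P_E: expert accuracy on the task
    non_expert_acc: float  # P_H: unaided non-expert accuracy


def select_sandwich_tasks(tasks, min_gap=0.15):
    """Keep tasks where experts beat non-experts by at least min_gap.

    min_gap is an illustrative threshold, not a published value.
    """
    return [t for t in tasks if t.expert_acc - t.non_expert_acc >= min_gap]


# Hypothetical tasks with made-up accuracies, for illustration only.
tasks = [
    TaskStats("hypothetical_medical_qa", expert_acc=0.90, non_expert_acc=0.55),
    TaskStats("hypothetical_trivia", expert_acc=0.80, non_expert_acc=0.78),
]

selected = select_sandwich_tasks(tasks)
print([t.name for t in selected])
```

A task with almost no expert/non-expert gap leaves no room for assistance to show an effect, which is why the second task is filtered out here.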

What Tutorials Skip

What is still poorly explained in textbooks and papers:

  • Bottom = unaided human, top = expert, middle = AI-assisted human
  • Tests: can weaker oversight + AI match stronger oversight?
  • Foundational benchmark for alignment research progress

Core Math (Optional Deep Dive)

If you want intuition first, start with the key equation. Come back here for the full walkthrough.

Key Equation
\[
\text{Score} = \frac{P_{H+A} - P_H}{P_E - P_H}
\]

The sandwich score measures how much of the gap between non-expert and expert performance is closed by AI assistance:

\[
\text{SandwichScore} = \frac{P_{H+A} - P_H}{P_E - P_H}
\]
  • $P_H$: non-expert performance
  • $P_{H+A}$: performance of the non-expert with AI assistance
  • $P_E$: expert performance

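A score of 1.0 means the AI-assisted non-expert matches expert performance. A score of 0 means assistance gave no improvement over the unaided non-expert. The score is undefined when $P_E = P_H$, since there is then no gap to close.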
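
As a worked example of the formula, the sketch below computes the score from three accuracy figures. The function name `sandwich_score` and the input numbers are illustrative assumptions, not values from Bowman et al.; accuracies are given in percent so the arithmetic stays exact.

```python
def sandwich_score(p_h, p_ha, p_e):
    """Fraction of the expert/non-expert gap closed by AI assistance.

    p_h  -- unaided non-expert performance (P_H)
    p_ha -- AI-assisted non-expert performance (P_{H+A})
    p_e  -- expert performance (P_E)
    """
    gap = p_e - p_h
    if gap <= 0:
        # With no positive gap the ratio is undefined or meaningless.
        raise ValueError("expert performance must exceed non-expert performance")
    return (p_ha - p_h) / gap


# Hypothetical accuracies in percent: non-expert 50, assisted 70, expert 90.
score = sandwich_score(50, 70, 90)
print(score)  # (70 - 50) / (90 - 50) = 0.5
```

Here assistance closes half of the gap. A value above 1.0 would mean the assisted non-expert outperforms the expert, and a negative value would mean assistance made the non-expert worse.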

Canonical Papers

Measuring Progress on Scalable Oversight for Large Language Models

Bowman et al., 2022, Anthropic
