Legacy Concept Lab

Automated Red Teaming

Unknown unknowns dominate safety issues

Concept 89 of 100 · Scaling & Alignment · Phase 12: Advanced alignment & safety research
Key equation: $\max_p U(p, M) \rightarrow \min_M \mathbb{E}_p[U(p, M)]$

Why It Matters for Modern Models

  • Unknown unknowns dominate safety issues
  • Automated generation covers more failure modes than hand-written test cases
  • RL-based generation finds progressively harder failures

What Tutorials Skip

What is still poorly explained in textbooks and papers:

  • A red-team model proposes attack prompts, and an evaluator scores the target model's response to each one
  • Iterate: refine the adversarial generator so that it finds progressively harder failures
  • Feed the discoveries back into filters, training data, and policy updates
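
The three steps above can be sketched as a small loop. This is a toy illustration, not the method of any particular paper: the trigger words, the vocabulary, and the functions `target_model`, `evaluator`, `propose`, `red_team`, and `to_training_data` are all hypothetical stand-ins, and a real setup would replace each with a language model or a learned classifier.

```python
import random

# Toy stand-ins (all hypothetical): real systems would use language models
# for the generator, the target, and the evaluator.
TRIGGERS = {"ignore", "override", "secret"}
VOCAB = ["please", "ignore", "summarize", "override",
         "secret", "hello", "rules", "today"]

def target_model(prompt):
    """Toy target: emits one 'LEAK' per trigger word in the prompt."""
    hits = sum(word in TRIGGERS for word in prompt.split())
    return "LEAK " * hits + "ok"

def evaluator(response):
    """Toy unsafe-behavior score in [0, 1]."""
    return min(1.0, response.count("LEAK") / 3)

def propose(seeds, rng, n=20):
    """Toy red-team generator: mutate one word of a seed prompt."""
    candidates = []
    for _ in range(n):
        words = rng.choice(seeds).split()
        words[rng.randrange(len(words))] = rng.choice(VOCAB)
        candidates.append(" ".join(words))
    return candidates

def red_team(rounds=5, keep=5, seed=0):
    rng = random.Random(seed)
    pool = ["please summarize today"]
    findings = {}        # prompt -> score, for every failure found
    best_per_round = []
    for _ in range(rounds):
        # Keep the old seeds among the candidates so the best score never drops.
        candidates = pool + propose(pool, rng)
        scored = sorted(
            ((evaluator(target_model(p)), p) for p in candidates),
            reverse=True,
        )
        # Highest-scoring prompts seed the next round of attacks.
        pool = [p for _, p in scored[:keep]]
        best_per_round.append(scored[0][0])
        findings.update({p: s for s, p in scored if s > 0})
    return findings, best_per_round

def to_training_data(findings, refusal="I can't help with that."):
    """Turn discovered failures into (prompt, desired response) pairs."""
    return [(p, refusal) for p in sorted(findings)]

findings, best_per_round = red_team()
print(best_per_round)
print(len(to_training_data(findings)), "new training examples")
```

Seeding each round with the previous round's highest-scoring prompts is the simplest form of "improve adversarial generation"; an RL-trained generator would pursue the same goal by updating its own parameters instead of a seed pool.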

Interactive Visualization

Core Math (Optional Deep Dive)

If you want intuition first, start with the key equation and the visualization. Come back here for the full walkthrough.

Key Equation
$$\max_p U(p, M) \rightarrow \min_M \mathbb{E}_p[U(p, M)]$$

Adversarial search for failure-inducing prompts:

$$\max_{p \in \mathcal{P}} U(p, M)$$

then mitigate:

$$\min_M \mathbb{E}_{p \sim \text{RedTeam}}[U(p, M)]$$

where $p$ is a prompt from the prompt space $\mathcal{P}$, $M$ is the target model, and $U$ measures unsafe behavior (toxicity, policy violation, leakage).
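
To make the two objectives concrete, here is a minimal sketch over finite sets. It assumes the prompt space is a short list, the "model" $M$ is just a blocklist of words, and $U$ is the fraction of risky words that get through. All names (`RISKY`, `U`, `worst_case`, `expected_unsafe`) are hypothetical, and real mitigation would change model weights or policies rather than choose a blocklist.

```python
from statistics import mean

RISKY = {"exploit", "bypass"}  # hypothetical words that cause unsafe output

def U(prompt, blocked):
    """Toy unsafe score: fraction of words that are risky and not blocked."""
    words = prompt.split()
    return sum(w in RISKY and w not in blocked for w in words) / len(words)

def worst_case(prompts, U, M):
    """Approximates max over p in P of U(p, M) on a finite candidate set."""
    return max(prompts, key=lambda p: U(p, M))

def expected_unsafe(prompts, U, M):
    """Sample-mean estimate of E_{p ~ RedTeam}[U(p, M)]."""
    return mean(U(p, M) for p in prompts)

# Prompts sampled from a (hypothetical) red-team distribution.
red_team_prompts = [
    "how to bypass filter",
    "exploit bypass now",
    "hello there",
]

# Candidate "models": each is a blocklist standing in for M.
M0 = frozenset()
M1 = frozenset({"bypass"})
M2 = frozenset({"bypass", "exploit"})

# Step 1: adversarial search against the unmitigated model.
hardest = worst_case(red_team_prompts, U, M0)

# Step 2: mitigation picks the model with the lowest expected unsafe score.
best_M = min([M0, M1, M2], key=lambda M: expected_unsafe(red_team_prompts, U, M))

print(hardest)
print(sorted(best_M))
```

The search step only needs the single worst prompt, while the mitigation step averages over everything the red team sampled, which is why the second objective is an expectation rather than a maximum.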

Canonical Papers

Red Teaming Language Models with Language Models

Perez et al., 2022, EMNLP
