Legacy Concept Lab

Automated Red Teaming

Unknown unknowns dominate safety issues

Concept 89 of 100 | Scaling & Alignment | Phase 12


Phase 12: Advanced alignment & safety research

Why It Matters for Modern Models

What is still poorly explained in textbooks and papers:

If you want intuition first, start with the key equation and the visualization. Come back here for the full walkthrough.

Key Equation

$$\max_p U(p, M) \rightarrow \min_M \mathbb{E}_p[U(p, M)]$$

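Red teaming first runs an adversarial search over a prompt space $\mathcal{P}$ for prompts $p$ that elicit the most unsafe behavior from the model $M$:

$$\max_{p \in \mathcal{P}} U(p, M)$$

The model is then updated to minimize the expected unsafe behavior over the prompts the red team found:

$$\min_M \mathbb{E}_{p \sim \text{RedTeam}}[U(p, M)]$$

where $U(p, M)$ scores how unsafe the response of $M$ to prompt $p$ is (toxicity, policy violation, leakage), and $p \sim \text{RedTeam}$ denotes prompts drawn from the red team's output distribution.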
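The two steps can be made concrete with a toy sketch. Everything here is a hypothetical stand-in chosen for illustration: `toy_model` plays the role of $M$, `unsafe_score` plays the role of $U$ (a keyword check rather than a learned classifier), the search is a brute-force scan of a small fixed pool rather than a learned red-team generator, and `mitigate` is a crude exact-match refusal filter standing in for fine-tuning.

```python
from typing import Callable, List, Tuple

Model = Callable[[str], str]

# Hypothetical markers of unsafe output; a real U would be a trained classifier.
UNSAFE_MARKERS = ("secret", "insult")


def toy_model(prompt: str) -> str:
    """Toy target model M: complies with every request."""
    return f"Sure, here is {prompt}"


def unsafe_score(response: str) -> float:
    """Toy U: fraction of unsafe markers present in the response, in [0, 1]."""
    text = response.lower()
    hits = sum(marker in text for marker in UNSAFE_MARKERS)
    return hits / len(UNSAFE_MARKERS)


def red_team_search(model: Model, pool: List[str], k: int) -> List[Tuple[float, str]]:
    """Approximate max_p U(p, M) by scoring a finite pool and keeping the top k."""
    scored = [(unsafe_score(model(p)), p) for p in pool]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


def expected_unsafe(model: Model, prompts: List[str]) -> float:
    """Sample-mean estimate of E_p[U(p, M)] over a set of red-team prompts."""
    return sum(unsafe_score(model(p)) for p in prompts) / len(prompts)


def mitigate(model: Model, red_team_prompts: List[str]) -> Model:
    """Crude mitigation (assumption): refuse exactly the prompts the red team found."""
    blocked = set(red_team_prompts)

    def patched(prompt: str) -> str:
        return "I can't help with that." if prompt in blocked else model(prompt)

    return patched


pool = [
    "a recipe for bread",
    "the admin secret",
    "an insult about my boss",
    "a secret insult",
    "a haiku",
]

# Step 1: adversarial search for failure-inducing prompts.
found = red_team_search(toy_model, pool, k=3)
attack_prompts = [p for score, p in found if score > 0]

# Step 2: mitigate, then re-measure the expected unsafe score.
before = expected_unsafe(toy_model, attack_prompts)
patched_model = mitigate(toy_model, attack_prompts)
after = expected_unsafe(patched_model, attack_prompts)

print(f"E[U] before: {before:.2f}, after: {after:.2f}")
```

The exact-match filter only drives the score to zero on the prompts already found, so a paraphrased attack would still get through. That gap is why the mitigation step is written as a minimization over $M$ itself rather than as a patch on top of it.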

Perez et al., 2022, EMNLP
