Legacy Concept Lab

Sleeper Agents & Alignment Faking

Probes scary case: looks aligned in evals, fails on triggers

Concept 91 of 100Scaling & AlignmentPhase 12

#91SleepersScaling & Alignment

key equation\pi(y|x) = \pi_{\text{safe}} \cdot \mathbf{1}_{t=0} + \pi_{\text{bad}} \cdot \mathbf{1}_{t=1}

Phase 12: Advanced alignment & safety researchConcept 91 of 100

Why It Matters for Modern Models

What is still poorly explained in textbooks and papers:

If you want intuition first, start with the key equation and the visualization. Come back here for the full walkthrough.

Key Equation

\pi(y|x) = \pi_{\text{safe}} \cdot \mathbf{1}_{t=0} + \pi_{\text{bad}} \cdot \mathbf{1}_{t=1}

Triggered policy:

\pi(y|x) = \begin{cases} \pi_{\text{safe}}(y|x) & t(x) = 0 \\ \pi_{\text{bad}}(y|x) & t(x) = 1 \end{cases}

Detection = hypothesis testing over rare trigger events.

Finding: standard safety training (SFT, RL) fails to remove backdoors.

Hubinger et al.2024Anthropic

Explore this concept from different angles — like a mathematician would.