Legacy Concept Lab

Capability Elicitation & ELK

Safety evals must find worst-case, not average-case capability

Concept 93 of 100Scaling & AlignmentPhase 12

#93ElicitationScaling & Alignment

key equationg_\psi(h(x)) \approx z

Phase 12: Advanced alignment & safety researchConcept 93 of 100

Why It Matters for Modern Models

What is still poorly explained in textbooks and papers:

If you want intuition first, start with the key equation and the visualization. Come back here for the full walkthrough.

Key Equation

g_\psi(h(x)) \approx z

Elicitation gap (capability as max over prompts):

\text{Cap}(M) = \max_{p \in \mathcal{P}} \mathbb{E}[\text{score}(M, p)]

ELK: extract truth from internals even when output unreliable:

g_\psi(h(x)) \approx z

where $h(x)$ = activations, $z$ = latent truth, even if output $y \ne z$ .

Christiano et al.2021Alignment Forum

Explore this concept from different angles — like a mathematician would.