Legacy Concept Lab

Capability Elicitation & ELK

Safety evals must find worst-case, not average-case capability

Concept 93 of 100 · Elicitation · Scaling & Alignment · Phase 12: Advanced alignment & safety research

Why It Matters for Modern Models

  • Safety evals must find worst-case, not average-case capability
  • ELK (Eliciting Latent Knowledge): can we trust what a model says when it could be deceptive?
  • Core theoretical obstacle to alignment

What Tutorials Skip

What is still poorly explained in textbooks and papers:

  • Scaffolding/prompting can dramatically change apparent capability
  • A model may "know" the truth internally but output something else
  • Probes on activations might extract honest beliefs

Interactive Visualization

Core Math (Optional Deep Dive)

If you want intuition first, start with the key equation and the visualization. Come back here for the full walkthrough.

Key Equation
$$g_\psi(h(x)) \approx z$$

Elicitation gap (capability as max over prompts):

$$\text{Cap}(M) = \max_{p \in \mathcal{P}} \mathbb{E}[\text{score}(M, p)]$$
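To make the max-over-prompts definition concrete, here is a minimal Python sketch. It assumes a toy setting that is not part of the original material: the three prompting strategies, their success rates, and the helper names `estimate_capability` and `toy_score` are all hypothetical stand-ins for running a real model on a real benchmark. It compares the average score over strategies with the best one, the difference being one plausible reading of the "elicitation gap".

```python
import random
from statistics import mean

# Hypothetical per-strategy success rates (illustrative numbers only,
# not measurements of any real model).
TOY_SUCCESS_RATE = {
    "zero-shot": 0.20,
    "few-shot": 0.45,
    "chain-of-thought + tools": 0.80,
}

def toy_score(prompt, rng):
    """Stand-in for score(M, p): 1.0 on success, 0.0 on failure."""
    return 1.0 if rng.random() < TOY_SUCCESS_RATE[prompt] else 0.0

def estimate_capability(score_fn, prompts, n_samples=500, seed=0):
    """Estimate E[score(M, p)] for each prompt by sampling.

    Returns (best_prompt, best_mean, mean_over_prompts).
    """
    rng = random.Random(seed)
    means = {p: mean(score_fn(p, rng) for _ in range(n_samples)) for p in prompts}
    best = max(means, key=means.get)
    return best, means[best], mean(means.values())

best_prompt, cap, avg = estimate_capability(toy_score, list(TOY_SUCCESS_RATE))
print(f"best strategy:        {best_prompt}")
print(f"estimated Cap(M):     {cap:.2f}")
print(f"average over prompts: {avg:.2f}")
print(f"elicitation gap:      {cap - avg:.2f}")
```

Two caveats about this sketch: a finite prompt set only gives a lower bound on the true maximum, and taking the maximum of noisy sample means is biased upward when the number of samples per prompt is small.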

ELK: extract the truth from the model's internals even when its output is unreliable:

$$g_\psi(h(x)) \approx z$$

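where $h(x)$ is the model's activations on input $x$, $z$ is the latent truth, and $g_\psi$ is a probe with parameters $\psi$. The approximation should hold even if the model's output $y \ne z$.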
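A small Python sketch can illustrate the probe idea. Everything in it is an assumption made for illustration: the activations are synthetic, the latent truth is encoded along a single fixed direction, the "output" flips the truth 30% of the time, and the probe is a plain logistic regression trained by gradient descent. The names `make_data` and `train_probe` are hypothetical, and this is not the method from the ARC report.

```python
import numpy as np

def make_data(n, d, rng, flip_rate=0.3):
    """Synthetic activations h, latent truth z, and unreliable output y."""
    z = rng.integers(0, 2, size=n)                # latent truth in {0, 1}
    direction = np.ones(d) / np.sqrt(d)           # assumed "truth direction"
    h = 2.0 * (2 * z - 1)[:, None] * direction + rng.normal(size=(n, d))
    flip = rng.random(n) < flip_rate              # where the output misreports
    y = np.where(flip, 1 - z, z)
    return h, z, y

def train_probe(h, z, lr=0.1, steps=500):
    """Fit a logistic-regression probe g_psi(h) = sigmoid(h @ w + b)."""
    w = np.zeros(h.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(h @ w + b)))
        grad = p - z                              # gradient of log loss w.r.t. logits
        w -= lr * (h.T @ grad) / len(z)
        b -= lr * grad.mean()
    return w, b

rng = np.random.default_rng(0)
h_train, z_train, _ = make_data(2000, 16, rng)
h_test, z_test, y_test = make_data(1000, 16, rng)

w, b = train_probe(h_train, z_train)
z_hat = (h_test @ w + b > 0).astype(int)

probe_acc = float((z_hat == z_test).mean())       # probe vs. latent truth
output_acc = float((y_test == z_test).mean())     # stated output vs. latent truth
print(f"probe accuracy:  {probe_acc:.3f}")
print(f"output accuracy: {output_acc:.3f}")
```

Note that this toy probe is trained on ground-truth labels $z$. In the actual ELK setting, reliable labels are exactly what is missing on the hard cases, so the sketch shows only what a successful probe would look like, not how to obtain one.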

Canonical Papers

ARC's First Technical Report: Eliciting Latent Knowledge

Christiano et al., 2021, Alignment Forum
