Legacy Concept Lab
Capability Elicitation & ELK
Safety evals must find worst-case, not average-case capability
#93 | Elicitation | Scaling & Alignment
key equation
g_\psi(h(x)) \approx z
Phase 12: Advanced alignment & safety research
Concept 93 of 100
Why It Matters for Modern Models
- Safety evals must find worst-case, not average-case capability
- ELK (Eliciting Latent Knowledge): can we trust what a model says when it could be deceptive?
- While unsolved, this is a core theoretical obstacle to alignment
What Tutorials Skip
What is still poorly explained in textbooks and papers:
- Scaffolding/prompting can dramatically change apparent capability
- Model may "know" truth internally but output something else
- Probes on activations might extract honest beliefs
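To make the first point concrete, here is a minimal sketch of why average-case and best-case measurements of the same model can diverge. The strategy names and success rates are made up for illustration; they are not measurements from any real model.

```python
from statistics import mean

# Hypothetical per-prompt success rates for one model on one task.
# Strategy names and numbers are illustrative assumptions, not real data.
scores_by_prompt = {
    "zero_shot": 0.12,
    "few_shot": 0.35,
    "chain_of_thought": 0.58,
    "tool_scaffold": 0.81,
}


def average_case(scores):
    """Mean score across the prompting strategies that were tried."""
    return mean(scores.values())


def elicited_capability(scores):
    """Best score across the strategies tried.

    This is only a lower bound on true capability: an untried
    strategy could still score higher.
    """
    best_prompt = max(scores, key=scores.get)
    return best_prompt, scores[best_prompt]


avg = average_case(scores_by_prompt)
best_prompt, best = elicited_capability(scores_by_prompt)
gap = best - scores_by_prompt["zero_shot"]

print(f"average over prompts: {avg:.3f}")
print(f"best prompt: {best_prompt} ({best:.2f})")
print(f"gap vs. zero-shot baseline: {gap:.2f}")
```

A safety evaluation that reported only the zero-shot score or the average would understate what this toy model can do when given a stronger scaffold.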
Interactive Visualization
Core Math (Optional Deep Dive)
If you want intuition first, start with the key equation and the visualization. Come back here for the full walkthrough.
Key Equation
Elicitation gap (capability as max over prompts):
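One plausible way to write this down, using notation assumed here rather than taken from the original: let M be the model, let p range over a set \mathcal{P} of prompts or scaffolds, let S(M, p) be the task score under p, and let p_0 be a baseline prompt.

```latex
C(M) = \max_{p \in \mathcal{P}} S(M, p),
\qquad
\mathrm{gap}(M, p_0) = C(M) - S(M, p_0) \ge 0
```

Since \mathcal{P} can only be searched partially in practice, any measured maximum is a lower bound on C(M).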
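ELK: extract truth from internals even when the output is unreliable:
g_\psi(h(x)) \approx z
where h(x) = activations, z = latent truth, and g_\psi = a probe with parameters \psi; the approximation should hold even if the model's output disagrees with z.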
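The idea of a probe on activations can be sketched with a toy example. Everything here is an illustrative assumption: the "activations" are synthetic 2-D vectors in which one dimension encodes the latent truth, the "output" is the truth flipped 40% of the time, and the probe is plain logistic regression. The sketch also trains on ground-truth labels, which is exactly what is hard to obtain in the real ELK setting.

```python
import math
import random

rng = random.Random(0)  # fixed seed so the sketch is reproducible


def sample(n, flip_prob=0.4):
    """Return n toy triples (activations h, latent truth z, model output y)."""
    data = []
    for _ in range(n):
        z = rng.randint(0, 1)                          # latent truth
        h = [(2 * z - 1) + rng.gauss(0, 0.3),          # dimension encoding z
             rng.gauss(0, 0.3)]                        # irrelevant dimension
        y = 1 - z if rng.random() < flip_prob else z   # unreliable output
        data.append((h, z, y))
    return data


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def train_probe(data, lr=0.5, steps=200):
    """Fit g(h) = sigmoid(w . h + b) to z with full-batch gradient descent."""
    w, b = [0.0, 0.0], 0.0
    n = len(data)
    for _ in range(steps):
        gw, gb = [0.0, 0.0], 0.0
        for h, z, _ in data:
            err = sigmoid(w[0] * h[0] + w[1] * h[1] + b) - z
            gw[0] += err * h[0]
            gw[1] += err * h[1]
            gb += err
        w = [w[0] - lr * gw[0] / n, w[1] - lr * gw[1] / n]
        b -= lr * gb / n
    return w, b


def probe_predict(w, b, h):
    """Hard 0/1 prediction of the latent truth from activations."""
    return 1 if w[0] * h[0] + w[1] * h[1] + b > 0 else 0


train, test = sample(200), sample(200)
w, b = train_probe(train)
probe_acc = sum(probe_predict(w, b, h) == z for h, z, _ in test) / len(test)
output_acc = sum(y == z for _, z, y in test) / len(test)

print(f"probe accuracy on latent truth:  {probe_acc:.2f}")
print(f"output accuracy on latent truth: {output_acc:.2f}")
```

In this toy setup the probe recovers z far more reliably than the output does, because the information is linearly present in h by construction. Real activations offer no such guarantee, and a probe may learn "what the model would say" rather than "what is true".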