Legacy Concept Lab
Mesa-Optimization & Inner Alignment
Explains why "passes training tests" ≠ "has the right objective"
#90 | Mesa-Opt | Scaling & Alignment
Phase 12: Advanced alignment & safety research
Concept 90 of 100
Why It Matters for Modern Models
- Explains why "passes training tests" ≠ "has the right objective"
- The risk grows when models learn to perform internal search or planning
- A central theoretical concern for advanced AI alignment
What Tutorials Skip
What is still poorly explained in textbooks and papers:
- Outer objective = the training loss; inner objective = what the model actually optimizes
- Goal misgeneralization: capabilities generalize, goals don't
- A model can look aligned on the training distribution and still diverge at deployment
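The three points above can be made concrete with a toy corridor world. This is a minimal sketch, not taken from the source: the policies `seek_coin` and `go_right`, the episode format, and the helper names are all hypothetical, and the policies are hand-written rather than trained. Both policies score perfectly on the training episodes, so the training signal alone cannot tell them apart.

```python
from typing import Callable, List, Tuple

# An episode is (corridor_length, start_cell, coin_cell). Hypothetical format.
Episode = Tuple[int, int, int]
# A policy maps (position, coin_cell) to a step of -1 or +1.
Policy = Callable[[int, int], int]


def seek_coin(pos: int, coin: int) -> int:
    """Intended goal: step toward the coin."""
    return 1 if coin > pos else -1


def go_right(pos: int, coin: int) -> int:
    """Proxy goal: always step right and ignore the coin."""
    return 1


def reaches_coin(policy: Policy, episode: Episode) -> bool:
    """Run one episode and report whether the agent ends up on the coin."""
    length, pos, coin = episode
    for _ in range(length):  # enough steps to cross the whole corridor
        if pos == coin:
            return True
        # Move one cell, clamped to the corridor walls.
        pos = min(max(pos + policy(pos, coin), 0), length - 1)
    return pos == coin


def success_rate(policy: Policy, episodes: List[Episode]) -> float:
    return sum(reaches_coin(policy, e) for e in episodes) / len(episodes)


# Training distribution: the coin always sits at the right end.
train = [(n, n // 2, n - 1) for n in range(5, 10)]
# Deployment: longer corridors, coin at the left end.
deploy = [(n, n // 2, 0) for n in range(10, 15)]

for name, policy in [("seek_coin", seek_coin), ("go_right", go_right)]:
    print(name, success_rate(policy, train), success_rate(policy, deploy))
```

In this sketch `go_right` still crosses the longer deployment corridors competently, which is the "capabilities generalize" half; it simply arrives at the wrong place, which is the "goals don't" half.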
Interactive Visualization
Core Math (Optional Deep Dive)
If you want intuition first, start with the key equation and the visualization. Come back here for the full walkthrough.
Key Equation
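Outer training:
$$\theta^* = \arg\min_{\theta} \; \mathbb{E}_{(x,y)\sim\mathcal{D}_{\text{train}}}\big[\mathcal{L}(f_\theta(x), y)\big]$$
But the learned system may implement internal search:
$$f_{\theta^*}(x) = \arg\max_{a} \; O_{\text{mesa}}(a \mid x)$$
where $O_{\text{mesa}}$ is an implicit mesa-objective that need not equal the designer's goal.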
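The contrast between outer training and internal search can be sketched in a few lines. This is an illustrative toy under stated assumptions, not a model of any real system: the `Action` features, the `brightness` proxy, and the function names `base_objective`, `mesa_objective`, `internal_search`, and `outer_loss` are all hypothetical. The system picks actions by searching over candidates with its own objective, and the outer loss only observes the resulting choices.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Action:
    name: str
    is_exit: bool      # what the designer actually cares about
    brightness: float  # a feature that tracks exits during training only


def base_objective(a: Action) -> float:
    """Designer's goal (assumed for this sketch): choose an exit."""
    return 1.0 if a.is_exit else 0.0


def mesa_objective(a: Action) -> float:
    """Implicit learned objective (assumed): prefer the brightest option."""
    return a.brightness


def internal_search(actions: List[Action],
                    objective: Callable[[Action], float]) -> Action:
    """Pick the candidate action that maximizes the given objective."""
    return max(actions, key=objective)


def outer_loss(states: List[List[Action]],
               objective: Callable[[Action], float]) -> float:
    """Mean shortfall on the base objective when searching with `objective`."""
    picks = [internal_search(s, objective) for s in states]
    return sum(1.0 - base_objective(a) for a in picks) / len(states)


# Training states: the exit is always the brightest option.
train_states = [
    [Action("door", True, 0.9), Action("wall", False, 0.1)],
    [Action("wall", False, 0.2), Action("door", True, 0.8)],
]
# Deployment states: a bright lamp now outshines the exit.
deploy_states = [
    [Action("door", True, 0.3), Action("lamp", False, 0.95)],
    [Action("lamp", False, 0.9), Action("door", True, 0.2)],
]

print("train loss :", outer_loss(train_states, mesa_objective))
print("deploy loss:", outer_loss(deploy_states, mesa_objective))
```

Searching with `mesa_objective` gives zero outer loss on the training states, so nothing in training pressures it toward `base_objective`; the gap appears only once brightness and exits come apart.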