Legacy Concept Lab

Mesa-Optimization & Inner Alignment

Explains why "passes training tests" ≠ "has right objective"

Concept 90 of 100 · Scaling & Alignment · Phase 12: Advanced alignment & safety research

Why It Matters for Modern Models

  • Explains why "passes training tests" ≠ "has right objective"
  • Risk grows when models learn internal search/planning
  • Central theoretical concern for advanced AI alignment

What Tutorials Skip

What is still poorly explained in textbooks and papers:

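  • Outer objective = the training loss the designer specifies; inner (mesa) objective = what the trained model actually optimizes
  • Goal misgeneralization: capabilities generalize to new inputs, but the learned goal does not
  • A model can look aligned on the training distribution and still diverge in deployment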
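A toy sketch of the last two points, under stated assumptions: a hypothetical 1-D corridor in which the coin always sits at the right end during training. The names `rollout`, `seek_coin`, and `go_right` are illustrative only and do not come from any library or paper. Both policies earn the same training return, so the training signal cannot tell the intended goal ("reach the coin") from the proxy goal ("go right").

```python
def rollout(policy, coin_pos, length=10):
    """Run a policy in a 1-D corridor; return 1.0 if the agent reaches the coin."""
    pos = length // 2  # agent always starts in the middle
    for _ in range(length):
        if pos == coin_pos:
            return 1.0
        step = policy(pos, coin_pos)
        pos = max(0, min(length - 1, pos + step))  # clamp to the corridor
    return 1.0 if pos == coin_pos else 0.0


def seek_coin(pos, coin_pos):
    """Intended goal: move toward wherever the coin is."""
    return 1 if coin_pos > pos else -1


def go_right(pos, coin_pos):
    """Proxy goal: ignore the coin and always move right."""
    return 1


def mean_return(policy, coin_positions):
    """Average return of a policy over a list of coin placements."""
    return sum(rollout(policy, c) for c in coin_positions) / len(coin_positions)


# Assumption: training only ever places the coin at the right end.
train_coins = [9]
# Assumption: deployment places the coin at every cell once.
deploy_coins = list(range(10))

print("train :", mean_return(seek_coin, train_coins), mean_return(go_right, train_coins))
print("deploy:", mean_return(seek_coin, deploy_coins), mean_return(go_right, deploy_coins))
```

On the training placements both policies score 1.0. On the deployment placements `seek_coin` still scores 1.0 while `go_right` drops to 0.5: the capability (moving through the corridor) carried over, but the goal did not.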

Interactive Visualization

Core Math (Optional Deep Dive)

If you want intuition first, start with the key equation and the visualization. Come back here for the full walkthrough.

Key Equation
$$f_\theta(x) = \arg\max_a m_\theta(a; x)$$

Outer training:

$$\theta^* = \arg\min_\theta \mathbb{E}_{(x,y) \sim \mathcal{D}_{\text{train}}}[\mathcal{L}(f_\theta(x), y)]$$

But the learned system may itself implement an internal search:

$$f_\theta(x) = \arg\max_{a \in \mathcal{A}} m_\theta(a; x)$$

where $m_\theta$ is an implicit mesa-objective that need not equal the designer's goal.
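The two levels of optimization can be made concrete with a minimal sketch. Everything here is an assumption chosen for illustration: a finite action set, a three-value grid of parameters, a squared-error loss, and a hypothetical mesa-objective that blends two input features. In the training data the two features are always equal, so every parameter value reaches zero training loss and the outer argmin cannot separate them.

```python
ACTIONS = range(5)          # finite action set A (assumption)
THETAS = [0.0, 0.5, 1.0]    # candidate parameters searched by outer training (assumption)


def mesa_objective(theta, a, x):
    """m_theta(a; x): hypothetical inner objective, peaked at a blend of the two features."""
    target = theta * x[0] + (1 - theta) * x[1]
    return -(a - target) ** 2


def f(theta, x):
    """f_theta(x) = argmax over a in A of m_theta(a; x): the model's internal search."""
    return max(ACTIONS, key=lambda a: mesa_objective(theta, a, x))


def train_loss(theta, data):
    """Empirical outer objective: mean squared error of f_theta on the training set."""
    return sum((f(theta, x) - y) ** 2 for x, y in data) / len(data)


# Designer's goal: output x[0]. In training, x[1] happens to equal x[0].
train_data = [((a, a), a) for a in ACTIONS]

losses = {theta: train_loss(theta, train_data) for theta in THETAS}

# Outer training: argmin over theta. All losses tie at zero, so min() returns the
# first candidate; the list order stands in for whatever inductive bias breaks the tie.
theta_star = min(THETAS, key=lambda t: losses[t])

x_deploy = (4, 0)  # off-distribution input where the two features disagree
print("training losses:", losses)
print("theta_star:", theta_star)
print("deployment outputs:", {theta: f(theta, x_deploy) for theta in THETAS})
```

Only `theta = 1.0` pursues the designer's goal on `x_deploy` (output 4). The selected `theta_star = 0.0` outputs 0 there despite a perfect training score, which is the gap between outer and mesa-objective that the equations describe.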

Canonical Papers

Risks from Learned Optimization in Advanced Machine Learning Systems

Hubinger et al., 2019, arXiv
