Legacy Concept Lab

Mesa-Optimization & Inner Alignment

Explains why "passes training tests" does not imply "has the right objective"

Concept 90 of 100 | Scaling & Alignment | Phase 12

Phase 12: Advanced alignment & safety research

Why It Matters for Modern Models

What is still poorly explained in textbooks and papers:

Outer objective = the training loss; inner (mesa) objective = what the trained model actually optimizes.
Goal misgeneralization: capabilities generalize to new inputs, but the learned goal may not.
A model can look aligned on the training distribution and still diverge at deployment.
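
A toy illustration of the three points above, as a minimal sketch rather than a real training run. Everything here is hypothetical: the corridor environment, the two hand-written policies (standing in for what a learned model might end up optimizing), and all names. In training the coin always sits at the right end, so "go right" and "go to the coin" are indistinguishable; moving the coin at deployment separates them.

```python
from typing import Callable, List, Tuple

CORRIDOR_LENGTH = 7  # cells 0..6 (hypothetical toy environment)

Policy = Callable[[int, int], int]  # (position, coin position) -> step in {-1, 0, +1}


def policy_go_right(pos: int, coin: int) -> int:
    """Proxy goal: always step right, ignoring where the coin is."""
    return 1


def policy_seek_coin(pos: int, coin: int) -> int:
    """Intended goal: step toward the coin."""
    if coin > pos:
        return 1
    if coin < pos:
        return -1
    return 0


def run_episode(policy: Policy, start: int, coin: int, max_steps: int = 20) -> bool:
    """Return True if the agent reaches the coin within max_steps."""
    pos = start
    for _ in range(max_steps):
        if pos == coin:
            return True
        # Clamp the move so the agent stays inside the corridor.
        pos = min(max(pos + policy(pos, coin), 0), CORRIDOR_LENGTH - 1)
    return pos == coin


def success_rate(policy: Policy, episodes: List[Tuple[int, int]]) -> float:
    wins = sum(run_episode(policy, start, coin) for start, coin in episodes)
    return wins / len(episodes)


# Training distribution: the coin is always at the right end.
train_episodes = [(start, CORRIDOR_LENGTH - 1) for start in range(CORRIDOR_LENGTH - 1)]
# Deployment distribution: the coin is moved to the left end.
deploy_episodes = [(start, 0) for start in range(1, CORRIDOR_LENGTH)]

for name, policy in [("go_right", policy_go_right), ("seek_coin", policy_seek_coin)]:
    print(name,
          "train:", success_rate(policy, train_episodes),
          "deploy:", success_rate(policy, deploy_episodes))
```

Both policies succeed on every training episode, so a training-time test cannot tell them apart. At deployment the "go right" policy still moves competently down the corridor, but toward the wrong end, which is the sense in which the capability carries over while the goal does not.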

If you want intuition first, start with the key equation and the visualization. Come back here for the full walkthrough.

Key Equation

f_\theta(x) = \arg\max_a m_\theta(a; x)

Outer training:

\theta^* = \arg\min_\theta \mathbb{E}_{(x,y) \sim \mathcal{D}_{\text{train}}}[\mathcal{L}(f_\theta(x), y)]

But the learned system may itself implement an internal search:

f_\theta(x) = \arg\max_{a \in \mathcal{A}} m_\theta(a; x)

where $\mathcal{A}$ is the set of candidate actions and $m_\theta$ is an implicit mesa-objective that need not match the designer's goal.

Hubinger et al., 2019, arXiv
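
The two equations can be put side by side in a small sketch: an inner argmax over a mesa-objective and an outer argmin over candidates by training loss. This is an assumed toy setup, not a method from the cited paper. The inputs `(color, shape)`, the two candidate objectives `m_shape` and `m_color`, and the 0-1 loss are all hypothetical choices made for illustration. It shows that the outer argmin can be tied between an intended objective and a proxy when the two agree on every training input.

```python
from typing import Callable, Dict, List, Tuple

Input = Tuple[str, str]                         # x = (color, shape)
Example = Tuple[Input, str]                     # (x, y) with label y
MesaObjective = Callable[[str, Input], float]   # m_theta(a; x)

ACTIONS: List[str] = ["A", "B"]                 # the action set

# Hypothetical candidate mesa-objectives; the dict key plays the role of theta.


def m_shape(a: str, x: Input) -> float:
    """Intended objective: score the label that matches the shape."""
    return 1.0 if a == {"circle": "A", "square": "B"}[x[1]] else 0.0


def m_color(a: str, x: Input) -> float:
    """Proxy objective: score the label that matches the color."""
    return 1.0 if a == {"red": "A", "blue": "B"}[x[0]] else 0.0


CANDIDATES: Dict[str, MesaObjective] = {"shape": m_shape, "color": m_color}


def f(m: MesaObjective, x: Input) -> str:
    """Inner search: f(x) = argmax over actions a of m(a; x)."""
    return max(ACTIONS, key=lambda a: m(a, x))


def outer_loss(m: MesaObjective, data: List[Example]) -> float:
    """Empirical 0-1 loss of the induced policy f on a dataset."""
    return sum(f(m, x) != y for x, y in data) / len(data)


# In training, color and shape are perfectly correlated.
train: List[Example] = [(("red", "circle"), "A"), (("blue", "square"), "B")]
# At deployment the correlation is broken; labels still follow shape.
deploy: List[Example] = [(("red", "square"), "B"), (("blue", "circle"), "A")]

train_loss = {name: outer_loss(m, train) for name, m in CANDIDATES.items()}
deploy_loss = {name: outer_loss(m, deploy) for name, m in CANDIDATES.items()}

# Outer training: every candidate that attains the minimum training loss.
best = min(train_loss.values())
minimizers = [name for name, loss in train_loss.items() if loss == best]

print("train loss:", train_loss)
print("deploy loss:", deploy_loss)
print("training-loss minimizers:", minimizers)
```

Both candidates reach zero training loss, so the outer objective alone does not determine which one is selected. On the shifted data the shape objective keeps zero loss while the color objective gets every example wrong.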

Explore this concept from different angles — like a mathematician would.