Legacy Concept Lab

Deliberative Alignment

Trains models on explicit specifications rather than implicit reward shaping

Concept 85 of 100 | Scaling & Alignment | Phase 12



Phase 12: Advanced alignment & safety research

Why It Matters for Modern Models

What is still poorly explained in textbooks and papers:

If you want intuition first, start with the key equation and the visualization. Come back here for the full walkthrough.

Key Equation

\max_\pi \mathbb{E}[r_{\text{help}}] + \lambda \mathbb{E}[v_S]

Train the model to reason explicitly over the safety specifications $S$. The objective can be written in two forms.

Constrained optimization view:

\max_\pi \mathbb{E}[r_{\text{help}}(x,y)] \quad \text{s.t.} \quad \mathbb{E}[v_S(x,y)] \ge \tau
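To make the constraint concrete, here is a minimal sketch that evaluates both expectations for a discrete policy over three candidate responses to a single prompt. Everything in it is an assumption for illustration: the scores, the threshold `tau`, and the helper names `expectation` and `evaluate` are hypothetical and do not come from the source.

```python
# Toy setup: one prompt x with three candidate responses y.
# All numbers below are made up for illustration.
r_help = [0.9, 0.6, 0.2]   # hypothetical helpfulness reward r_help(x, y)
v_S = [0.1, 0.8, 1.0]      # hypothetical spec-compliance score v_S(x, y)
tau = 0.7                  # hypothetical compliance threshold


def expectation(policy, scores):
    """E_{y ~ policy}[scores(y)] for a discrete policy given as probabilities."""
    return sum(p * s for p, s in zip(policy, scores))


def evaluate(policy):
    """Return (expected helpfulness, expected compliance, constraint satisfied?)."""
    help_val = expectation(policy, r_help)
    comp_val = expectation(policy, v_S)
    return help_val, comp_val, comp_val >= tau


# A policy that always picks the most helpful response.
greedy_helpful = [1.0, 0.0, 0.0]
# A policy that always picks the second response.
compliant = [0.0, 1.0, 0.0]

print(evaluate(greedy_helpful))
print(evaluate(compliant))
```

In this toy, the most helpful policy violates the constraint (its expected compliance is 0.1, below `tau`), while the second policy gives up some helpfulness to satisfy it. The constrained problem asks for the most helpful policy among those that pass the check.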

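Lagrangian form, with multiplier $\lambda \ge 0$ (the constant $-\lambda\tau$ is dropped because it does not depend on $\pi$):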

\max_\pi \mathbb{E}[r_{\text{help}}] + \lambda \mathbb{E}[v_S]

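where $r_{\text{help}}$ is the helpfulness reward and $v_S$ scores compliance with the spec text $S$, both evaluated on a prompt $x$ and model output $y$, and $\tau$ is the minimum acceptable expected compliance.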
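The role of $\lambda$ can be seen in a small sketch. It assumes a toy setting with three candidate outputs and a deterministic policy that picks one of them, so each expectation collapses to the score of the chosen output. The scores, the threshold `TAU`, the grid of multipliers, and the function names are all hypothetical. The grid sweep stands in for whatever multiplier update a real training setup would use, which the source does not specify.

```python
# Toy data: hypothetical scores for three candidate outputs to one prompt.
R_HELP = [0.9, 0.6, 0.2]   # r_help(x, y)
V_S = [0.1, 0.8, 1.0]      # v_S(x, y)
TAU = 0.7                  # hypothetical compliance threshold


def lagrangian(y, lam):
    """Relaxed objective r_help + lambda * v_S for a policy that always picks y."""
    return R_HELP[y] + lam * V_S[y]


def best_response(lam):
    """Index of the output that maximizes the relaxed objective for a fixed lambda."""
    return max(range(len(R_HELP)), key=lambda y: lagrangian(y, lam))


def smallest_feasible_lambda(grid):
    """Sweep lambda upward; return the first (lambda, y) whose maximizer meets v_S >= TAU."""
    for lam in sorted(grid):
        y = best_response(lam)
        if V_S[y] >= TAU:
            return lam, y
    return None  # no multiplier on the grid yields a feasible maximizer


result = smallest_feasible_lambda([0.0, 0.25, 0.5, 1.0, 2.0, 4.0])
print(result)
```

With $\lambda = 0$ the maximizer is the most helpful but least compliant output. Raising $\lambda$ shifts the maximizer toward compliant outputs, and a very large $\lambda$ favors compliance over helpfulness altogether. The sweep picks the smallest multiplier on the grid whose maximizer satisfies the constraint.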

OpenAI, 2024, arXiv

Explore this concept from different angles — like a mathematician would.