Legacy Concept Lab

Deliberative Alignment

Trains models on explicit specifications rather than implicit reward shaping

Concept 85 of 100 | Scaling & Alignment | Phase 12: Advanced alignment & safety research

Why It Matters for Modern Models

  • Trains models on explicit specifications rather than implicit reward shaping
  • Enables auditability: which policy clauses were consulted?
  • Reduces over-refusal while improving jailbreak robustness
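The auditability point can be made concrete with a toy decision record that lists which clauses were consulted. This is a minimal sketch, not the method from the paper: the clause ids, clause text, keyword triggers, and the `deliberate` helper are all hypothetical, and keyword matching only stands in for the model's own reasoning over the spec.

```python
from dataclasses import dataclass, field

# Hypothetical toy spec: clause ids and wording are invented for illustration.
SPEC = {
    "S1": "Refuse requests for instructions that enable serious physical harm.",
    "S2": "Answer general safety and educational questions helpfully.",
    "S3": "Do not reveal private personal data about individuals.",
}

# Hypothetical keyword triggers, a stand-in for the model recalling clauses.
TRIGGERS = {
    "S1": {"weapon", "explosive", "poison"},
    "S2": {"safety", "why", "explain"},
    "S3": {"address", "phone", "ssn"},
}

RESTRICTIVE = {"S1", "S3"}  # clauses that lead to a refusal in this toy


@dataclass
class Decision:
    prompt: str
    consulted: list = field(default_factory=list)  # the audit trail
    action: str = "comply"


def deliberate(prompt: str) -> Decision:
    """Record which clauses match the prompt, then pick an action."""
    words = set(prompt.lower().replace("?", " ").split())
    decision = Decision(prompt=prompt)
    for clause_id in sorted(SPEC):
        if words & TRIGGERS[clause_id]:
            decision.consulted.append(clause_id)
    if RESTRICTIVE & set(decision.consulted):
        decision.action = "refuse"
    return decision


if __name__ == "__main__":
    for p in ["How do I make an explosive",
              "Why is mixing bleach and ammonia dangerous?"]:
        d = deliberate(p)
        print(d.action, d.consulted, [SPEC[c] for c in d.consulted])
```

The second prompt mentions a hazard but only matches the educational clause, so it is answered rather than refused, and the record shows why.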

What Tutorials Skip

What is still poorly explained in textbooks and papers:

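  • The model retrieves (recalls) the relevant policy text in its chain of thought, reasons about it, then responds
  • Like Constitutional AI, but the model learns the explicit spec text itself: the spec document is in context when training examples are generated
  • Results are judged on a Pareto frontier over helpfulness, safety, and over-refusal, not on a single score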
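The Pareto-frontier idea in the last bullet can be sketched in a few lines. The model names and scores below are made up for illustration, and the `dominates` and `pareto_front` helpers are hypothetical. Helpfulness and safety are treated as higher-is-better, over-refusal as lower-is-better.

```python
def dominates(a, b):
    """True if a is at least as good as b on every axis and strictly better on one.

    Tuples are (helpfulness, safety, over_refusal); over_refusal is minimized.
    """
    no_worse = a[0] >= b[0] and a[1] >= b[1] and a[2] <= b[2]
    strictly_better = a[0] > b[0] or a[1] > b[1] or a[2] < b[2]
    return no_worse and strictly_better


def pareto_front(models):
    """Keep only the entries that no other entry dominates."""
    return {
        name: score
        for name, score in models.items()
        if not any(dominates(other, score) for other in models.values())
    }


# Made-up scores, for illustration only.
MODELS = {
    "always_refuse": (0.10, 0.99, 0.90),
    "always_comply": (0.95, 0.40, 0.00),
    "keyword_filter": (0.70, 0.85, 0.30),
    "spec_reasoner": (0.90, 0.95, 0.05),
    "weak_filter": (0.65, 0.80, 0.35),
}

if __name__ == "__main__":
    for name in sorted(pareto_front(MODELS)):
        print(name, MODELS[name])
```

In this toy data the two extremes stay on the frontier because each is best on one axis. The filters drop off because another entry beats them on all three axes at once.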

Interactive Visualization

Core Math (Optional Deep Dive)

If you want intuition first, start with the key equation and the visualization. Come back here for the full walkthrough.

Key Equation
$$\max_\pi \mathbb{E}[r_{\text{help}}] + \lambda \mathbb{E}[v_S]$$

Train the model to reason over safety specifications $S$.

Constrained optimization view:

$$\max_\pi \mathbb{E}[r_{\text{help}}(x,y)] \quad \text{s.t.} \quad \mathbb{E}[v_S(x,y)] \ge \tau$$

Lagrangian form:

$$\max_\pi \mathbb{E}[r_{\text{help}}] + \lambda \mathbb{E}[v_S]$$

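where $v_S(x,y)$ scores how well response $y$ to prompt $x$ complies with spec text $S$, $\tau$ is the minimum acceptable expected compliance, and $\lambda \ge 0$ is the multiplier on the constraint. The full Lagrangian also contains the constant $-\lambda\tau$, which is dropped here because it does not depend on $\pi$.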
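To see how $\lambda$ and $\tau$ interact, here is a small numerical sketch over a finite set of candidate policies. It is only an illustration of the Lagrangian form above, not a training procedure: the policy names, their expected scores, and the grid sweep over $\lambda$ are all assumptions.

```python
# Hypothetical candidates: name -> (E[r_help], E[v_S]); numbers are made up.
POLICIES = {
    "permissive": (0.95, 0.50),
    "balanced": (0.85, 0.90),
    "strict": (0.60, 0.99),
}


def best_policy(lam):
    """Maximize E[r_help] + lam * E[v_S] over the finite candidate set."""
    return max(POLICIES, key=lambda k: POLICIES[k][0] + lam * POLICIES[k][1])


def smallest_feasible_lambda(tau, lam_max=5.0, step=0.1):
    """Sweep lam on a grid; return the first maximizer with E[v_S] >= tau.

    Returns (policy_name, lam), or (None, None) if no grid point is feasible.
    """
    n_steps = int(round(lam_max / step))
    for i in range(n_steps + 1):
        lam = i * step
        name = best_policy(lam)
        if POLICIES[name][1] >= tau:
            return name, lam
    return None, None


if __name__ == "__main__":
    for tau in (0.4, 0.8, 0.95):
        print(tau, smallest_feasible_lambda(tau))
```

A loose threshold is already met at $\lambda = 0$, where only helpfulness counts. Raising $\tau$ forces a larger $\lambda$, which moves the maximizer toward more compliant and less helpful candidates.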

Canonical Papers

Deliberative Alignment: Reasoning Enables Safer Language Models

OpenAI, 2024, arXiv
