Legacy Concept Lab

Constitutional AI: Principles-Based Alignment

How Claude is trained—principles replace pure human preference labeling

Concept 69 of 100 | Phase 7: Alignment & RLHF | Scaling & Alignment

Why It Matters for Modern Models

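  • How Claude is trained: written principles replace much of the human preference labeling, most directly for harmlessness
  • Scales better than RLHF with human labels alone: a model can generate critiques and comparisons faster than humans can label them
  • More interpretable: the principles that guide behavior are written down and can be read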

What Tutorials Skip

What is still poorly explained in textbooks and papers:

  • The "constitution" is just a set of principles like "be honest" and "don't help with harm"
  • AI feedback can bootstrap from a smaller set of human preferences
  • Self-critique is iterative: multiple rounds of revision improve outputs

Interactive Visualization

Core Math (Optional Deep Dive)

If you want intuition first, start with the key equation and the visualization. Come back here for the full walkthrough.

Key Equation
r_{revised} = \text{Critique}(r_{initial}, \text{principle})

Constitutional AI (CAI) uses a two-stage process:

Stage 1 - Self-Critique: The model generates a response, critiques it against a principle, then revises it. The model is then fine-tuned on the revised responses:

r_{critique} = \text{LLM}(\text{prompt}, r_{initial}, \text{principle})
r_{revised} = \text{LLM}(\text{prompt}, r_{initial}, r_{critique})
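The two Stage 1 equations can be sketched as a small loop. This is a minimal illustration, not the training pipeline from the paper: the `LLM` type, the prompt wording, and the `toy_llm` stand-in are all assumptions made so the example runs without any model or API.

```python
from typing import Callable, Sequence

# Hypothetical type: any function that maps a text prompt to a text completion.
LLM = Callable[[str], str]


def critique_and_revise(
    llm: LLM, prompt: str, principles: Sequence[str], rounds: int = 1
) -> str:
    """Draft a response, then critique and revise it once per principle per round."""
    response = llm(prompt)  # r_initial
    for _ in range(rounds):
        for principle in principles:
            # r_critique = LLM(prompt, r_initial, principle)
            critique = llm(
                f"Prompt: {prompt}\nResponse: {response}\n"
                f"Critique the response using this principle: {principle}"
            )
            # r_revised = LLM(prompt, r_initial, r_critique)
            response = llm(
                f"Prompt: {prompt}\nResponse: {response}\nCritique: {critique}\n"
                "Rewrite the response to address the critique."
            )
    return response


def toy_llm(text: str) -> str:
    """Stand-in model (assumption): returns canned strings, not a real LLM."""
    if "Rewrite the response" in text:
        return "revised answer"
    if "Critique the response" in text:
        return "the draft could be more careful"
    return "draft answer"


final = critique_and_revise(
    toy_llm, "Explain how vaccines work.", ["be honest", "don't help with harm"]
)
print(final)  # revised answer
```

Setting `rounds` above 1 repeats the critique and revision pass, which is the iterative self-critique described earlier. In practice `llm` would wrap a real model call.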

Stage 2 - RLAIF: Train a reward model on AI-generated preferences, then optimize the policy against it with reinforcement learning:

\mathcal{L}_{RM} = -\log \sigma(r_\theta(r_{chosen}) - r_\theta(r_{rejected}))
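To make the reward-model loss concrete, here is a small numeric sketch for a single preference pair. The function name `pairwise_rm_loss` and the example scores are illustrative assumptions. The inputs stand for the scalar scores r_θ(r_chosen) and r_θ(r_rejected).

```python
import math


def pairwise_rm_loss(score_chosen: float, score_rejected: float) -> float:
    """Return -log(sigmoid(score_chosen - score_rejected)) for one pair."""
    margin = score_chosen - score_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch on the sign of m
    # so that exp() is never called on a large positive number.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


print(round(pairwise_rm_loss(0.0, 0.0), 4))  # 0.6931 (no preference yet: log 2)
print(round(pairwise_rm_loss(2.0, 0.0), 4))  # 0.1269 (chosen scored higher: small loss)
print(round(pairwise_rm_loss(0.0, 2.0), 4))  # 2.1269 (rejected scored higher: large loss)
```

The loss shrinks as the reward model scores the chosen response further above the rejected one, so minimizing it pushes the two scores apart in the preferred direction.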

Key insight: Principles ("Be helpful", "Don't be harmful") can guide AI feedback.
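One way to picture principle-guided AI feedback is as a labeling step that turns two candidate responses into a (chosen, rejected) pair for the reward model. The sketch below is an assumption-laden illustration: the `Judge` type, the question wording, the record fields, and `toy_judge` are all hypothetical, and a hard A/B label is a simplification.

```python
from typing import Callable, Dict

# Hypothetical type: a feedback model that answers a question with "A" or "B".
Judge = Callable[[str], str]


def label_pair(
    judge: Judge, prompt: str, response_a: str, response_b: str, principle: str
) -> Dict[str, str]:
    """Ask the judge which response better follows the principle; return one record."""
    question = (
        f"Prompt: {prompt}\n(A) {response_a}\n(B) {response_b}\n"
        f"Which response better follows this principle: {principle}? Answer A or B."
    )
    choice = judge(question).strip().upper()
    if choice not in {"A", "B"}:
        raise ValueError(f"unexpected judge answer: {choice!r}")
    if choice == "A":
        chosen, rejected = response_a, response_b
    else:
        chosen, rejected = response_b, response_a
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected, "principle": principle}


def toy_judge(question: str) -> str:
    """Stand-in judge (assumption): always prefers option B."""
    return "B"


record = label_pair(
    toy_judge,
    "Summarize the article.",
    "A summary with an invented statistic.",
    "A summary that sticks to the article.",
    "be honest",
)
print(record["chosen"])  # A summary that sticks to the article.
```

Records like this supply the r_chosen and r_rejected pairs used to train the reward model in Stage 2.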

Canonical Papers

Constitutional AI: Harmlessness from AI Feedback

Bai et al., 2022, arXiv

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

Bai et al., 2022, arXiv
