Legacy Concept Lab
Constitutional AI: Principles-Based Alignment
How Claude is trained—principles replace pure human preference labeling
#69 | Constitutional | Scaling & Alignment
Key equation
$r_{revised} = \text{Critique}(r_{initial}, \text{principle})$
Phase 7: Alignment & RLHF | Concept 69 of 100
Why It Matters for Modern Models
- Principles written in plain language take over much of the work that human preference labels do in standard RLHF
- Scales better than RLHF: AI can critique faster than humans can label
- More interpretable: you can see which principles guide behavior
What Tutorials Skip
What is still poorly explained in textbooks and papers:
- The "constitution" is just a set of principles like "be honest" and "don't help with harm"
- AI feedback can bootstrap from a smaller set of human preferences
- Self-critique is iterative: multiple rounds of revision improve outputs
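The iterative critique-and-revise loop described in the last bullet can be sketched as follows. This is a minimal illustration, not the actual training code: `critique_and_revise`, `toy_model`, the prompt wording, and the two-item `PRINCIPLES` list are all assumptions made for the example, and `toy_model` is an offline stand-in for a real language model.

```python
from typing import Callable, List

# Hypothetical principle list, borrowing the two examples quoted above.
PRINCIPLES = ["be honest", "don't help with harm"]


def critique_and_revise(
    prompt: str,
    generate: Callable[[str], str],
    principles: List[str],
    rounds: int = 2,
) -> List[str]:
    """Return the initial draft followed by one revision per round."""
    response = generate(prompt)
    history = [response]
    for i in range(rounds):
        principle = principles[i % len(principles)]  # cycle through principles
        critique = generate(f"Critique against '{principle}':\n{response}")
        response = generate(f"Revise:\n{response}\nCritique:\n{critique}")
        history.append(response)
    return history


def toy_model(text: str) -> str:
    """Offline stand-in for a language model (an assumption, not a real API)."""
    if text.startswith("Critique"):
        return "Could follow the principle more closely."
    if text.startswith("Revise:"):
        draft = text.split("\n")[1]  # second line holds the previous response
        return draft + " [revised]"
    return "draft"


history = critique_and_revise("Explain lock picking.", toy_model, PRINCIPLES)
print(history)
```

Each round produces one critique and one revision, so `rounds=2` yields a history of three responses. With a real model, `generate` would be the only piece that changes.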
Interactive Visualization
Core Math (Optional Deep Dive)
If you want intuition first, start with the key equation and the visualization. Come back here for the full walkthrough.
Key Equation
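Constitutional AI (CAI) uses a two-stage process:
Stage 1 - Self-Critique: The model generates an initial response, critiques it against a principle, and revises it: $r_{revised} = \text{Critique}(r_{initial}, \text{principle})$. The revised responses then serve as supervised fine-tuning data.
Stage 2 - RLAIF: The model compares pairs of responses against the principles, a reward model is trained on these AI-generated preferences, and the policy is then optimized against that reward model with reinforcement learning.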
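The page does not state a Stage 2 equation. One common choice for training a reward model on pairwise preferences is the Bradley-Terry (pairwise logistic) loss, shown here as an assumption rather than as the formula this page originally displayed. The symbols $R_\theta$ (reward model), $x$ (prompt), $r_w$ (AI-preferred response), and $r_l$ (rejected response) are introduced for this sketch.

```latex
\mathcal{L}(\theta) = -\,\mathbb{E}_{(x,\, r_w,\, r_l)}
  \left[ \log \sigma\!\left( R_\theta(x, r_w) - R_\theta(x, r_l) \right) \right],
\qquad \sigma(z) = \frac{1}{1 + e^{-z}}
```

Minimizing this loss pushes the reward of the preferred response above that of the rejected one. A reward model that scores both responses equally gets a loss of $\log 2$ per pair.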
Key insight: Principles ("Be helpful", "Don't be harmful") can guide AI feedback.
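To make the key insight concrete, here is a small sketch of principle-guided AI feedback producing preference labels, followed by the standard pairwise logistic (Bradley-Terry) loss that a reward model could be trained with. Everything here is hypothetical: `toy_judge` is a keyword heuristic standing in for a feedback model applying "Don't be harmful", and `FLAGGED`, `label_pairs`, and `pairwise_loss` are names invented for the example.

```python
import math
from typing import Callable, List, Tuple

FLAGGED = ("exploit",)  # toy proxy for "harmful" content (assumption)


def harm_score(text: str) -> int:
    """Count flagged words; lower is better under "Don't be harmful"."""
    return sum(word in text.lower() for word in FLAGGED)


def toy_judge(a: str, b: str) -> bool:
    """Stand-in for an AI feedback model: True if `a` is preferred over `b`."""
    return harm_score(a) <= harm_score(b)


def label_pairs(
    pairs: List[Tuple[str, str]], judge: Callable[[str, str], bool]
) -> List[Tuple[str, str]]:
    """Reorder each pair as (preferred, rejected) according to the judge."""
    return [(a, b) if judge(a, b) else (b, a) for a, b in pairs]


def pairwise_loss(
    reward: Callable[[str], float], labeled: List[Tuple[str, str]]
) -> float:
    """Mean of -log(sigmoid(reward(preferred) - reward(rejected)))."""
    losses = [math.log1p(math.exp(-(reward(w) - reward(l)))) for w, l in labeled]
    return sum(losses) / len(losses)


pairs = [
    ("Here is an exploit for the server.", "I can't help with that."),
    ("Use a strong password.", "Try this exploit."),
]
labeled = label_pairs(pairs, toy_judge)

# A reward that agrees with the judge beats an uninformative constant reward.
aligned = pairwise_loss(lambda t: -float(harm_score(t)), labeled)
constant = pairwise_loss(lambda t: 0.0, labeled)
print(aligned < constant)
```

In a real pipeline the judge would be a language model prompted with a principle, and the reward would be a learned network rather than a hand-written function. The structure (label pairs with AI feedback, then fit a reward to those labels) is the part this sketch is meant to show.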