Legacy Concept Lab

Constitutional AI: Principles-Based Alignment

How Claude is trained—principles replace pure human preference labeling

Concept 69 of 100 · Scaling & Alignment · Phase 7: Alignment & RLHF

Why It Matters for Modern Models

What is still poorly explained in textbooks and papers:

The "constitution" is just a set of principles like "be honest" and "don't help with harm"
AI feedback can bootstrap from a smaller set of human preferences
Self-critique is iterative: multiple rounds of revision improve outputs

If you want intuition first, start with the key equation and the visualization. Come back here for the full walkthrough.

Key Equation

r_{revised} = \text{Revise}(r_{initial}, \text{Critique}(r_{initial}, \text{principle}))

Constitutional AI (CAI) uses a two-stage process:

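Stage 1 - Self-Critique and Revision (supervised): The model generates a response, critiques it against a principle, then revises it. The revised responses become supervised fine-tuning data: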

r_{critique} = \text{LLM}(\text{prompt}, r_{initial}, \text{principle})

r_{revised} = \text{LLM}(\text{prompt}, r_{initial}, r_{critique})
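The two Stage 1 equations can be sketched as a loop. This is a minimal illustration, not the authors' implementation: the `LLM` callable type, the prompt wording, the `critique_and_revise` helper, and the `echo_llm` stub are all hypothetical, and drawing one principle at random per round is an assumption.

```python
import random
from typing import Callable, List

# Hypothetical interface: any function mapping a prompt string to a completion.
LLM = Callable[[str], str]

def critique_and_revise(llm: LLM, prompt: str, principles: List[str],
                        rounds: int = 2, seed: int = 0) -> List[str]:
    """Return [r_initial, response after round 1, ..., response after last round]."""
    rng = random.Random(seed)
    response = llm(prompt)  # r_initial
    history = [response]
    for _ in range(rounds):
        principle = rng.choice(principles)  # assumption: one principle per round
        critique = llm(  # r_critique = LLM(prompt, r_initial, principle)
            f"Prompt: {prompt}\nResponse: {response}\n"
            f"Critique the response according to this principle: {principle}"
        )
        response = llm(  # r_revised = LLM(prompt, r_initial, r_critique)
            f"Prompt: {prompt}\nResponse: {response}\nCritique: {critique}\n"
            "Rewrite the response to address the critique."
        )
        history.append(response)
    return history

def echo_llm(text: str) -> str:
    """Stand-in for a real model; returns a tag so the data flow is visible."""
    if "Rewrite the response" in text:
        return "revised answer"
    if "Critique the response" in text:
        return "critique text"
    return "initial answer"

history = critique_and_revise(echo_llm, "How do I pick a lock?",
                              ["be honest", "avoid harm"], rounds=2)
print(history)
```

In a real pipeline the final entry of `history` would be paired with the original prompt as a fine-tuning example; the stub here only shows how the response, critique, and revision feed into each other.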

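Stage 2 - RLAIF (RL from AI Feedback): The model compares pairs of responses against a principle. A reward model is trained on these AI-generated preferences, and the policy is then optimized against it with RL: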

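\mathcal{L}_{RM} = -\log \sigma(r_\theta(r_{chosen}) - r_\theta(r_{rejected}))

Here r_\theta(\cdot) is the reward model's scalar score (not a response), \sigma is the logistic sigmoid, and r_{chosen}, r_{rejected} are the preferred and dispreferred responses to the same prompt.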
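To make the reward-model loss concrete, here is a small sketch that evaluates it for a single preference pair. The function name `pairwise_rm_loss` is illustrative, and the two scores are assumed to be scalar outputs of the reward model for the chosen and rejected responses.

```python
import math

def pairwise_rm_loss(score_chosen: float, score_rejected: float) -> float:
    """Compute -log(sigmoid(score_chosen - score_rejected)) without overflow."""
    margin = score_chosen - score_rejected
    # -log(sigmoid(m)) = log(1 + exp(-m)); branch so exp() never sees a large positive argument.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Equal scores: the model is indifferent, so the loss is log(2).
print(pairwise_rm_loss(0.0, 0.0))
# Chosen response scored higher: smaller loss.
print(pairwise_rm_loss(2.0, 0.0))
# Chosen response scored lower: larger loss.
print(pairwise_rm_loss(0.0, 2.0))
```

The loss shrinks toward zero as the reward model scores the chosen response further above the rejected one, and grows roughly linearly in the margin when the ordering is wrong.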

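Key insight: Written principles (e.g. "Be helpful", "Don't be harmful") condition the model's own critiques and preference judgments, so AI feedback can stand in for much of the human preference labeling.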
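One way to picture how a principle guides AI feedback is as a labeling step that turns two candidate responses into a (chosen, rejected) pair for the reward model. The sketch below is hypothetical throughout: the `Judge` callable, the `label_pair` helper, and the toy rule inside `toy_judge` are assumptions for illustration, standing in for a feedback model that is asked which response better satisfies the principle.

```python
import random
from typing import Callable, Dict, List

# Hypothetical judge: returns P(response A is better than B) under a principle.
Judge = Callable[[str, str, str, str], float]

def label_pair(judge: Judge, prompt: str, resp_a: str, resp_b: str,
               principles: List[str], rng: random.Random) -> Dict[str, object]:
    """Build one AI-labeled preference record for reward-model training."""
    principle = rng.choice(principles)  # assumption: one sampled principle per comparison
    p_a = judge(prompt, resp_a, resp_b, principle)
    chosen, rejected = (resp_a, resp_b) if p_a >= 0.5 else (resp_b, resp_a)
    return {
        "prompt": prompt,
        "principle": principle,
        "chosen": chosen,
        "rejected": rejected,
        "p_chosen": max(p_a, 1.0 - p_a),  # judge's confidence in the label
    }

def toy_judge(prompt: str, resp_a: str, resp_b: str, principle: str) -> float:
    """Toy rule, not a real model: prefer the response that explains itself."""
    a_has, b_has = "because" in resp_a, "because" in resp_b
    if a_has and not b_has:
        return 0.9
    if b_has and not a_has:
        return 0.1
    return 0.5

record = label_pair(toy_judge, "Is this safe?", "No.",
                    "No, because it is unsafe.", ["be honest"], random.Random(0))
print(record["chosen"], "|", record["rejected"])
```

Records of this shape supply the r_{chosen} and r_{rejected} terms in the reward-model loss; a real judge would be a language model prompted with the principle rather than a keyword rule.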

Bai et al., 2022, arXiv

