Legacy Concept Lab

RLAIF: AI Feedback

Removes human labeling bottleneck

Concept 82 of 100 · Scaling & Alignment · Phase 7: Alignment & RLHF

Why It Matters for Modern Models

  • Removes human labeling bottleneck
  • Powers Constitutional AI and production alignment
  • Label quality approaches human level at 10-100× lower cost

What Tutorials Skip

What is still poorly explained in textbooks and papers:

  • AI feedback is consistent and follows complex rubrics
  • The judge can be the same model as the one being trained or a stronger one
  • Works best with clear criteria; fails on subjective judgments

Core Math (Optional Deep Dive)

If you want intuition first, start with the key equation. Come back here for the full walkthrough.

Key Equation
$r(y) = \text{LLM}(y \mid \text{criteria})$

Replace human preferences with AI preferences:

RLHF: $r(y) = \text{Human}(y)$ → expensive
RLAIF: $r(y) = \text{LLM}(y \mid \text{criteria})$ → scalable

AI judge: $P(y_1 \succ y_2) = \text{LLM}(x, y_1, y_2, \text{rubric})$
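The judge equation above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the `Judge` signature, `preference_prob`, and `toy_judge` are hypothetical names, and `toy_judge` is a stand-in for a real LLM call. Averaging over both presentation orders is one common way to reduce position bias, assumed here rather than taken from the text.

```python
from typing import Callable

# Hypothetical judge signature: given (prompt, first response, second response,
# rubric), return the probability in [0, 1] that the FIRST response is better.
Judge = Callable[[str, str, str, str], float]


def preference_prob(judge: Judge, x: str, y1: str, y2: str, rubric: str) -> float:
    """Estimate P(y1 > y2) by querying the judge in both presentation orders."""
    p_forward = judge(x, y1, y2, rubric)         # y1 shown first
    p_backward = 1.0 - judge(x, y2, y1, rubric)  # y2 shown first, flipped back
    return 0.5 * (p_forward + p_backward)


def toy_judge(x: str, first: str, second: str, rubric: str) -> float:
    """Stand-in for an LLM call: prefers the longer response and is
    deliberately biased by +0.1 toward whichever response is shown first."""
    base = 0.8 if len(first) > len(second) else 0.2
    return min(1.0, base + 0.1)


p = preference_prob(
    toy_judge,
    "Explain RLAIF.",
    "A longer, more detailed answer.",
    "Short.",
    "Prefer complete answers.",
)
```

With the toy judge, the +0.1 first-slot bias cancels when the two orders are averaged, which is the point of querying both orders. In a real setup the judge would typically be an LLM prompted with the rubric, and the resulting probabilities would serve as preference labels.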

AI-judge preferences correlate at roughly 0.85 with human preferences on many tasks.
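One way to check a claim like this on your own data is to compare AI-judge scores against human labels for the same response pairs. The sketch below assumes each list holds one value per pair, namely the probability that the first response is preferred. The function names and the numbers are made up for illustration and are not results from any paper.

```python
import math


def pearson(a: list[float], b: list[float]) -> float:
    """Pearson correlation between two equal-length lists of scores."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def agreement_rate(a: list[float], b: list[float], threshold: float = 0.5) -> float:
    """Fraction of pairs where both raters pick the same winner."""
    same = sum((x > threshold) == (y > threshold) for x, y in zip(a, b))
    return same / len(a)


# Illustrative, made-up values: P(first response preferred) for five pairs.
ai_probs = [0.9, 0.2, 0.7, 0.4, 0.8]
human_probs = [1.0, 0.0, 1.0, 1.0, 0.0]

corr = pearson(ai_probs, human_probs)
agree = agreement_rate(ai_probs, human_probs)
```

Correlation and agreement rate answer slightly different questions: the first is sensitive to how confident the judge is, the second only to which response it picks.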

Canonical Papers

RLAIF: Scaling Reinforcement Learning from Human Feedback with AI Feedback

Lee et al., 2023, arXiv