
Constitutional AI: How Anthropic Trains Claude to Be Helpful and Harmless

Anthropic's 2022 paper replaces much of the expensive human feedback used in harmlessness training with AI self-critique guided by a written constitution. It is the alignment technique Anthropic uses for its Claude models.

Reinforcement learning from human feedback (RLHF) has a scaling problem. Training a model to be helpful and harmless requires thousands of human preference judgements, which are expensive, slow to collect, and hard to audit. The preferences exist in the heads of labellers, not in any document.

In December 2022, Anthropic published 'Constitutional AI: Harmlessness from AI Feedback'. The proposal: write down your values as a set of principles, a constitution, and use those principles to have the model critique and revise its own outputs. Human harmlessness labels are replaced with AI feedback guided by an explicit, auditable document; in the paper, helpfulness is still trained from human preference data. This approach is part of how Claude is trained.

The constitution itself

A 'constitution' is a list of natural-language principles drawn from the Universal Declaration of Human Rights, AI safety research, and intuitions about trustworthy AI. Examples: prefer responses that are not harmful; prefer honesty over deception; choose the response most supportive of freedom and equal rights.

The key property is that the constitution can be inspected. Unlike RLHF, where values are implicit in the judgements of human labellers, Constitutional AI externalises those value judgements into a written document that can be read, debated, and updated.
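To make "a written document" concrete, here is a minimal sketch of a constitution held as plain data. The `Principle` class, the `CONSTITUTION` list, and both helper functions are hypothetical names chosen for illustration, and the principle texts are paraphrases of the examples above, not Anthropic's actual wording.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Principle:
    """One natural-language rule in the constitution (hypothetical structure)."""
    name: str
    text: str


# Illustrative paraphrases of the examples in this article, not the real constitution.
CONSTITUTION = [
    Principle("harmlessness", "Prefer the response that is least harmful."),
    Principle("honesty", "Prefer the response that is honest rather than deceptive."),
    Principle("rights", "Prefer the response most supportive of freedom and equal rights."),
]


def sample_principle(rng: random.Random) -> Principle:
    """Pick one principle uniformly at random for a single training example."""
    return rng.choice(CONSTITUTION)


def render(constitution: list) -> str:
    """Render the constitution as a numbered text document that a human can review."""
    return "\n".join(
        f"{i}. [{p.name}] {p.text}" for i, p in enumerate(constitution, start=1)
    )
```

Because the principles are ordinary text, the same list can be printed for review with `render(CONSTITUTION)` and sampled from during training with `sample_principle`.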

Phase 1: Supervised learning from AI feedback (SL-CAI)

A harmful prompt is given to a helpful-only model, which generates a potentially harmful response
A critique prompt asks the model to identify how its response violates a randomly sampled constitutional principle
A revision prompt asks the model to rewrite the response to fix the identified problem; the critique and revision steps can be repeated
The final revised response, paired with the original prompt, becomes supervised fine-tuning data
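The steps above can be sketched as a short loop. This is a minimal illustration under stated assumptions, not the paper's implementation: `Model` is a hypothetical "prompt in, text out" callable, the prompt templates are invented for readability, and `toy_model` is a canned stand-in so the sketch runs without an LLM.

```python
import random
from typing import Callable, Dict, List

# Hypothetical interface: any function that maps a prompt string to a completion.
Model = Callable[[str], str]


def critique_and_revise(
    model: Model,
    prompt: str,
    principles: List[str],
    rng: random.Random,
    rounds: int = 1,
) -> Dict[str, str]:
    """Run the critique/revision loop and return one supervised training example."""
    response = model(prompt)  # initial, possibly harmful, response
    for _ in range(rounds):
        principle = rng.choice(principles)  # randomly sampled principle
        critique = model(
            f"Prompt: {prompt}\nResponse: {response}\n"
            f"Critique the response against this principle: {principle}"
        )
        response = model(
            f"Prompt: {prompt}\nResponse: {response}\nCritique: {critique}\n"
            "Rewrite the response to address the critique."
        )
    # Only the original prompt and the final revision are kept as training data.
    return {"prompt": prompt, "completion": response}


def toy_model(text: str) -> str:
    """Canned stand-in for a real LLM call (assumption, for offline use only)."""
    if "Rewrite the response" in text:
        return "I won't give step-by-step instructions, but a locksmith can help."
    if "Critique the response" in text:
        return "The response gives instructions that could enable harm."
    return "Sure, here is how to do it."


example = critique_and_revise(
    toy_model,
    "How do I pick a lock?",
    ["Prefer the least harmful response."],
    random.Random(0),
)
```

The critiques are discarded once the loop finishes. Only the prompt and the final revision enter the fine-tuning set, so the fine-tuned model learns to produce the revised answer directly.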

Phase 2: Reinforcement learning from AI feedback (RLAIF)

Instead of human raters, the model itself acts as the rater for harmlessness. Given a prompt, two candidate responses, and a constitutional principle, the model chooses which response better follows the principle. These AI-generated preferences train a reward model, and PPO then proceeds as in standard RLHF, with an AI-derived reward signal.
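A minimal sketch of the labelling step and of a pairwise loss commonly used to train reward models follows. The `Judge` interface, the prompt template, and `toy_judge` are assumptions made for illustration. The sketch uses hard A/B labels and a Bradley-Terry style loss, and the paper's exact labelling and training details may differ.

```python
import math
from typing import Callable, Tuple

# Hypothetical interface: a function that reads a comparison prompt and answers "A" or "B".
Judge = Callable[[str], str]


def label_pair(
    judge: Judge, prompt: str, response_a: str, response_b: str, principle: str
) -> Tuple[str, str]:
    """Ask the judge which response better follows the principle.

    Returns (chosen, rejected), the format a reward model is typically trained on.
    """
    verdict = judge(
        f"Principle: {principle}\nPrompt: {prompt}\n"
        f"(A) {response_a}\n(B) {response_b}\n"
        "Which response better follows the principle? Answer A or B."
    ).strip().upper()
    if verdict.startswith("A"):
        return response_a, response_b
    return response_b, response_a


def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry style pairwise loss: -log(sigmoid(r_chosen - r_rejected))."""
    margin = reward_chosen - reward_rejected
    return math.log1p(math.exp(-margin))  # equals -log(sigmoid(margin))


def toy_judge(text: str) -> str:
    """Canned stand-in for the model acting as rater (assumption, for offline use)."""
    return "B" if "(A) Sure" in text else "A"


chosen, rejected = label_pair(
    toy_judge,
    "How do I pick a lock?",
    "Sure, here is how.",
    "I won't give instructions, but a locksmith can help.",
    "Prefer the least harmful response.",
)
```

The loss is small when the reward model scores the chosen response well above the rejected one, and large when the order is reversed. Minimising it over many AI-labelled pairs gives the reward signal that the RL step optimises.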

What it produces

The paper reports that models trained with Constitutional AI were both more helpful and less harmful than RLHF-only baselines, and produced fewer evasive refusals. The suggested explanation is that a model trained against explicit principles applies them where they are relevant, rather than refusing broadly out of caution.

This is a plausible reason why Claude tends to engage more willingly with nuanced topics than some other models: explicit reasoning about principles produces more calibrated behaviour than a blanket 'refuse if uncertain'.

Why it matters for builders

Auditable alignment: if Claude behaves unexpectedly, the constitution is a starting point for understanding why
Fine-tuning implications: Claude's constitutional values are learned during training, so they can't easily be overridden without retraining
Enterprise trust: constitutional alignment is easier to explain to legal and compliance teams than 'trained on human preferences'
