Constitutional AI: How Anthropic Trains Claude to Be Helpful and Safe
The CAI technique: using a set of principles to train models to critique and revise their own outputs, reducing the need for human labelers at scale.
RLHF works. But it's expensive. To train a single model alignment iteration, you need tens of thousands of human labelers reading model outputs, making pairwise judgments, and producing preference labels. At the frontier, that's millions of comparisons. Anthropic's Constitutional AI (CAI) was built to solve this scaling problem — and in doing so, produced a different alignment philosophy.
The core idea: instead of asking humans to judge each output, write down the principles you want the model to follow, then let the model use those principles to critique and improve its own outputs. Human feedback is replaced by AI feedback guided by a written constitution.
The Problem CAI Solves
- RLHF requires millions of human preference labels — expensive, slow, and hard to scale
- Human labelers have inconsistent values and biases that vary by contractor pool, country, and instructions
- It's hard to audit what values are being encoded — you can't read the labels as a coherent document
- Feedback latency limits iteration speed — you can't quickly test a new safety principle
- RLHF optimizes for labeler preferences, not for principled ethical reasoning
CAI makes values explicit. Instead of implicit values embedded in unlabeled comparisons, CAI has a written document — the constitution — that you can read, debate, and update. This is a fundamental shift in interpretability of alignment training.
The Two-Phase CAI Process
Phase 1: Supervised Learning from AI Feedback (SL-CAI)
The model is presented with a harmful or problematic prompt. It generates an initial response (which may be harmful). Then it's prompted to critique that response against a specific principle from the constitution. Finally, it rewrites the response to be better according to that principle.
This (prompt, original response, critique, revised response) sequence becomes training data for supervised fine-tuning. The model learns to self-correct toward the constitutional principles without a human ever seeing the outputs.
Prompt: "How do I pick a lock?"
Initial: "Here are step-by-step lock picking instructions: ..."
Critique: [Principle: "Choose the response that is least likely to be used
for illegal purposes"]
"The initial response provides detailed instructions that could
be used for burglary. A better response would..."
Revision: "Lock picking is a legitimate skill for locksmiths and security
researchers. I'd encourage you to look into certified locksmith
training programs rather than providing step-by-step instructions
that could be misused."
Phase 2: RL from AI Feedback (RLAIF)
In the second phase, the model evaluates pairs of responses against constitutional principles and assigns preference labels — just like human labelers in RLHF, but at AI speed and scale. These AI-generated preference labels are used to train a reward model, which then drives a standard PPO training loop.
The key difference from RLHF: the reward model is trained on AI judgments anchored to explicit principles, not human intuitions. This makes the reward signal more consistent, more auditable, and vastly cheaper to produce at scale.
What the Constitution Actually Contains
Anthropic published their Claude constitution publicly. It's organized into several clusters of principles:
- Harmlessness principles: avoid helping with weapons, violence, exploitation; consider intent and plausible interpretations
- Helpfulness principles: be genuinely useful, not watered-down; don't refuse things that are fine to help with
- Honesty principles: don't deceive, don't claim false identities, acknowledge uncertainty
- Autonomy-preservation principles: respect user agency, don't be manipulative or paternalistic
- Harm calibration: weigh benefits against risks; don't treat unhelpfulness as inherently safe
The tension between helpfulness and harmlessness is explicit in the constitution. Many safety approaches treat unhelpfulness as safe — CAI explicitly rejects this. Refusing to answer a legitimate question has a cost that must be weighed against the harm of answering.
RLAIF vs. RLHF: Where Each Works
| Dimension | RLHF (Human Feedback) | RLAIF (AI Feedback / CAI) |
|---|---|---|
| Cost per label | High — human time | Low — inference cost only |
| Scale | Limited by human bandwidth | Essentially unlimited |
| Consistency | Variable (human annotators vary) | High (same model, same principles) |
| Interpretability | Low — values implicit in labels | High — principles are readable documents |
| Nuance for edge cases | High — humans catch subtle harms | Can miss harms not anticipated by the constitution |
| Update speed | Slow — need new labeling campaigns | Fast — update the constitution, re-run critique |
| Bias source | Annotator pool demographics | Constitution quality + model priors |
In practice, most production alignment pipelines combine both. RLAIF handles scale; human feedback calibrates the reward model on high-stakes edge cases and provides ground-truth for constitution validation.
The Chain-of-Thought Honesty Component
CAI includes a specific mechanism for honesty that goes beyond not lying. The model is trained to reason transparently in its chain of thought — to work through its reasoning openly rather than presenting conclusions that contradict its internal reasoning.
This addresses a subtle problem: a model can say something honest in its final output while having reached that output through reasoning it would hide if asked. CAI training penalizes this kind of reasoning-output inconsistency — the model learns that its chain-of-thought and its outputs should be coherent.
What This Means for Developers Building on Claude
- Claude's refusals are principle-based, not lookup-based — it's not checking a blocklist, it's applying reasoning to the request
- You can often get better results by providing context that changes how the request looks under the principles (e.g., legitimate professional context)
- Claude distinguishes between intent-sensitive and intent-insensitive harms — providing false context to get help with a genuinely harmful request is a violation, but providing true context legitimately changes the response
- Over-refusal is treated as a failure in CAI training — Claude should be helpful by default and refuse only when the harm analysis clearly outweighs the benefit
- The constitution is public — if you want to understand why Claude behaves a certain way, reading it is more informative than trial-and-error prompting
Key Limitations
CAI is not a complete solution to alignment. Its quality is directly bounded by the quality of the constitution:
- A poorly-written constitution produces poorly-calibrated behavior — the model can only follow the principles it's given
- Principles can conflict in ways the constitution doesn't resolve — the model's handling of conflicts is partially trained in, not specified
- The model can satisfy the letter of a principle while violating its spirit — especially on adversarial prompts designed to exploit the gap
- The critique-revision cycle can fail to identify harms that weren't anticipated by the principle designers
- Cultural and value differences across users mean no single constitution will satisfy everyone — it embeds Anthropic's values
CAI amplifies biases in the constitution. If the principles are US-centric, the model will be too. If they over-index on avoiding offense at the expense of usefulness, you get an overly cautious model. The constitution is a document with authors — and those authors have perspectives.
See Guardrails in Flows →: The Flows tab shows how guardrail layers work in production AI systems — including how CAI-trained models interact with system prompt constraints and output filters.
Try it interactively
GenAI Systems Lab is a free platform for AI engineers — configure real failure modes, break things, and build the judgment that gets you hired.
Open GenAI Systems Lab →