Constitutional AI lets you enforce behavioral rules on LLM outputs without retraining models. The core idea: one model generates a response, another critiques it against your “constitution” (a set of principles), and the first model revises based on feedback. This self-correcting loop catches harmful outputs, policy violations, and off-brand responses before they reach users.
Here’s a working critique-and-revision chain using OpenAI’s API that enforces a simple constitution: “Be concise. Avoid medical advice. Stay factual.”
```python
from openai import OpenAI

client = OpenAI()

CONSTITUTION = """
1. Responses must be under 100 words
2. Never provide medical diagnoses or treatment recommendations
3. Only state verifiable facts, no speculation
"""

def generate_initial_response(prompt):
    """Generate first-pass response without constitutional constraints."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt}
        ],
        temperature=0.7
    )
    return response.choices[0].message.content

def critique_response(original_prompt, response):
    """Critique response against constitution rules."""
    critique_prompt = f"""
Constitution:
{CONSTITUTION}

Original user question: {original_prompt}
Assistant's response: {response}

Does this response violate any constitutional rules? List specific violations with rule numbers. If compliant, respond with "COMPLIANT".
"""
    critique = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": critique_prompt}],
        temperature=0
    )
    return critique.choices[0].message.content

def revise_response(original_prompt, response, critique):
    """Revise response based on critique feedback."""
    revision_prompt = f"""
You previously gave this response: {response}

A reviewer found these issues: {critique}

Revise your response to comply with all constitutional rules while still answering: {original_prompt}
"""
    revised = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": revision_prompt}],
        temperature=0.3
    )
    return revised.choices[0].message.content

def constitutional_chain(user_prompt, max_iterations=3):
    """Run full critique-revision loop until compliant."""
    response = generate_initial_response(user_prompt)
    for iteration in range(max_iterations):
        critique = critique_response(user_prompt, response)
        if "COMPLIANT" in critique.upper():
            print(f"✓ Compliant after {iteration} revisions")
            return response
        print(f"Iteration {iteration + 1}: {critique}")
        response = revise_response(user_prompt, response, critique)
    print("⚠ Max iterations reached, returning best attempt")
    return response

# Test it
user_question = "My head hurts and I feel dizzy. What should I do?"
final_response = constitutional_chain(user_question)
print(f"\nFinal response:\n{final_response}")
```
Run this and watch it catch medical advice violations, force conciseness, and self-correct within seconds. The initial response might say “You could have a migraine, try ibuprofen” (medical advice violation). The critique flags rule #2, and the revision becomes “See a doctor for persistent headaches and dizziness. Track your symptoms.”
Building a Multi-Constitution System
Real apps need different constitutions for different contexts. Here’s how to implement context-aware constitutional enforcement with Anthropic’s Claude (which has native Constitutional AI training, making it particularly good at critique tasks):
```python
import anthropic
import os

client = anthropic.Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))

CONSTITUTIONS = {
    "customer_support": """
1. Never promise refunds or account changes (escalate to human)
2. Stay empathetic and apologetic for customer frustration
3. Provide specific next steps, not vague reassurances
4. Keep responses under 150 words
""",
    "technical_docs": """
1. Every code example must be syntactically valid
2. Warn about deprecated methods or security risks
3. Include error handling in all examples
4. No marketing language, pure technical facts
""",
    "social_media": """
1. Never engage with inflammatory comments
2. Brand voice: friendly but professional, no slang
3. Include call-to-action in every response
4. Maximum 280 characters
"""
}

def constitutional_chat(user_message, context="customer_support", model="claude-3-5-sonnet-20241022"):
    """Single-pass constitutional enforcement with Claude."""
    constitution = CONSTITUTIONS.get(context, CONSTITUTIONS["customer_support"])
    system_prompt = f"""You are an assistant that MUST follow these constitutional rules:
{constitution}
Before responding, mentally check each rule. If you would violate any rule, revise your response until compliant.
"""
    response = client.messages.create(
        model=model,
        max_tokens=1024,
        system=system_prompt,
        messages=[{"role": "user", "content": user_message}]
    )
    return response.content[0].text

# Test different contexts
support_query = "This app is trash! I want my money back NOW!"
print("Support response:")
print(constitutional_chat(support_query, context="customer_support"))

tech_query = "Show me how to validate user input in a login form"
print("\nTechnical docs response:")
print(constitutional_chat(tech_query, context="technical_docs"))

social_query = "Why is your product so expensive compared to competitors?"
print("\nSocial media response:")
print(constitutional_chat(social_query, context="social_media"))
```
This approach embeds the constitution directly in the system prompt. Claude’s Constitutional AI training makes it naturally good at self-policing without needing separate critique calls. You save API costs and latency compared to multi-step chains.
Self-Improving Feedback Loops
The real power comes from logging violations and refining your constitution over time. Here’s a pattern for collecting feedback:
```python
import json
from datetime import datetime, timezone

class ConstitutionalLogger:
    def __init__(self, log_file="constitutional_violations.jsonl"):
        self.log_file = log_file

    def log_violation(self, prompt, response, critique, revision, context):
        """Append violation to JSONL log for analysis."""
        entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "context": context,
            "original_prompt": prompt,
            "violating_response": response,
            "critique": critique,
            "revision": revision
        }
        with open(self.log_file, "a") as f:
            f.write(json.dumps(entry) + "\n")

    def analyze_patterns(self):
        """Find most common violation types."""
        with open(self.log_file, "r") as f:
            violations = [json.loads(line) for line in f]
        # Extract rule numbers from critiques (case-insensitive, with or without "#")
        rule_counts = {}
        for v in violations:
            critique = v["critique"].lower()
            for i in range(1, 10):
                if f"rule {i}" in critique or f"rule #{i}" in critique:
                    rule_counts[i] = rule_counts.get(i, 0) + 1
        return sorted(rule_counts.items(), key=lambda x: x[1], reverse=True)

logger = ConstitutionalLogger()

def constitutional_chain_with_logging(user_prompt, context="customer_support"):
    """Enhanced chain that logs all violations for analysis."""
    response = generate_initial_response(user_prompt)
    critique = critique_response(user_prompt, response)
    if "COMPLIANT" not in critique.upper():
        revision = revise_response(user_prompt, response, critique)
        logger.log_violation(user_prompt, response, critique, revision, context)
        return revision
    return response

# After a week of production use:
print("Most violated rules:")
for rule_num, count in logger.analyze_patterns():
    print(f"Rule {rule_num}: {count} violations")
```
Review your logs monthly. If rule #2 (medical advice) gets violated 50 times but rule #1 (word count) only twice, you know where to tighten your constitution or add examples.
Practical Patterns for Production
Use critique chains for high-risk outputs only. Customer support replies? Yes. Blog post titles? Probably not worth the latency and cost.
Cache constitutions in your system prompts. Don’t fetch them from a database on every request. Version them in your codebase and deploy constitution updates like any other config change.
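One way to version constitutions like config, sketched below. The version string, structure, and `render_constitution` helper are illustrative assumptions, not a prescribed format:

```python
# Hypothetical sketch: keep constitutions as versioned code constants so updates
# ship through normal code review and deploys, with no per-request database read.
CONSTITUTION_VERSION = "2024-06-01"  # assumed versioning scheme

VERSIONED_CONSTITUTIONS = {
    "customer_support": {
        "version": CONSTITUTION_VERSION,
        "rules": [
            "Never promise refunds or account changes (escalate to a human)",
            "Keep responses under 150 words",
        ],
    },
}

def render_constitution(context: str) -> str:
    """Format a context's rules as the numbered list the critique prompt expects."""
    entry = VERSIONED_CONSTITUTIONS[context]
    rules = "\n".join(f"{i}. {r}" for i, r in enumerate(entry["rules"], start=1))
    return f"[constitution {entry['version']}]\n{rules}"

print(render_constitution("customer_support"))
```

Logging the version alongside each violation also lets you tell whether a constitution change actually reduced violations.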
Combine with traditional filters. Constitutional AI won’t catch everything. Still run regex blocklists, PII detectors, and profanity filters before and after constitutional checks.
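A minimal sketch of such a deterministic pre-filter (the patterns and labels here are assumptions; real deployments would use proper PII detectors):

```python
import re

# Cheap deterministic checks run before spending an API call on a critique.
BLOCKLIST = re.compile(r"\b(ssn|social security number)\b", re.IGNORECASE)
EMAIL_PII = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")  # rough email matcher

def pre_filter(text: str) -> list[str]:
    """Return deterministic violations; an empty list means 'pass to the LLM chain'."""
    hits = []
    if BLOCKLIST.search(text):
        hits.append("blocklist")
    if EMAIL_PII.search(text):
        hits.append("pii:email")
    return hits

print(pre_filter("Contact me at jane@example.com about my SSN"))
# → ['blocklist', 'pii:email']
```

Running the same filter on the revised output catches cases where a revision reintroduces banned content.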
Set iteration limits. Never let a critique-revision loop run unbounded. After 3-5 iterations, log a failure and fall back to a safe default response like “I can’t help with that request.”
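A standalone sketch of that bound, with the generate/critique/revise steps injected as callables so it works with any of the chains above (the stub functions in the demo are placeholders, not real API calls):

```python
SAFE_FALLBACK = "I can't help with that request."  # assumed safe default

def bounded_chain(user_prompt, generate, critique, revise, max_iterations=3):
    """Run critique/revise at most max_iterations times, then fall back."""
    response = generate(user_prompt)
    for _ in range(max_iterations):
        verdict = critique(user_prompt, response)
        if "COMPLIANT" in verdict.upper():
            return response
        response = revise(user_prompt, response, verdict)
    # Loop never converged: log and return the safe default, not the last draft.
    return SAFE_FALLBACK

# Demo with stubs: a critique that never approves forces the fallback.
result = bounded_chain(
    "hi",
    generate=lambda p: "draft",
    critique=lambda p, r: "Rule 1 violated",
    revise=lambda p, r, c: r + "!",
    max_iterations=2,
)
print(result)  # → I can't help with that request.
```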
Test constitutions like code. Write unit tests with known-bad prompts that should trigger revisions. If your medical advice constitution doesn’t catch “take two aspirin for your chest pain,” your rules are too vague.
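One possible shape for those tests: pair each known-bad prompt with the rule it must trigger. In CI you would pass the real critique function; the toy keyword critic below is a hypothetical stand-in so the harness itself is runnable:

```python
# Each case: (known-bad response, rule tag its critique must mention).
KNOWN_BAD = [
    ("take two aspirin for your chest pain", "rule 2"),        # medical advice
    ("the stock will definitely triple next week", "rule 3"),  # speculation
]

def toy_critic(text: str) -> str:
    """Stand-in for the real LLM critique call."""
    if "aspirin" in text or "ibuprofen" in text:
        return "Violates rule 2: medical advice"
    if "definitely" in text:
        return "Violates rule 3: speculation"
    return "COMPLIANT"

def run_constitution_tests(critic) -> list[str]:
    """Return a list of failures; empty means every bad prompt was caught."""
    failures = []
    for prompt, expected_rule in KNOWN_BAD:
        verdict = critic(prompt)
        if expected_rule not in verdict.lower():
            failures.append(f"{prompt!r} did not trigger {expected_rule}")
    return failures

print(run_constitution_tests(toy_critic))  # → []
```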
Common Errors and Fixes
Error: Critique model says “COMPLIANT” for obvious violations
Fix: Your constitution is too abstract. Change “Be helpful and harmless” to specific rules like “Never recommend prescription medications” or “Don’t answer questions about illegal activities.”
Error: Revision loop never converges, keeps revising forever
Fix: Your rules contradict each other. “Be detailed” and “Stay under 50 words” can’t both be satisfied. Prioritize rules explicitly: “Rule priority: safety > brevity > detail.”
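The priority ordering can also live in code, so conflict resolution is deterministic rather than left to the model. A minimal sketch, assuming the safety > brevity > detail ordering above:

```python
PRIORITY = ["safety", "brevity", "detail"]  # highest priority first (assumed tags)

def winning_rule(conflicting: list[str]) -> str:
    """When two rule tags conflict, keep the one highest in PRIORITY."""
    return min(conflicting, key=PRIORITY.index)

print(winning_rule(["detail", "brevity"]))  # → brevity
```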
Error: Claude/GPT ignores system prompt constitution
Fix: Move the constitution into the user message as a prefixed instruction: f"[CONSTITUTIONAL RULES: {constitution}]\n\nUser question: {prompt}". System prompts can get deprioritized with long contexts.
Error: API costs spike with multi-step chains
Fix: Use cheaper models for critique (gpt-3.5-turbo) and expensive models only for final revision (gpt-4). Or switch to single-pass constitutional prompting with Claude.
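A trivial sketch of that tiering: critiques are short classification calls, so a routing helper can send them to the cheap model and reserve the expensive one for the revision users actually see (the tier mapping is an assumption to adjust for your provider):

```python
MODEL_TIERS = {
    "critique": "gpt-3.5-turbo",  # short yes/no-style judgment, cheap model is enough
    "revise": "gpt-4",            # user-facing output, worth the expensive model
}

def model_for(step: str) -> str:
    """Pick the model for a chain step; unknown steps default to the safest tier."""
    return MODEL_TIERS.get(step, MODEL_TIERS["revise"])

print(model_for("critique"))  # → gpt-3.5-turbo
```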
Error: Logs fill up with false positives (compliant responses flagged as violations)
Fix: Your critique prompt is too aggressive. Add examples of acceptable responses to the critique step so the model calibrates properly.
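One way to add that calibration: prepend labeled examples of both compliant and non-compliant responses to the critique prompt. The example texts and prompt layout below are illustrative assumptions:

```python
# Few-shot calibration pairs: (example response, expected verdict).
CALIBRATION_EXAMPLES = [
    ("Please see a doctor about persistent pain.", "COMPLIANT"),
    ("Take 400mg ibuprofen twice daily.", "Violates rule 2: treatment recommendation"),
]

def build_critique_prompt(constitution: str, response: str) -> str:
    """Assemble a critique prompt with calibration examples ahead of the real case."""
    shots = "\n".join(f"Response: {r}\nVerdict: {v}" for r, v in CALIBRATION_EXAMPLES)
    return (
        f"Constitution:\n{constitution}\n\n"
        f"Calibration examples:\n{shots}\n\n"
        f"Response to judge: {response}\n"
        "Verdict (list violations with rule numbers, or COMPLIANT):"
    )

print(build_critique_prompt("1. No medical advice", "Rest and hydrate."))
```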
When to Skip Constitutional AI
If you need hard guarantees (HIPAA compliance, financial regulations), Constitutional AI isn’t enough. You need deterministic filters, human review, and real fine-tuning. CAI is for behavioral guidelines, not legal requirements.
If you control the model’s training data, fine-tuning with RLHF (Reinforcement Learning from Human Feedback) bakes in safety better than runtime prompting. But most teams don’t have that luxury.
For simple use cases (never say these 10 banned words), regex and keyword filters are faster and cheaper. Constitutional AI shines when rules are semantic: “don’t sound desperate in sales emails” or “avoid medical advice” can’t be caught with regex.