Eyeballing LLM outputs stops working the moment you have more than a handful of prompts in production. You need a repeatable way to score quality across dimensions like accuracy, tone, and completeness – and you need it to run without you staring at each response. A rubric-based evaluation pipeline solves this. You define criteria, assign scoring scales, feed outputs through a grader (often another LLM), and get structured scores you can track over time.
Here’s the shape of what we’re building:
```python
from dataclasses import dataclass, field

@dataclass
class RubricCriterion:
    name: str
    description: str
    scale_min: int = 1
    scale_max: int = 5

@dataclass
class Rubric:
    name: str
    criteria: list[RubricCriterion] = field(default_factory=list)

# Define a rubric for evaluating customer support responses
support_rubric = Rubric(
    name="customer_support_quality",
    criteria=[
        RubricCriterion(
            name="accuracy",
            description="Does the response contain factually correct information relevant to the query?",
        ),
        RubricCriterion(
            name="tone",
            description="Is the tone professional, empathetic, and appropriate for customer support?",
        ),
        RubricCriterion(
            name="completeness",
            description="Does the response fully address the customer's question without missing key details?",
        ),
        RubricCriterion(
            name="conciseness",
            description="Is the response free of unnecessary filler and delivered in a reasonable length?",
        ),
    ],
)
```
That gives you a typed, reusable rubric. Now we need the machinery to actually score outputs against it.
## Scoring Outputs with an LLM Grader
The core idea: send each LLM output to a grading model along with the rubric criteria, and ask it to return structured scores. Using `response_format` in JSON mode keeps the grader's output parseable.
```python
import json

from openai import OpenAI

client = OpenAI()

def build_grading_prompt(rubric: Rubric, prompt_text: str, output_text: str) -> str:
    criteria_block = "\n".join(
        f"- {c.name} ({c.scale_min}-{c.scale_max}): {c.description}"
        for c in rubric.criteria
    )
    return f"""You are an evaluation assistant. Score the following LLM output against each criterion.

PROMPT GIVEN TO THE LLM:
{prompt_text}

LLM OUTPUT:
{output_text}

RUBRIC CRITERIA (score each on the given scale):
{criteria_block}

Return a JSON object with a key "scores" containing an object mapping each criterion name to an integer score, and a key "reasoning" mapping each criterion name to a one-sentence justification."""

def grade_output(rubric: Rubric, prompt_text: str, output_text: str) -> dict:
    grading_prompt = build_grading_prompt(rubric, prompt_text, output_text)
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are a strict evaluation grader. Return only valid JSON."},
            {"role": "user", "content": grading_prompt},
        ],
        response_format={"type": "json_object"},
        temperature=0.0,
    )
    return json.loads(response.choices[0].message.content)
```
Setting `temperature=0.0` makes scores more repeatable across runs. The `json_object` response format constrains the model to emit valid JSON rather than markdown-wrapped output, though as the errors section below shows, it is not bulletproof.
Test it with a quick call:
```python
result = grade_output(
    rubric=support_rubric,
    prompt_text="A customer asks: 'How do I reset my password?'",
    output_text="Click 'Forgot Password' on the login page, enter your email, and follow the reset link sent to your inbox. The link expires in 24 hours.",
)
print(json.dumps(result, indent=2))
# {
#   "scores": {"accuracy": 5, "tone": 4, "completeness": 5, "conciseness": 5},
#   "reasoning": {"accuracy": "The steps are correct and standard for password reset flows.", ...}
# }
```
## Running a Full Evaluation Pipeline
A single score is useful. A pipeline that runs multiple test cases across multiple prompt variants and collects all the results is what you actually need. Here’s a pipeline runner that takes a list of test cases and prompt variants, then scores everything:
```python
from dataclasses import dataclass

@dataclass
class TestCase:
    name: str
    user_prompt: str
    expected_context: str  # optional ground truth or context for the grader

@dataclass
class PromptVariant:
    name: str
    system_prompt: str

@dataclass
class EvalResult:
    test_case: str
    variant: str
    scores: dict[str, int]
    reasoning: dict[str, str]

def run_pipeline(
    rubric: Rubric,
    variants: list[PromptVariant],
    test_cases: list[TestCase],
    model: str = "gpt-4o-mini",
) -> list[EvalResult]:
    results = []
    for case in test_cases:
        for variant in variants:
            # Generate the output from the variant
            gen_response = client.chat.completions.create(
                model=model,
                messages=[
                    {"role": "system", "content": variant.system_prompt},
                    {"role": "user", "content": case.user_prompt},
                ],
                temperature=0.7,
            )
            output_text = gen_response.choices[0].message.content
            # Grade it
            grade = grade_output(rubric, case.user_prompt, output_text)
            results.append(EvalResult(
                test_case=case.name,
                variant=variant.name,
                scores=grade.get("scores", {}),
                reasoning=grade.get("reasoning", {}),
            ))
            print(f"  Graded: {case.name} x {variant.name} -> {grade.get('scores', {})}")
    return results
```
Run it with two prompt variants and a few test cases:
```python
variants = [
    PromptVariant(name="formal", system_prompt="You are a professional customer support agent. Be formal and precise."),
    PromptVariant(name="casual", system_prompt="You are a friendly support agent. Be casual and warm."),
]

test_cases = [
    TestCase(name="password_reset", user_prompt="How do I reset my password?", expected_context="Standard password reset flow"),
    TestCase(name="billing_dispute", user_prompt="I was charged twice for my subscription.", expected_context="Billing error resolution"),
    TestCase(name="feature_request", user_prompt="Can you add dark mode to the app?", expected_context="Feature request handling"),
]

results = run_pipeline(support_rubric, variants, test_cases)
```
This gives you 6 evaluated results (3 test cases x 2 variants), each with per-criterion scores and reasoning.
## Aggregating Results and Generating Reports
Raw scores need aggregation to be useful. You want averages per variant, per criterion, and a way to spot which variant wins on what dimension.
```python
from collections import defaultdict

def aggregate_results(results: list[EvalResult], rubric: Rubric) -> dict:
    variant_scores = defaultdict(lambda: defaultdict(list))
    for r in results:
        for criterion_name, score in r.scores.items():
            variant_scores[r.variant][criterion_name].append(score)

    report = {}
    for variant, criteria in variant_scores.items():
        report[variant] = {}
        for criterion_name, scores in criteria.items():
            avg = sum(scores) / len(scores)
            report[variant][criterion_name] = {
                "mean": round(avg, 2),
                "min": min(scores),
                "max": max(scores),
                "n": len(scores),
            }
        # Overall average across all criteria
        all_scores = [s for scores in criteria.values() for s in scores]
        report[variant]["_overall"] = round(sum(all_scores) / len(all_scores), 2)
    return report

def print_report(report: dict):
    for variant, criteria in report.items():
        # Use .get() rather than .pop() so printing doesn't mutate the report
        overall = criteria.get("_overall", "N/A")
        print(f"\n{'='*50}")
        print(f"Variant: {variant} (Overall: {overall})")
        print(f"{'='*50}")
        for criterion, stats in criteria.items():
            if criterion == "_overall":
                continue
            print(f"  {criterion:20s} mean={stats['mean']:.2f} min={stats['min']} max={stats['max']} n={stats['n']}")

report = aggregate_results(results, support_rubric)
print_report(report)
# ==================================================
# Variant: formal (Overall: 4.25)
# ==================================================
#   accuracy             mean=4.67 min=4 max=5 n=3
#   tone                 mean=4.00 min=3 max=5 n=3
#   completeness         mean=4.33 min=4 max=5 n=3
#   conciseness          mean=4.00 min=3 max=5 n=3
#
# ==================================================
# Variant: casual (Overall: 4.08)
# ==================================================
# ...
```
You can also dump the raw report dict to JSON or CSV for tracking scores over time. If you wire this into CI, you get prompt regression detection for free – any new prompt version that drops below a threshold fails the build.
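As a sketch of that CI gate (the report shape follows `aggregate_results` above; the 4.0 threshold is an arbitrary example bar, not a recommendation):

```python
def check_regression(report: dict, threshold: float = 4.0) -> list[str]:
    """Return the variants whose overall mean falls below the threshold."""
    return [
        variant
        for variant, criteria in report.items()
        if criteria.get("_overall", 0.0) < threshold
    ]

# Example with a hand-written report in the shape aggregate_results() returns:
example = {"formal": {"_overall": 4.25}, "casual": {"_overall": 3.92}}
print(check_regression(example))  # -> ['casual']
```

In a CI script, exit non-zero when the returned list is non-empty and the build fails on any regression.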
## Customizing Rubrics for Different Tasks
The same pipeline works for any task. Just swap the rubric. Here’s one for evaluating code generation quality:
```python
code_rubric = Rubric(
    name="code_generation_quality",
    criteria=[
        RubricCriterion(
            name="correctness",
            description="Does the generated code produce the correct output for the stated task?",
        ),
        RubricCriterion(
            name="readability",
            description="Is the code well-structured with clear variable names and appropriate comments?",
        ),
        RubricCriterion(
            name="efficiency",
            description="Does the code avoid unnecessary operations and use appropriate data structures?",
        ),
        RubricCriterion(
            name="error_handling",
            description="Does the code handle edge cases and potential errors gracefully?",
        ),
        RubricCriterion(
            name="follows_instructions",
            description="Does the code implement exactly what was asked, without adding unrequested features?",
        ),
    ],
)
```
You can also adjust the scoring scale. For binary pass/fail criteria, set scale_min=0 and scale_max=1. For more granular differentiation, use a 1-10 scale. The grading prompt adapts automatically because it reads the scale from the rubric definition.
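A binary criterion might be sketched like this (`no_pii_leak` is a hypothetical example; `RubricCriterion` is the same dataclass defined at the top, repeated here so the snippet stands alone):

```python
from dataclasses import dataclass

@dataclass
class RubricCriterion:  # same definition as earlier in the article
    name: str
    description: str
    scale_min: int = 1
    scale_max: int = 5

# A hypothetical binary pass/fail criterion. Because the grading prompt
# reads scale_min/scale_max from the rubric, it renders this as 0-1
# automatically.
safety_check = RubricCriterion(
    name="no_pii_leak",
    description="Does the response avoid exposing personal data? (0 = fail, 1 = pass)",
    scale_min=0,
    scale_max=1,
)
print(f"- {safety_check.name} ({safety_check.scale_min}-{safety_check.scale_max})")
# -> - no_pii_leak (0-1)
```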
One thing worth noting: the grader model matters. Using gpt-4o as the grader produces more consistent and calibrated scores than smaller models. If cost is a concern, generate outputs with gpt-4o-mini but always grade with gpt-4o.
## Common Errors and Fixes
Grader returns scores outside the rubric scale. Even with temperature=0.0, LLMs sometimes return a 6 on a 1-5 scale. Clamp scores after parsing:
```python
def clamp_scores(scores: dict, rubric: Rubric) -> dict:
    criteria_map = {c.name: c for c in rubric.criteria}
    clamped = {}
    for name, score in scores.items():
        if name in criteria_map:
            c = criteria_map[name]
            clamped[name] = max(c.scale_min, min(c.scale_max, score))
    return clamped
```
JSON parsing fails on grader response. Some models occasionally wrap JSON in markdown code fences. Strip them before parsing:
```python
def safe_parse_json(text: str) -> dict:
    cleaned = text.strip()
    if cleaned.startswith("```"):
        # Remove markdown code fences
        lines = cleaned.split("\n")
        lines = [l for l in lines if not l.strip().startswith("```")]
        cleaned = "\n".join(lines)
    return json.loads(cleaned)
```
Scores are inconsistent across runs. Even at temperature=0.0, API responses can vary slightly. Run the grader 3 times per evaluation and take the median score for each criterion. This adds cost but significantly improves reliability.
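A minimal sketch of that median trick. The helper below only does the aggregation, so it works on any list of grader responses; pairing it with `grade_output` from earlier is shown in the trailing comment:

```python
from statistics import median

def median_scores(grades: list[dict]) -> dict[str, int]:
    """Collapse several grader runs into one score per criterion via the median."""
    criterion_names = grades[0]["scores"].keys()
    return {
        name: int(median(g["scores"][name] for g in grades))
        for name in criterion_names
    }

# Usage sketch with the grader defined earlier:
# grades = [grade_output(support_rubric, prompt_text, output_text) for _ in range(3)]
# stable = median_scores(grades)
```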
Missing criteria in grader output. The grader sometimes omits a criterion, especially with longer rubrics. Validate the response and re-prompt if any criteria are missing:
```python
def validate_grade(grade: dict, rubric: Rubric) -> bool:
    expected_criteria = {c.name for c in rubric.criteria}
    returned_criteria = set(grade.get("scores", {}).keys())
    return expected_criteria == returned_criteria
```
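One way to sketch the re-prompt loop (names are illustrative; the grader is passed in as a callable so the same helper works with `grade_output` or a retry-wrapped version):

```python
from typing import Callable

def grade_until_complete(
    grader: Callable[[], dict],
    expected: set[str],
    max_attempts: int = 3,
) -> dict:
    """Call the grader until its "scores" key covers every expected criterion."""
    for _ in range(max_attempts):
        grade = grader()
        if set(grade.get("scores", {})) == expected:
            return grade
    raise ValueError(f"grader response still incomplete after {max_attempts} attempts")
```

Called like `grade_until_complete(lambda: grade_output(support_rubric, prompt, output), {c.name for c in support_rubric.criteria})`.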
Rate limit errors during large evaluation runs. When running hundreds of evaluations, you’ll hit OpenAI rate limits. Add exponential backoff or use the tenacity library:
```python
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, max=30))
def grade_output_with_retry(rubric, prompt_text, output_text):
    return grade_output(rubric, prompt_text, output_text)
```
The full pattern gives you a reusable, extensible evaluation system. Define rubrics as data, run pipelines against test suites, aggregate scores, and track quality over time. When a prompt change regresses one dimension, you catch it before your users do.