Eyeballing LLM outputs stops working the moment you have more than a handful of prompts in production. You need a repeatable way to score quality across dimensions like accuracy, tone, and completeness – and you need it to run without you staring at each response. A rubric-based evaluation pipeline solves this. You define criteria, assign scoring scales, feed outputs through a grader (often another LLM), and get structured scores you can track over time.

Here’s the shape of what we’re building:

from dataclasses import dataclass, field

@dataclass
class RubricCriterion:
    name: str
    description: str
    scale_min: int = 1
    scale_max: int = 5

@dataclass
class Rubric:
    name: str
    criteria: list[RubricCriterion] = field(default_factory=list)

# Define a rubric for evaluating customer support responses
support_rubric = Rubric(
    name="customer_support_quality",
    criteria=[
        RubricCriterion(
            name="accuracy",
            description="Does the response contain factually correct information relevant to the query?",
        ),
        RubricCriterion(
            name="tone",
            description="Is the tone professional, empathetic, and appropriate for customer support?",
        ),
        RubricCriterion(
            name="completeness",
            description="Does the response fully address the customer's question without missing key details?",
        ),
        RubricCriterion(
            name="conciseness",
            description="Is the response free of unnecessary filler and delivered in a reasonable length?",
        ),
    ],
)

That gives you a typed, reusable rubric. Now we need the machinery to actually score outputs against it.

Scoring Outputs with an LLM Grader

The core idea: send each LLM output to a grading model along with the rubric criteria, and ask it to return structured scores. Using response_format in JSON mode keeps the grader’s output parseable.

import json
from openai import OpenAI

client = OpenAI()

def build_grading_prompt(rubric: Rubric, prompt_text: str, output_text: str) -> str:
    criteria_block = "\n".join(
        f"- {c.name} ({c.scale_min}-{c.scale_max}): {c.description}"
        for c in rubric.criteria
    )
    return f"""You are an evaluation assistant. Score the following LLM output against each criterion.

PROMPT GIVEN TO THE LLM:
{prompt_text}

LLM OUTPUT:
{output_text}

RUBRIC CRITERIA (score each on the given scale):
{criteria_block}

Return a JSON object with a key "scores" containing an object mapping each criterion name to an integer score, and a key "reasoning" mapping each criterion name to a one-sentence justification."""


def grade_output(rubric: Rubric, prompt_text: str, output_text: str) -> dict:
    grading_prompt = build_grading_prompt(rubric, prompt_text, output_text)
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are a strict evaluation grader. Return only valid JSON."},
            {"role": "user", "content": grading_prompt},
        ],
        response_format={"type": "json_object"},
        temperature=0.0,
    )
    return json.loads(response.choices[0].message.content)

Setting temperature=0.0 makes scores more deterministic across runs. JSON mode (response_format={"type": "json_object"}) constrains the model to emit valid JSON instead of markdown-wrapped output.

Test it with a quick call:

result = grade_output(
    rubric=support_rubric,
    prompt_text="A customer asks: 'How do I reset my password?'",
    output_text="Click 'Forgot Password' on the login page, enter your email, and follow the reset link sent to your inbox. The link expires in 24 hours.",
)
print(json.dumps(result, indent=2))
# {
#   "scores": {"accuracy": 5, "tone": 4, "completeness": 5, "conciseness": 5},
#   "reasoning": {"accuracy": "The steps are correct and standard for password reset flows.", ...}
# }

Running a Full Evaluation Pipeline

A single score is useful. A pipeline that runs multiple test cases across multiple prompt variants and collects all the results is what you actually need. Here’s a pipeline runner that takes a list of test cases and prompt variants, then scores everything:

from dataclasses import dataclass

@dataclass
class TestCase:
    name: str
    user_prompt: str
    expected_context: str  # ground truth or context; not passed to the grader in this minimal pipeline

@dataclass
class PromptVariant:
    name: str
    system_prompt: str

@dataclass
class EvalResult:
    test_case: str
    variant: str
    scores: dict[str, int]
    reasoning: dict[str, str]


def run_pipeline(
    rubric: Rubric,
    variants: list[PromptVariant],
    test_cases: list[TestCase],
    model: str = "gpt-4o-mini",
) -> list[EvalResult]:
    results = []

    for case in test_cases:
        for variant in variants:
            # Generate the output from the variant
            gen_response = client.chat.completions.create(
                model=model,
                messages=[
                    {"role": "system", "content": variant.system_prompt},
                    {"role": "user", "content": case.user_prompt},
                ],
                temperature=0.7,
            )
            output_text = gen_response.choices[0].message.content

            # Grade it
            grade = grade_output(rubric, case.user_prompt, output_text)

            results.append(EvalResult(
                test_case=case.name,
                variant=variant.name,
                scores=grade.get("scores", {}),
                reasoning=grade.get("reasoning", {}),
            ))
            print(f"  Graded: {case.name} x {variant.name} -> {grade.get('scores', {})}")

    return results

Run it with two prompt variants and a few test cases:

variants = [
    PromptVariant(name="formal", system_prompt="You are a professional customer support agent. Be formal and precise."),
    PromptVariant(name="casual", system_prompt="You are a friendly support agent. Be casual and warm."),
]

test_cases = [
    TestCase(name="password_reset", user_prompt="How do I reset my password?", expected_context="Standard password reset flow"),
    TestCase(name="billing_dispute", user_prompt="I was charged twice for my subscription.", expected_context="Billing error resolution"),
    TestCase(name="feature_request", user_prompt="Can you add dark mode to the app?", expected_context="Feature request handling"),
]

results = run_pipeline(support_rubric, variants, test_cases)

This gives you 6 evaluated results (3 test cases x 2 variants), each with per-criterion scores and reasoning.
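
Because EvalResult is a plain dataclass, persisting a run is one dataclasses.asdict call per result. Here is a minimal sketch — the file name is illustrative, and the EvalResult stand-in simply mirrors the dataclass defined above so the snippet runs on its own:

```python
import json
from dataclasses import dataclass, asdict

# Stand-in for the EvalResult dataclass defined earlier,
# included so this snippet runs on its own.
@dataclass
class EvalResult:
    test_case: str
    variant: str
    scores: dict
    reasoning: dict

def save_results_jsonl(results: list, path: str) -> None:
    # One JSON object per line: easy to append to and diff across runs.
    with open(path, "w") as f:
        for r in results:
            f.write(json.dumps(asdict(r)) + "\n")
```

One JSON object per line means later runs can be appended to the same file and compared line by line without re-parsing the whole thing.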

Aggregating Results and Generating Reports

Raw scores need aggregation to be useful. You want averages per variant, per criterion, and a way to spot which variant wins on what dimension.

from collections import defaultdict

def aggregate_results(results: list[EvalResult], rubric: Rubric) -> dict:
    variant_scores = defaultdict(lambda: defaultdict(list))

    for r in results:
        for criterion_name, score in r.scores.items():
            variant_scores[r.variant][criterion_name].append(score)

    report = {}
    for variant, criteria in variant_scores.items():
        report[variant] = {}
        for criterion_name, scores in criteria.items():
            avg = sum(scores) / len(scores)
            report[variant][criterion_name] = {
                "mean": round(avg, 2),
                "min": min(scores),
                "max": max(scores),
                "n": len(scores),
            }
        # Overall average across all criteria
        all_scores = [s for scores in criteria.values() for s in scores]
        report[variant]["_overall"] = round(sum(all_scores) / len(all_scores), 2)

    return report


def print_report(report: dict):
    for variant, criteria in report.items():
        # Use .get() rather than .pop() so the report dict is not mutated
        overall = criteria.get("_overall", "N/A")
        print(f"\n{'='*50}")
        print(f"Variant: {variant} (Overall: {overall})")
        print(f"{'='*50}")
        for criterion, stats in criteria.items():
            if criterion == "_overall":
                continue
            print(f"  {criterion:20s}  mean={stats['mean']:.2f}  min={stats['min']}  max={stats['max']}  n={stats['n']}")


report = aggregate_results(results, support_rubric)
print_report(report)
# ==================================================
# Variant: formal (Overall: 4.25)
# ==================================================
#   accuracy              mean=4.67  min=4  max=5  n=3
#   tone                  mean=4.00  min=3  max=5  n=3
#   completeness          mean=4.33  min=4  max=5  n=3
#   conciseness           mean=4.00  min=3  max=5  n=3
#
# ==================================================
# Variant: casual (Overall: 4.08)
# ==================================================
#   ...

You can also dump the raw report dict to JSON or CSV for tracking scores over time. If you wire this into CI, you get prompt regression detection for free – any new prompt version that drops below a threshold fails the build.
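
A minimal sketch of that CI gate, operating on the report dict shape produced by aggregate_results (the 4.0 threshold is illustrative):

```python
def check_thresholds(report: dict, min_overall: float = 4.0) -> list[str]:
    # Flag every variant whose overall mean falls below the threshold;
    # a CI job can fail the build when this list is non-empty.
    failures = []
    for variant, criteria in report.items():
        overall = criteria.get("_overall")
        if overall is not None and overall < min_overall:
            failures.append(f"{variant}: overall {overall} < {min_overall}")
    return failures
```

In the CI step, exit non-zero when anything comes back, e.g. `raise SystemExit("\n".join(failures))`.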

Customizing Rubrics for Different Tasks

The same pipeline works for any task. Just swap the rubric. Here’s one for evaluating code generation quality:

code_rubric = Rubric(
    name="code_generation_quality",
    criteria=[
        RubricCriterion(
            name="correctness",
            description="Does the generated code produce the correct output for the stated task?",
        ),
        RubricCriterion(
            name="readability",
            description="Is the code well-structured with clear variable names and appropriate comments?",
        ),
        RubricCriterion(
            name="efficiency",
            description="Does the code avoid unnecessary operations and use appropriate data structures?",
        ),
        RubricCriterion(
            name="error_handling",
            description="Does the code handle edge cases and potential errors gracefully?",
        ),
        RubricCriterion(
            name="follows_instructions",
            description="Does the code implement exactly what was asked, without adding unrequested features?",
        ),
    ],
)

You can also adjust the scoring scale. For binary pass/fail criteria, set scale_min=0 and scale_max=1. For more granular differentiation, use a 1-10 scale. The grading prompt adapts automatically because it reads the scale from the rubric definition.
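
For example, a binary pass/fail criterion might look like this — the no_pii_leak criterion is made up for illustration, and the RubricCriterion stand-in mirrors the dataclass defined earlier so the snippet runs on its own:

```python
from dataclasses import dataclass

# Stand-in for the RubricCriterion dataclass defined earlier,
# included so this snippet runs on its own.
@dataclass
class RubricCriterion:
    name: str
    description: str
    scale_min: int = 1
    scale_max: int = 5

# A binary criterion: the grading prompt will render its scale as "(0-1)".
safety_check = RubricCriterion(
    name="no_pii_leak",
    description="Does the response avoid exposing personal or account data? 0 = leaks data, 1 = safe.",
    scale_min=0,
    scale_max=1,
)
```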

One thing worth noting: the grader model matters. Using gpt-4o as the grader produces more consistent and calibrated scores than smaller models. If cost is a concern, generate outputs with gpt-4o-mini but always grade with gpt-4o.

Common Errors and Fixes

Grader returns scores outside the rubric scale. Even with temperature=0.0, LLMs sometimes return a 6 on a 1-5 scale. Clamp scores after parsing:

def clamp_scores(scores: dict, rubric: Rubric) -> dict:
    criteria_map = {c.name: c for c in rubric.criteria}
    clamped = {}
    for name, score in scores.items():
        if name in criteria_map:
            c = criteria_map[name]
            clamped[name] = max(c.scale_min, min(c.scale_max, score))
    return clamped

JSON parsing fails on grader response. Models without a JSON response mode (or graders called through other providers) occasionally wrap JSON in markdown code fences. Strip them before parsing:

def safe_parse_json(text: str) -> dict:
    cleaned = text.strip()
    if cleaned.startswith("```"):
        # Remove markdown code fences
        lines = cleaned.split("\n")
        lines = [line for line in lines if not line.strip().startswith("```")]
        cleaned = "\n".join(lines)
    return json.loads(cleaned)

Scores are inconsistent across runs. Even at temperature=0.0, API responses can vary slightly. Run the grader 3 times per evaluation and take the median score for each criterion. This adds cost but significantly improves reliability.
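
A median-of-N wrapper might look like the sketch below. The grader is passed in as a callable (any function with grade_output's signature, such as grade_output itself) so the wrapper stays independent of the API client:

```python
import statistics

def median_grade(grader, rubric, prompt_text, output_text, n_runs: int = 3) -> dict:
    # grader is any callable with grade_output's signature,
    # e.g. the grade_output function defined earlier.
    runs = [grader(rubric, prompt_text, output_text) for _ in range(n_runs)]
    criterion_names = runs[0].get("scores", {})
    scores = {
        name: int(statistics.median([r["scores"][name] for r in runs]))
        for name in criterion_names
    }
    # Keep the first run's reasoning; the scores are per-criterion medians.
    return {"scores": scores, "reasoning": runs[0].get("reasoning", {})}
```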

Missing criteria in grader output. The grader sometimes omits a criterion, especially with longer rubrics. Validate the response and re-prompt if any criteria are missing:

def validate_grade(grade: dict, rubric: Rubric) -> bool:
    expected_criteria = {c.name for c in rubric.criteria}
    returned_criteria = set(grade.get("scores", {}).keys())
    return expected_criteria == returned_criteria
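
A simple validate-and-retry loop might look like this sketch — the grader is again passed in as a callable with grade_output's signature, and max_attempts is an illustrative default:

```python
def grade_until_complete(grader, rubric, prompt_text, output_text, max_attempts: int = 3) -> dict:
    # Re-ask the grader until every rubric criterion has a score,
    # up to max_attempts tries.
    expected = {c.name for c in rubric.criteria}
    for _ in range(max_attempts):
        grade = grader(rubric, prompt_text, output_text)
        if set(grade.get("scores", {})) == expected:
            return grade
    raise ValueError(f"Grader omitted criteria after {max_attempts} attempts")
```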

Rate limit errors during large evaluation runs. When running hundreds of evaluations, you’ll hit OpenAI rate limits. Add exponential backoff or use the tenacity library:

from tenacity import retry, stop_after_attempt, wait_exponential

@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, max=30))
def grade_output_with_retry(rubric, prompt_text, output_text):
    return grade_output(rubric, prompt_text, output_text)

The full pattern gives you a reusable, extensible evaluation system. Define rubrics as data, run pipelines against test suites, aggregate scores, and track quality over time. When a prompt change regresses one dimension, you catch it before your users do.