String matching and regex checks break the moment your LLM rephrases something. You tweak a prompt, the output is still correct, but your brittle assert "specific phrase" in response check fails anyway. The better approach: use a second LLM call as a judge to evaluate whether the output actually meets your criteria. This gives you semantic evaluation that handles paraphrasing, format variations, and edge cases that rule-based checks miss.

Here’s the core idea in code. You define test cases, run your prompt, then ask a judge model to score the output:

# prompt_regression.py
from openai import OpenAI
from dataclasses import dataclass

client = OpenAI()

@dataclass
class TestCase:
    name: str
    user_input: str
    expected_behavior: str

@dataclass
class JudgeResult:
    test_name: str
    score: int  # 1-5
    reasoning: str
    passed: bool

JUDGE_SYSTEM_PROMPT = """You are an evaluation judge. Given a prompt's output and a description of expected behavior, score the output from 1 to 5:

5 = Fully meets expectations, no issues
4 = Mostly meets expectations, minor gaps
3 = Partially meets expectations, noticeable issues
2 = Barely meets expectations, significant problems
1 = Does not meet expectations at all

Respond in exactly this format:
SCORE: <number>
REASONING: <one paragraph explanation>"""


def run_prompt(system_prompt: str, user_input: str, model: str = "gpt-4o-mini") -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_input},
        ],
        temperature=0.2,
    )
    return response.choices[0].message.content


def judge_output(output: str, expected_behavior: str, judge_model: str = "gpt-4o") -> tuple[int, str]:
    judge_input = f"""## LLM Output
{output}

## Expected Behavior
{expected_behavior}"""

    response = client.chat.completions.create(
        model=judge_model,
        messages=[
            {"role": "system", "content": JUDGE_SYSTEM_PROMPT},
            {"role": "user", "content": judge_input},
        ],
        temperature=0.0,
    )
    result = response.choices[0].message.content

    lines = result.strip().splitlines()
    score_line = next(l for l in lines if l.startswith("SCORE:"))
    score = int(score_line.split(":", 1)[1].strip())
    reasoning_line = next(l for l in lines if l.startswith("REASONING:"))
    reasoning = reasoning_line.split(":", 1)[1].strip()

    return score, reasoning

Running Test Cases and Aggregating Scores

Wrap the judge into a test runner that iterates through cases, collects scores, and returns a pass/fail verdict. The PASS_THRESHOLD is the minimum average score across all test cases. A threshold of 4.0 works well for most production prompts – it allows minor variation but catches real regressions.

# test_runner.py
from prompt_regression import run_prompt, judge_output, TestCase, JudgeResult

PASS_THRESHOLD = 4.0

def run_regression_suite(
    system_prompt: str,
    test_cases: list[TestCase],
    model: str = "gpt-4o-mini",
    judge_model: str = "gpt-4o",
) -> tuple[bool, list[JudgeResult]]:
    results = []

    for case in test_cases:
        output = run_prompt(system_prompt, case.user_input, model=model)
        score, reasoning = judge_output(output, case.expected_behavior, judge_model=judge_model)
        passed = score >= PASS_THRESHOLD
        results.append(JudgeResult(
            test_name=case.name,
            score=score,
            reasoning=reasoning,
            passed=passed,
        ))
        print(f"  {'PASS' if passed else 'FAIL'} [{score}/5] {case.name}")
        if not passed:
            print(f"        Reason: {reasoning}")

    avg_score = sum(r.score for r in results) / len(results)
    suite_passed = avg_score >= PASS_THRESHOLD
    print(f"\nAverage score: {avg_score:.2f} (threshold: {PASS_THRESHOLD})")
    print(f"Suite result: {'PASSED' if suite_passed else 'FAILED'}")

    return suite_passed, results


if __name__ == "__main__":
    # Define your prompt under test
    system_prompt = """You are a technical assistant. Answer programming questions
    concisely with a code example. Keep answers under 200 words."""

    # Define test cases with semantic expectations
    test_cases = [
        TestCase(
            name="python_list_comprehension",
            user_input="How do I filter a list in Python?",
            expected_behavior="Should explain list comprehensions or filter(), include a working Python code snippet, and stay under 200 words.",
        ),
        TestCase(
            name="error_handling_advice",
            user_input="How should I handle errors in a REST API?",
            expected_behavior="Should mention HTTP status codes, try/except or error middleware, and include a code example in any common web framework.",
        ),
        TestCase(
            name="refusal_on_offtopic",
            user_input="What's the best pizza in New York?",
            expected_behavior="Should politely decline or redirect, since the system prompt scopes it to programming questions only.",
        ),
    ]

    passed, results = run_regression_suite(system_prompt, test_cases)
    exit(0 if passed else 1)

Run it directly with python test_runner.py. The exit code is 0 for pass, 1 for fail, which makes CI integration straightforward.

Integrating with CI/CD

Drop this into a GitHub Actions workflow. The key is setting OPENAI_API_KEY as a repository secret and running the test suite as a step that can fail the build:

# .github/workflows/prompt-regression.yml
name: Prompt Regression Tests
on:
  pull_request:
    paths:
      - "prompts/**"
      - "tests/prompt_regression/**"

jobs:
  test-prompts:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install openai
      - run: python tests/prompt_regression/test_runner.py
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}

The paths filter means this only runs when prompt files or test files change, so you’re not burning API credits on unrelated PRs. Keep your test cases small and focused – 5 to 10 cases per prompt is enough to catch regressions without running up a huge bill. Using gpt-4o-mini as the target model and gpt-4o as the judge keeps costs manageable while still getting reliable evaluations.

Improving Judge Reliability

The judge prompt above works for general cases, but you’ll get more consistent scores with a few tweaks. Use structured output to eliminate parsing failures, and add a rubric specific to your use case:

# structured_judge.py
import json
from openai import OpenAI

client = OpenAI()

def judge_with_structured_output(
    output: str,
    expected_behavior: str,
    rubric: str = "",
    judge_model: str = "gpt-4o",
) -> dict:
    rubric_section = f"\n## Rubric\n{rubric}" if rubric else ""

    judge_input = f"""## LLM Output
{output}

## Expected Behavior
{expected_behavior}
{rubric_section}

Return a JSON object with keys: "score" (integer 1-5), "reasoning" (string)."""

    response = client.chat.completions.create(
        model=judge_model,
        messages=[
            {"role": "system", "content": "You are a strict evaluation judge. Return only valid JSON."},
            {"role": "user", "content": judge_input},
        ],
        response_format={"type": "json_object"},
        temperature=0.0,
    )

    return json.loads(response.choices[0].message.content)


# Usage with a custom rubric
result = judge_with_structured_output(
    output="Here's how to filter a list in Python: use a list comprehension like [x for x in items if x > 5].",
    expected_behavior="Should include a code example for filtering lists in Python.",
    rubric="""Deduct points for:
- Missing code block formatting (-1)
- No explanation of syntax (-1)
- Factual errors (-2)""",
)

print(f"Score: {result['score']}/5")
print(f"Reasoning: {result['reasoning']}")

Using response_format={"type": "json_object"} forces the model to return valid JSON every time, which eliminates the fragile string parsing from the first example. The custom rubric gives the judge explicit criteria, which reduces score variance across runs. In practice, you’ll want to run each judge call 2-3 times and take the median score for anything where you need high confidence.
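The run-several-times-and-take-the-median trick is small enough to factor into a helper. Here is a minimal sketch; judge_once is a hypothetical zero-argument callable you would build by wrapping your existing judge call (e.g. a lambda closing over the output and expected behavior):

```python
from typing import Callable


def median_judge_score(judge_once: Callable[[], int], runs: int = 3) -> int:
    # Call the judge several times and take the median score to smooth
    # out run-to-run variance. Sorting and picking the middle element
    # gives the median for odd run counts.
    scores = sorted(judge_once() for _ in range(runs))
    return scores[len(scores) // 2]
```

With the structured judge above, usage would look like `median_judge_score(lambda: judge_with_structured_output(output, expected)["score"])`. Keep runs odd so the median is an actual observed score rather than an average.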

Common Errors and Fixes

Judge scores are inconsistent across runs. Set temperature=0.0 on the judge call. If you still see variance, run the judge 3 times per test case and take the median. The gpt-4o model at temperature 0 is more deterministic than smaller models.

Parsing fails on the judge response. Switch to structured output with response_format={"type": "json_object"} as shown above. The plain-text SCORE/REASONING format works for prototyping but breaks when the model decides to add extra commentary.

All test cases pass trivially. Your expected behavior descriptions are probably too vague. “Should give a good answer” will always score 4+. Be specific: “Should include a Python code block with a try/except statement and mention at least two specific HTTP status codes (e.g., 400, 500).”

Tests are too slow. Run test cases in parallel with asyncio and the async OpenAI client (from openai import AsyncOpenAI). You can also use gpt-4o-mini as the judge for faster, cheaper runs during development, and switch to gpt-4o for the CI gate.
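A sketch of what the parallel runner might look like. It assumes you have converted run_prompt and judge_output to async (e.g. with AsyncOpenAI); they are injected here as parameters (run_fn, judge_fn are hypothetical names) so the concurrency logic itself stays independent of the SDK, and a semaphore bounds concurrency so you don't trade slowness for rate-limit errors:

```python
import asyncio
from typing import Awaitable, Callable


async def run_suite_parallel(
    run_fn: Callable[[str], Awaitable[str]],    # async version of run_prompt
    judge_fn: Callable[[str], Awaitable[int]],  # async judge returning a score
    inputs: list[str],
    max_concurrency: int = 5,
) -> list[int]:
    # Limit in-flight API calls so parallel test cases don't hit rate limits.
    sem = asyncio.Semaphore(max_concurrency)

    async def one(user_input: str) -> int:
        async with sem:
            output = await run_fn(user_input)
            return await judge_fn(output)

    # gather preserves input order, so scores line up with test cases.
    return await asyncio.gather(*(one(i) for i in inputs))
```

Call it with `asyncio.run(run_suite_parallel(...))` from the synchronous entry point.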

OpenAI rate limits during CI. Add retry logic with exponential backoff. The openai library handles transient errors automatically, but if you’re running many test cases in parallel, add a small delay between batches or use max_retries when initializing the client: client = OpenAI(max_retries=3).
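If you want an explicit backoff layer on top of (or instead of) max_retries, a generic retry helper is easy to sketch. The exception type is parameterized (retry_on) so you can pass openai.RateLimitError in real code without this snippet itself depending on the SDK; call_with_backoff is a name introduced here, not an SDK function:

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(
    fn: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    retry_on: type[BaseException] = Exception,
) -> T:
    # Retry with exponential backoff plus jitter; re-raise after the
    # final attempt so the caller still sees the real error.
    for attempt in range(max_attempts):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))
    raise RuntimeError("unreachable")
```

Usage in the runner would be along the lines of `call_with_backoff(lambda: judge_output(output, case.expected_behavior))`.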

Exit code is always 0 even when tests fail. Make sure your CI step runs the script directly (python test_runner.py), not through a shell wrapper that swallows the exit code. The exit(0 if passed else 1) in the runner handles this.