Changing a single word in a prompt can tank your LLM app’s accuracy. Without versioning and regression tests, you’re flying blind every time you edit a system prompt. The fix is straightforward: store prompts as YAML files in git, build a test suite that runs your prompts against known inputs, and compare outputs across versions.

Here’s a minimal prompt file and the code to load it:

# prompts/summarize/v2.yaml
name: summarize
version: "2"
model: gpt-4o-mini
temperature: 0.3
system: |
  You are a technical summarizer. Summarize the given text in exactly
  3 bullet points. Each bullet must be one sentence. Focus on actionable
  information and skip background context.
test_cases:
  - input: "Kubernetes 1.30 introduces new sidecar container support, allowing init containers to run for the lifetime of a pod. This simplifies service mesh and logging agent patterns that previously required workarounds. The feature graduated to beta and is enabled by default."
    expected_contains:
      - "sidecar"
      - "init container"
    min_bullets: 3
    max_bullets: 3
# prompt_loader.py
import yaml
from pathlib import Path
from dataclasses import dataclass


@dataclass
class PromptVersion:
    name: str
    version: str
    model: str
    temperature: float
    system: str
    test_cases: list[dict]


def load_prompt(prompt_path: str) -> PromptVersion:
    path = Path(prompt_path)
    with open(path) as f:
        data = yaml.safe_load(f)
    return PromptVersion(
        name=data["name"],
        version=data["version"],
        model=data["model"],
        temperature=data["temperature"],
        system=data["system"],
        test_cases=data.get("test_cases", []),
    )

This gives you git-diffable prompt files with embedded test cases. Every change shows up in your commit history with full context.

Setting Up the Regression Test Runner

The core idea: call the LLM with your prompt, then check the output against expected behaviors. Not exact string matching (LLM outputs vary), but structural and semantic checks.

# test_runner.py
from openai import OpenAI
from prompt_loader import PromptVersion

client = OpenAI()


def run_prompt(prompt: PromptVersion, user_input: str) -> str:
    response = client.chat.completions.create(
        model=prompt.model,
        temperature=prompt.temperature,
        messages=[
            {"role": "system", "content": prompt.system},
            {"role": "user", "content": user_input},
        ],
    )
    return response.choices[0].message.content


def check_contains(output: str, expected_terms: list[str]) -> dict:
    results = {}
    lower_output = output.lower()
    for term in expected_terms:
        results[term] = term.lower() in lower_output
    return results


def check_bullet_count(output: str, min_count: int, max_count: int) -> bool:
    # Count lines that look like list items: "-", "*", or a numbered prefix like "1."
    bullets = [
        line.strip()
        for line in output.split("\n")
        if line.strip().startswith(("-", "*")) or line.strip()[:1].isdigit()
    ]
    return min_count <= len(bullets) <= max_count


def score_test_case(prompt: PromptVersion, test_case: dict) -> dict:
    output = run_prompt(prompt, test_case["input"])

    contains_results = {}
    if "expected_contains" in test_case:
        contains_results = check_contains(output, test_case["expected_contains"])

    bullet_ok = True
    if "min_bullets" in test_case:
        bullet_ok = check_bullet_count(
            output,
            test_case["min_bullets"],
            test_case.get("max_bullets", test_case["min_bullets"]),  # default: exact count
        )

    all_passed = all(contains_results.values()) and bullet_ok

    return {
        "output": output,
        "contains": contains_results,
        "bullet_count_ok": bullet_ok,
        "passed": all_passed,
    }

Writing Pytest Regression Tests

Wrap the runner in pytest so it plugs into your CI pipeline. Each test case in the YAML becomes a parameterized test.

# tests/test_prompts.py
import pytest
from prompt_loader import load_prompt
from test_runner import score_test_case

PROMPT_FILES = [
    "prompts/summarize/v2.yaml",
    "prompts/classify/v1.yaml",
]


def gather_test_cases():
    cases = []
    for path in PROMPT_FILES:
        prompt = load_prompt(path)
        for i, tc in enumerate(prompt.test_cases):
            cases.append(
                pytest.param(prompt, tc, id=f"{prompt.name}-v{prompt.version}-case{i}")
            )
    return cases


@pytest.mark.parametrize("prompt,test_case", gather_test_cases())
def test_prompt_regression(prompt, test_case):
    result = score_test_case(prompt, test_case)

    for term, found in result["contains"].items():
        assert found, f"Expected term '{term}' not found in output: {result['output'][:200]}"

    assert result["bullet_count_ok"], f"Bullet count outside range in output: {result['output'][:200]}"

Run it with:

pytest tests/test_prompts.py -v --tb=short

When you change a prompt, the test suite catches regressions. If v3 of your summarize prompt suddenly stops producing bullet points, the test fails before you ship it.
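
Because the suite is plain pytest, wiring it into CI is a short workflow file. A sketch for GitHub Actions follows; the workflow name, trigger, Python version, and OPENAI_API_KEY secret name are assumptions to adapt to your setup:

```yaml
# .github/workflows/prompt-tests.yml -- illustrative sketch, adjust to your project
name: prompt-regression
on: [pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install openai pyyaml pytest sentence-transformers
      - run: pytest tests/test_prompts.py -v --tb=short
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
```

Running this on pull requests means a prompt edit can't merge until its regression suite passes.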

Adding Semantic Similarity Scoring

Keyword checks catch structural issues but miss meaning drift. Add a semantic similarity check using sentence embeddings to catch cases where the output says the right thing differently, or says the wrong thing with the right keywords.

# semantic_scorer.py
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")


def semantic_similarity(text_a: str, text_b: str) -> float:
    embeddings = model.encode([text_a, text_b], convert_to_tensor=True)
    score = util.cos_sim(embeddings[0], embeddings[1]).item()
    return score


def passes_similarity_threshold(output: str, reference: str, threshold: float = 0.75) -> bool:
    score = semantic_similarity(output, reference)
    return score >= threshold

Add a reference_output field to your YAML test cases and wire it into the test:

# In tests/test_prompts.py, add a second parameterized test:
from semantic_scorer import passes_similarity_threshold


@pytest.mark.parametrize("prompt,test_case", gather_test_cases())
def test_prompt_semantic_regression(prompt, test_case):
    if "reference_output" not in test_case:
        pytest.skip("No reference output for semantic test")

    result = score_test_case(prompt, test_case)
    sim_ok = passes_similarity_threshold(
        result["output"],
        test_case["reference_output"],
        threshold=test_case.get("similarity_threshold", 0.75),
    )
    assert sim_ok, (
        f"Semantic similarity below threshold. "
        f"Output: {result['output'][:200]}"
    )
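
For reference, a test case with these fields might look like this in the YAML; the reference text itself is illustrative, not a canonical model output:

```yaml
# prompts/summarize/v2.yaml -- one test case extended with a reference output
test_cases:
  - input: "Kubernetes 1.30 introduces new sidecar container support, ..."
    expected_contains:
      - "sidecar"
    reference_output: |
      - Kubernetes 1.30 lets init containers run for the lifetime of a pod, enabling sidecars.
      - This simplifies service mesh and logging agent patterns.
      - The feature is beta and enabled by default.
    similarity_threshold: 0.75
```

A good source for reference_output is a known-good response from your current production prompt version.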

Comparing Outputs Across Prompt Versions

When you create a new prompt version, run both side by side and compare. This is the most useful pattern for catching subtle regressions.

# compare_versions.py
from prompt_loader import load_prompt
from test_runner import run_prompt
from semantic_scorer import semantic_similarity


def compare_prompt_versions(old_path: str, new_path: str):
    old_prompt = load_prompt(old_path)
    new_prompt = load_prompt(new_path)

    print(f"Comparing {old_prompt.name} v{old_prompt.version} -> v{new_prompt.version}")
    print("=" * 60)

    for i, tc in enumerate(old_prompt.test_cases):
        old_output = run_prompt(old_prompt, tc["input"])
        new_output = run_prompt(new_prompt, tc["input"])

        sim = semantic_similarity(old_output, new_output)

        print(f"\nTest case {i}:")
        print(f"  Old output: {old_output[:150]}...")
        print(f"  New output: {new_output[:150]}...")
        print(f"  Similarity: {sim:.3f}")

        if sim < 0.7:
            print("  WARNING: Significant output drift detected")


if __name__ == "__main__":
    compare_prompt_versions("prompts/summarize/v1.yaml", "prompts/summarize/v2.yaml")

A clean directory structure for this setup looks like:

project/
  prompts/
    summarize/
      v1.yaml
      v2.yaml
    classify/
      v1.yaml
  tests/
    test_prompts.py
  prompt_loader.py
  test_runner.py
  semantic_scorer.py
  compare_versions.py

Each prompt gets its own directory. New versions are new files. Git handles the history, and you can always diff v1.yaml against v2.yaml to see exactly what changed.

Common Errors and Fixes

yaml.scanner.ScannerError: mapping values are not allowed here – Your YAML multiline string is probably missing the | pipe character. System prompts need system: | followed by indented text on the next line.

Tests pass locally but fail in CI – LLM outputs are non-deterministic even at low temperature. Set temperature: 0 for regression tests, and use broad checks (keyword contains) rather than exact match. If you need deterministic tests during development, cache LLM responses to disk and replay them.
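
One way to do that caching, sketched with illustrative names (a real setup might also key the cache on the model and system prompt, not just the arguments):

```python
# response_cache.py -- illustrative sketch of disk-based response replay
import hashlib
import json
from pathlib import Path


def cached_call(fn, *args, cache_dir: Path = Path(".llm_cache")) -> str:
    """Return fn(*args), replaying a cached result from disk when one exists."""
    cache_dir.mkdir(exist_ok=True)
    # Key the cache entry on a stable hash of the arguments.
    key = hashlib.sha256(json.dumps(args, sort_keys=True).encode()).hexdigest()
    cache_file = cache_dir / f"{key}.json"
    if cache_file.exists():
        return json.loads(cache_file.read_text())["output"]
    output = fn(*args)
    cache_file.write_text(json.dumps({"output": output}))
    return output
```

During development you would wrap run_prompt with this, then delete the cache directory whenever you want fresh responses.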

openai.RateLimitError during test runs – Run the suite serially (pytest's default) rather than in parallel with pytest-xdist, and add a short sleep between test cases. For large suites, batch requests or use the OpenAI Batch API.
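
If sleeping between cases isn't enough, a small retry-with-backoff wrapper absorbs transient rate limits. This is a generic sketch; in real code you would pass retryable=(openai.RateLimitError,) instead of the catch-all default:

```python
# retry.py -- illustrative exponential-backoff wrapper for flaky API calls
import random
import time


def with_retries(fn, max_attempts: int = 5, base_delay: float = 1.0,
                 retryable: tuple = (Exception,)):
    """Call fn(), retrying with exponential backoff plus jitter on retryable errors."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except retryable:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            time.sleep(base_delay * (2 ** attempt) + random.random() * 0.1)
```

Wrap the LLM call site, e.g. with_retries(lambda: run_prompt(prompt, user_input)).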

Semantic similarity scores are low for correct outputs – The all-MiniLM-L6-v2 model works well for general English. If your domain uses heavy jargon, switch to a domain-tuned embedding model or lower the threshold for the affected test cases. A threshold of 0.70-0.80 is a reasonable starting range.

ModuleNotFoundError: No module named 'sentence_transformers' – Install with pip install sentence-transformers. This pulls in torch as a dependency, which can add a gigabyte or more to the environment. For CI, cache the pip install or use a pre-built Docker image.