Shipping an LLM prompt change without testing it against the current version is like deploying code without running the test suite. You might get lucky. You probably won’t. A/B testing LLM prompts gives you real numbers – quality scores, latency, cost – so you stop guessing which prompt “feels better” and start measuring which one actually performs.

The core pattern is straightforward: split production traffic between prompt variants, score every response with automated evaluators, and use statistical tests to determine whether the difference is real or noise.

The Minimal A/B Test Setup

You don’t need a platform to start. Here’s a self-contained A/B testing harness that logs everything you need for analysis:

import random
import time
import json
import hashlib
from dataclasses import dataclass, asdict
from openai import OpenAI

client = OpenAI()

VARIANTS = {
    "control": {
        "system": "You are a helpful assistant. Answer the user's question clearly and concisely.",
        "model": "gpt-4o",
    },
    "treatment": {
        "system": (
            "You are a senior technical writer. Answer the user's question with "
            "precision. Use bullet points for multi-part answers. Cite sources when possible."
        ),
        "model": "gpt-4o",
    },
}

@dataclass
class ABResult:
    variant: str
    user_id: str
    prompt_hash: str
    input_text: str
    output_text: str
    latency_ms: float
    input_tokens: int
    output_tokens: int
    total_cost: float
    quality_score: float | None = None

def assign_variant(user_id: str) -> str:
    """Deterministic assignment -- same user always gets the same variant."""
    h = int(hashlib.sha256(user_id.encode()).hexdigest(), 16)
    return "treatment" if h % 100 < 50 else "control"

def run_ab_call(user_id: str, user_input: str) -> ABResult:
    variant_name = assign_variant(user_id)
    variant = VARIANTS[variant_name]

    start = time.perf_counter()
    response = client.chat.completions.create(
        model=variant["model"],
        messages=[
            {"role": "system", "content": variant["system"]},
            {"role": "user", "content": user_input},
        ],
    )
    latency = (time.perf_counter() - start) * 1000

    usage = response.usage
    # Approximate cost for gpt-4o: $2.50/1M input, $10/1M output
    cost = (usage.prompt_tokens * 2.50 + usage.completion_tokens * 10.0) / 1_000_000

    return ABResult(
        variant=variant_name,
        user_id=user_id,
        prompt_hash=hashlib.sha256(variant["system"].encode()).hexdigest()[:12],
        input_text=user_input,
        output_text=response.choices[0].message.content,
        latency_ms=round(latency, 2),
        input_tokens=usage.prompt_tokens,
        output_tokens=usage.completion_tokens,
        total_cost=round(cost, 6),
    )

The assign_variant function uses a hash of the user ID for deterministic assignment. This matters: if you use random.choice per request, the same user might see variant A on one call and variant B on the next, which contaminates your results. Hash-based splitting ensures consistent user experience and clean experiment data.
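This is also why the harness imports asdict: each ABResult can be appended to a JSONL log for offline analysis. A minimal sketch — the helper name and log path are illustrative, not from any library:

```python
import json
from dataclasses import asdict, is_dataclass

def log_result(result, path: str = "ab_results.jsonl") -> None:
    """Append one dataclass result (e.g. an ABResult) as a JSON line."""
    if not is_dataclass(result):
        raise TypeError("log_result expects a dataclass instance")
    with open(path, "a") as f:
        f.write(json.dumps(asdict(result)) + "\n")
```

One JSON object per line keeps the log append-only and trivially parseable later, whether you load it with pandas or a plain loop.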

Scoring Responses Automatically

Raw A/B data is useless without quality scores. You need automated evaluators that grade every response. LLM-as-a-judge is the most practical approach for subjective quality – use a separate model to score outputs on dimensions like accuracy, helpfulness, and formatting.

JUDGE_PROMPT = """Rate the following AI response on a scale of 1-5 for each criterion.
Return ONLY a JSON object with numeric scores.

User question: {question}
AI response: {response}

Criteria:
- accuracy: Is the information correct? (1=wrong, 5=fully correct)
- helpfulness: Does it actually answer the question? (1=useless, 5=perfectly helpful)
- conciseness: Is it appropriately concise? (1=bloated, 5=precisely right length)
"""

def score_response(question: str, response: str) -> dict[str, int]:
    result = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are an impartial quality evaluator."},
            {
                "role": "user",
                "content": JUDGE_PROMPT.format(question=question, response=response),
            },
        ],
        response_format={"type": "json_object"},
    )
    try:
        scores = json.loads(result.choices[0].message.content)
        return {k: int(v) for k, v in scores.items() if k in ("accuracy", "helpfulness", "conciseness")}
    except (json.JSONDecodeError, ValueError) as e:
        print(f"Judge failed to return valid JSON: {e}")
        return {"accuracy": 0, "helpfulness": 0, "conciseness": 0}

Use a cheaper, faster model for the judge (like gpt-4o-mini) to keep costs down. The judge evaluates every response from both variants using the same criteria, so you’re comparing apples to apples.

A common mistake: using the same model as both the test subject and the judge. This creates bias – GPT-4o tends to rate GPT-4o outputs more favorably than Claude does, and vice versa. If you’re testing across model families, use a third model as judge or combine automated scoring with human evaluation.
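Once every response is scored, comparing variants starts with a per-criterion average. A small sketch — aggregate_scores and its input shape are illustrative, not part of any library:

```python
from collections import defaultdict
from statistics import mean

def aggregate_scores(results: list[dict]) -> dict[str, dict[str, float]]:
    """Average each judge criterion per variant.

    Each result dict looks like:
      {"variant": "control", "scores": {"accuracy": 4, "helpfulness": 3, ...}}
    """
    by_variant: dict[str, dict[str, list[int]]] = defaultdict(lambda: defaultdict(list))
    for r in results:
        for criterion, value in r["scores"].items():
            by_variant[r["variant"]][criterion].append(value)
    return {
        variant: {criterion: round(mean(vals), 3) for criterion, vals in crits.items()}
        for variant, crits in by_variant.items()
    }
```

These per-variant means are descriptive only — the statistical test in the next section decides whether the gap between them is real.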

Using Langfuse for Managed A/B Tests

If you don’t want to build logging and dashboards from scratch, Langfuse handles prompt versioning, traffic splitting, and metric tracking out of the box. Create two labeled versions of the same prompt and let your app randomly select between them:

from langfuse import Langfuse
import random
from openai import OpenAI

langfuse = Langfuse()
client = OpenAI()

# Pull both prompt variants from Langfuse's prompt registry
prompt_a = langfuse.get_prompt("summarizer", label="control")
prompt_b = langfuse.get_prompt("summarizer", label="treatment")

def handle_request(user_input: str) -> str:
    # random.choice for brevity; in production, assign deterministically
    # per user (see assign_variant above)
    label, selected = random.choice([("control", prompt_a), ("treatment", prompt_b)])

    compiled = selected.compile(input_text=user_input)

    # The prompt kwarg links this trace to the fetched prompt version in Langfuse
    generation = langfuse.generation(
        name="ab-test-summarizer",
        input=user_input,
        prompt=selected,
        metadata={"prompt_label": label},
    )

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": compiled}],
    )

    output = response.choices[0].message.content
    generation.end(output=output)

    return output

Langfuse automatically tracks latency, token usage, and cost per prompt version. You can then filter by prompt label in the Langfuse dashboard to compare variants side by side. Add custom scores (from your LLM judge or user feedback) by calling generation.score(name="accuracy", value=4) on each trace.

Statistical Analysis: Picking the Winner

Here’s where most teams get it wrong. They look at the average score for each variant and pick the higher one. That’s not a test – that’s a coin flip. You need to verify the difference is statistically significant.

For continuous metrics like quality scores, use the Mann-Whitney U test (it doesn’t assume normal distributions, which LLM scores rarely follow):

import numpy as np
from scipy import stats

def analyze_ab_results(
    scores_a: list[float], scores_b: list[float], alpha: float = 0.05
) -> dict:
    """Compare two variants using Mann-Whitney U test and bootstrap CI."""
    stat, p_value = stats.mannwhitneyu(scores_a, scores_b, alternative="two-sided")

    mean_a, mean_b = np.mean(scores_a), np.mean(scores_b)
    diff = mean_b - mean_a

    # Bootstrap 95% confidence interval for the difference in means
    n_bootstrap = 10_000
    diffs = []
    for _ in range(n_bootstrap):
        sample_a = np.random.choice(scores_a, size=len(scores_a), replace=True)
        sample_b = np.random.choice(scores_b, size=len(scores_b), replace=True)
        diffs.append(np.mean(sample_b) - np.mean(sample_a))

    ci_lower = np.percentile(diffs, 2.5)
    ci_upper = np.percentile(diffs, 97.5)

    return {
        "mean_control": round(mean_a, 4),
        "mean_treatment": round(mean_b, 4),
        "difference": round(diff, 4),
        "p_value": round(p_value, 4),
        "ci_95": (round(ci_lower, 4), round(ci_upper, 4)),
        "significant": p_value < alpha,
        "sample_sizes": (len(scores_a), len(scores_b)),
    }

# Example usage
control_scores = [4.2, 3.8, 4.5, 4.0, 3.9, 4.1, 4.3, 3.7, 4.4, 4.0]  # from your logs
treatment_scores = [4.5, 4.3, 4.7, 4.1, 4.6, 4.4, 4.8, 4.2, 4.5, 4.3]

result = analyze_ab_results(control_scores, treatment_scores)
print(f"Control mean: {result['mean_control']}")
print(f"Treatment mean: {result['mean_treatment']}")
print(f"Difference: {result['difference']}")
print(f"p-value: {result['p_value']}")
print(f"95% CI: {result['ci_95']}")
print(f"Statistically significant: {result['significant']}")

A p-value below 0.05 means that if the two variants truly performed identically, you’d see a difference at least this large less than 5% of the time. But also look at the confidence interval – if it’s wide (e.g., -0.1 to +0.8), you don’t have enough data yet. Narrow CIs give you confidence in the magnitude of the effect, not just its existence.

Sample Size: How Many Calls You Actually Need

The biggest mistake in LLM A/B testing is calling it too early. LLM outputs are stochastic – the same prompt with the same input can produce different quality scores across runs. You need enough samples to see through this noise.

A rough guide: start with at least 100 scored responses per variant for quality metrics. For high-variance tasks (creative writing, open-ended Q&A), aim for 200-500. For low-variance tasks (classification, extraction), 50-100 might suffice.

You can compute the required sample size before starting:

import math

from scipy.stats import norm

def required_sample_size(
    expected_lift: float,
    std_dev: float,
    alpha: float = 0.05,
    power: float = 0.8,
) -> int:
    """Minimum samples per variant for a two-sample test."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    effect_size = expected_lift / std_dev
    n = 2 * ((z_alpha + z_beta) / effect_size) ** 2
    return math.ceil(n)

# Example: detecting a 0.3-point lift on a 1-5 scale with std_dev=0.5
n = required_sample_size(expected_lift=0.3, std_dev=0.5)
print(f"Need {n} samples per variant")  # ~44 per variant

If you’re expecting a small improvement (0.1 points on a 5-point scale), you’ll need hundreds of samples. If the improvement is large (0.5+ points), 30-50 per variant might be enough. Don’t peek at results and stop early when the numbers look good – that inflates your false positive rate.
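The inflation from peeking is easy to demonstrate with an A/A simulation: draw both groups from the same distribution (so any “significant” result is a false positive) and compare checking once at the end versus checking repeatedly as data arrives. This is an illustrative sketch; the exact inflation depends on how often you peek.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

def false_positive_rate(peek: bool, n_experiments: int = 500, n_max: int = 200) -> float:
    """Fraction of A/A experiments (no true difference) declared significant."""
    hits = 0
    for _ in range(n_experiments):
        # Both "variants" come from the same distribution
        a = rng.normal(4.0, 0.5, n_max)
        b = rng.normal(4.0, 0.5, n_max)
        if peek:
            # Check every 20 samples and stop at the first significant result
            hits += any(
                stats.mannwhitneyu(a[:n], b[:n]).pvalue < 0.05
                for n in range(20, n_max + 1, 20)
            )
        else:
            hits += stats.mannwhitneyu(a, b).pvalue < 0.05
    return hits / n_experiments

fpr_fixed = false_positive_rate(peek=False)
fpr_peeking = false_positive_rate(peek=True)
print(f"Fixed-horizon FPR: {fpr_fixed:.3f}")    # hovers near the nominal 0.05
print(f"Peeking FPR:       {fpr_peeking:.3f}")  # several times higher
```

The fixed-horizon rate stays near the nominal 5%, while the peeking rate climbs well above it – every extra look is another chance for noise to cross the significance threshold.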

Common Pitfalls and How to Fix Them

Inconsistent user assignment. If you use random.choice per request instead of deterministic hashing, the same user bounces between variants. This adds noise and makes per-user analysis impossible. Always hash on a stable identifier (user ID, session ID).

Testing too many things at once. Changing the system prompt, the model, and the temperature simultaneously means you can’t attribute improvements to any single change. Test one variable at a time, or use multivariate testing frameworks that can decompose effects.

Ignoring cost and latency. A prompt variant that scores 5% higher on quality but costs 3x more or adds 2 seconds of latency might not be worth it. Track all three metrics and make decisions on the composite picture.
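One way to weigh that trade-off is a simple composite score. The function and its weights below are arbitrary placeholders, not a standard formula – tune them to your product’s priorities:

```python
def composite_score(
    quality: float,
    cost_usd: float,
    latency_ms: float,
    w_cost: float = 50.0,       # illustrative weight: penalty per dollar
    w_latency: float = 0.0005,  # illustrative weight: penalty per ms
) -> float:
    """Higher is better: quality rewarded, cost and latency penalized."""
    return quality - w_cost * cost_usd - w_latency * latency_ms

# A variant that scores higher but costs 3x and adds 2s of latency can still lose:
control = composite_score(quality=4.0, cost_usd=0.002, latency_ms=800)
treatment = composite_score(quality=4.2, cost_usd=0.006, latency_ms=2800)
```

With these weights, control nets 3.5 against treatment’s 2.5 – the quality lift doesn’t cover the cost and latency penalties.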

Judge model drift. If your LLM judge model gets updated mid-experiment, scores from before and after the update aren’t comparable. Pin your judge to a specific model version (e.g., gpt-4o-mini-2024-07-18 instead of gpt-4o-mini) for the duration of the test.

No baseline validation. Before running an A/B test, run your evaluation pipeline on the same variant twice (A/A test). If the A/A test shows a significant difference, your evaluation methodology is broken – fix that before testing real changes.
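An A/A check can reuse the same statistics as the main analysis: shuffle one variant’s logged scores, split them in half, and confirm the test finds no difference. A sketch, with the split logic illustrative:

```python
import numpy as np
from scipy import stats

def aa_test(scores: list[float], seed: int = 0, alpha: float = 0.05) -> bool:
    """Shuffle one variant's scores, split in half, and compare the halves.

    Returns True if the pipeline (correctly) finds no significant difference.
    """
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(scores)
    half = len(shuffled) // 2
    _, p_value = stats.mannwhitneyu(
        shuffled[:half], shuffled[half:], alternative="two-sided"
    )
    # A "significant" A/A result signals a broken evaluation methodology
    return p_value >= alpha
```

Run it over several random splits: roughly alpha of them will flag a difference by pure chance, so worry only when the failure rate sits well above that.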

Running A/B Tests in CI

Once you have automated evaluation working, integrate it into your deployment pipeline. Run every prompt change against a golden dataset before it hits production:

# In your CI pipeline
python run_eval.py --variant control --dataset golden_set.jsonl --output results_control.json
python run_eval.py --variant treatment --dataset golden_set.jsonl --output results_treatment.json
python compare_results.py --control results_control.json --treatment results_treatment.json --threshold 0.05

If the treatment variant doesn’t show a statistically significant improvement (or shows a regression), block the deployment. This catches prompt regressions the same way unit tests catch code regressions.
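The compare_results.py name above is a placeholder; a minimal version just loads the two score files, runs the significance test, and uses the exit code to gate the pipeline. This sketch assumes each results file is a JSON array of quality scores – an assumption about what your eval script writes out:

```python
import json

from scipy import stats

def gate(control_path: str, treatment_path: str, alpha: float = 0.05) -> int:
    """Exit code 0 = ship the treatment, 1 = block the deployment."""
    with open(control_path) as f:
        control = json.load(f)
    with open(treatment_path) as f:
        treatment = json.load(f)

    _, p_value = stats.mannwhitneyu(control, treatment, alternative="two-sided")
    mean_c = sum(control) / len(control)
    mean_t = sum(treatment) / len(treatment)

    # Ship only on a statistically significant improvement
    if mean_t > mean_c and p_value < alpha:
        print(f"OK to deploy (p={p_value:.4f})")
        return 0
    print(f"Blocked: no significant improvement (p={p_value:.4f})")
    return 1
```

Wire it to the command line with sys.argv or argparse and return the result via sys.exit, so a blocked comparison fails the CI step the same way a failing unit test would; alpha plays the role of the --threshold flag above.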

Tools like promptfoo can automate this entire flow with a YAML config that defines prompts, test cases, and assertions. It integrates with CI/CD systems and produces comparison reports.

When to Stop the Test

End the experiment when one of these conditions is met:

  • You’ve reached your pre-computed sample size and the result is significant (ship the winner)
  • You’ve reached your sample size and the result is not significant (the variants perform the same – keep whichever is cheaper or faster)
  • The treatment variant is clearly worse (quality dropped significantly) – kill it early to protect user experience

Don’t run tests indefinitely. Set a maximum duration (e.g., 2 weeks) and commit to a decision at the end. Perpetual experiments waste traffic on suboptimal prompts.