You shipped a canary. Traffic is flowing to both the baseline and the new model version. Now what? Eyeballing dashboards is not a deployment strategy. You need statistical hypothesis tests that tell you whether the canary is actually better, actually worse, or just noise.

Here is the core idea: collect prediction confidence scores from both versions, run non-parametric statistical tests, compute effect sizes, and make an automated promote/rollback decision. All exposed through a single FastAPI endpoint.

pip install fastapi uvicorn scipy numpy pydantic
# canary_analysis.py
import numpy as np
from scipy import stats

baseline_scores = np.random.normal(loc=0.85, scale=0.05, size=500)
canary_scores = np.random.normal(loc=0.87, scale=0.05, size=500)

u_stat, u_pvalue = stats.mannwhitneyu(baseline_scores, canary_scores, alternative="two-sided")
print(f"Mann-Whitney U: statistic={u_stat:.1f}, p-value={u_pvalue:.4f}")

That gives you a quick sanity check. If u_pvalue < 0.05, the distributions differ significantly. But a single test is not enough for a production decision. You want multiple tests, effect sizes, and confidence intervals before you promote anything.

Collecting Prediction Distributions

Before you can run any analysis, you need to store prediction outputs from both model versions. Keep it simple: two lists of floats representing whatever metric you care about most. Usually that is prediction confidence, latency, or a custom business score.

# metrics_collector.py
from dataclasses import dataclass, field
from threading import Lock


@dataclass
class CanaryMetricsCollector:
    baseline_scores: list[float] = field(default_factory=list)
    canary_scores: list[float] = field(default_factory=list)
    _lock: Lock = field(default_factory=Lock, repr=False)

    def record(self, version: str, score: float) -> None:
        with self._lock:
            if version == "baseline":
                self.baseline_scores.append(score)
            elif version == "canary":
                self.canary_scores.append(score)

    def sample_counts(self) -> dict[str, int]:
        # Lock so both counts come from the same instant
        with self._lock:
            return {
                "baseline": len(self.baseline_scores),
                "canary": len(self.canary_scores),
            }

    def clear(self) -> None:
        with self._lock:
            self.baseline_scores.clear()
            self.canary_scores.clear()

Thread-safe, simple, no database. In production you would back this with Redis or a time-series store, but for the analysis pipeline itself, in-memory lists work fine as long as you flush them after each analysis window.
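The flush itself can race with incoming records: clear the lists at the wrong moment and you drop scores recorded mid-analysis. A drain-style snapshot sidesteps that. This is a sketch; `WindowedCollector` and `drain` are illustrative names, not part of the collector above:

```python
from dataclasses import dataclass, field
from threading import Lock


@dataclass
class WindowedCollector:
    baseline_scores: list[float] = field(default_factory=list)
    canary_scores: list[float] = field(default_factory=list)
    _lock: Lock = field(default_factory=Lock, repr=False)

    def record(self, version: str, score: float) -> None:
        with self._lock:
            if version == "baseline":
                self.baseline_scores.append(score)
            elif version == "canary":
                self.canary_scores.append(score)

    def drain(self) -> tuple[list[float], list[float]]:
        # Atomically hand back the current window and start a fresh one,
        # so no score recorded during analysis is ever lost or double-counted.
        with self._lock:
            baseline, canary = self.baseline_scores, self.canary_scores
            self.baseline_scores, self.canary_scores = [], []
            return baseline, canary
```

Analysis then operates on the drained lists while new scores accumulate in the fresh ones.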

Running Multiple Statistical Tests

One p-value is not enough. Different tests catch different kinds of distributional differences. Mann-Whitney U detects location shifts (one distribution systematically higher than the other). Kolmogorov-Smirnov catches any distributional difference, including shape and variance changes. Use both.
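A toy demo of why you want both, using made-up data with the same center but different spread. KS reacts to the shape change; Mann-Whitney, which looks for location shifts, typically does not:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
narrow = rng.normal(loc=0.85, scale=0.02, size=500)  # same mean, low variance
wide = rng.normal(loc=0.85, scale=0.08, size=500)    # same mean, high variance

_, mw_p = stats.mannwhitneyu(narrow, wide, alternative="two-sided")
_, ks_p = stats.ks_2samp(narrow, wide)
# KS p-value is astronomically small; Mann-Whitney sees no location shift
print(f"Mann-Whitney p={mw_p:.4f}, KS p={ks_p:.2e}")
```

A canary whose variance quadrupled is a real regression even if its mean is unchanged, and only KS will flag it.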

# statistical_tests.py
import numpy as np
from scipy import stats
from dataclasses import dataclass


@dataclass
class AnalysisResult:
    mann_whitney_pvalue: float
    ks_pvalue: float
    cohens_d: float
    bootstrap_ci: tuple[float, float]
    baseline_mean: float
    canary_mean: float
    decision: str
    reason: str


def cohens_d(group1: np.ndarray, group2: np.ndarray) -> float:
    n1, n2 = len(group1), len(group2)
    var1, var2 = group1.var(ddof=1), group2.var(ddof=1)
    pooled_std = np.sqrt(((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2))
    if pooled_std == 0:
        return 0.0
    return (group2.mean() - group1.mean()) / pooled_std


def bootstrap_mean_diff_ci(
    baseline: np.ndarray,
    canary: np.ndarray,
    n_bootstrap: int = 10000,
    confidence: float = 0.95,
) -> tuple[float, float]:
    rng = np.random.default_rng(seed=42)  # fixed seed: repeated analyses give reproducible CIs
    diffs = np.empty(n_bootstrap)
    for i in range(n_bootstrap):
        b_sample = rng.choice(baseline, size=len(baseline), replace=True)
        c_sample = rng.choice(canary, size=len(canary), replace=True)
        diffs[i] = c_sample.mean() - b_sample.mean()
    alpha = 1 - confidence
    lower = float(np.percentile(diffs, 100 * alpha / 2))
    upper = float(np.percentile(diffs, 100 * (1 - alpha / 2)))
    return (lower, upper)


def analyze_canary(
    baseline_scores: list[float],
    canary_scores: list[float],
    p_threshold: float = 0.05,
    effect_threshold: float = 0.2,
    min_samples: int = 30,
) -> AnalysisResult:
    baseline = np.array(baseline_scores)
    canary = np.array(canary_scores)

    if len(baseline) < min_samples or len(canary) < min_samples:
        return AnalysisResult(
            mann_whitney_pvalue=1.0,
            ks_pvalue=1.0,
            cohens_d=0.0,
            bootstrap_ci=(0.0, 0.0),
            baseline_mean=float(baseline.mean()) if len(baseline) > 0 else 0.0,
            canary_mean=float(canary.mean()) if len(canary) > 0 else 0.0,
            decision="wait",
            reason=f"Not enough samples. Need {min_samples}, got baseline={len(baseline)}, canary={len(canary)}",
        )

    _, mw_pvalue = stats.mannwhitneyu(baseline, canary, alternative="two-sided")
    _, ks_pvalue = stats.ks_2samp(baseline, canary)
    d = cohens_d(baseline, canary)
    ci = bootstrap_mean_diff_ci(baseline, canary)

    # Decision logic: significant difference AND meaningful effect size
    significant = mw_pvalue < p_threshold or ks_pvalue < p_threshold
    large_effect = abs(d) >= effect_threshold

    if significant and large_effect:
        if d > 0:
            decision = "promote"
            reason = f"Canary is significantly better (d={d:.3f}, MW p={mw_pvalue:.4f}, KS p={ks_pvalue:.4f})"
        else:
            decision = "rollback"
            reason = f"Canary is significantly worse (d={d:.3f}, MW p={mw_pvalue:.4f}, KS p={ks_pvalue:.4f})"
    elif significant and not large_effect:
        decision = "wait"
        reason = f"Statistically significant but small effect (d={d:.3f}). Collect more data"
    else:
        decision = "wait"
        reason = f"No significant difference yet (MW p={mw_pvalue:.4f}, KS p={ks_pvalue:.4f})"

    return AnalysisResult(
        mann_whitney_pvalue=float(mw_pvalue),
        ks_pvalue=float(ks_pvalue),
        cohens_d=float(d),
        bootstrap_ci=ci,
        baseline_mean=float(baseline.mean()),
        canary_mean=float(canary.mean()),
        decision=decision,
        reason=reason,
    )

The decision logic has three outcomes. Promote when you see a statistically significant improvement with a meaningful effect size (Cohen’s d >= 0.2). Rollback when you see a significant degradation. Wait for everything else, including cases where the difference is real but tiny. A p-value of 0.001 with a Cohen’s d of 0.05 means the difference exists but does not matter in practice.
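You can see that last case with synthetic data: a huge sample makes a +0.0025 shift (roughly d = 0.05) statistically significant even though it is operationally meaningless. The numbers here are illustrative:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
baseline = rng.normal(0.85, 0.05, 20000)
canary = rng.normal(0.8525, 0.05, 20000)  # a real but negligible shift

_, p = stats.mannwhitneyu(baseline, canary, alternative="two-sided")
pooled = np.sqrt((baseline.var(ddof=1) + canary.var(ddof=1)) / 2)
d = (canary.mean() - baseline.mean()) / pooled
# Tiny p-value, tiny effect size: the decision logic says "wait", not "promote"
print(f"p={p:.2e}, d={d:.3f}")
```

This is exactly why the promote branch requires both `significant` and `large_effect`.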

Exposing the Analysis via FastAPI

Wire it all together with a FastAPI app. Use the lifespan context manager to initialize the metrics collector at startup and clean up on shutdown.

# app.py
from contextlib import asynccontextmanager
from typing import Any

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

# Import from the modules above
from metrics_collector import CanaryMetricsCollector
from statistical_tests import analyze_canary


collector = CanaryMetricsCollector()


@asynccontextmanager
async def lifespan(app: FastAPI):
    # Startup: initialize collector (already done at module level, but you
    # could load config or connect to Redis here)
    app.state.collector = collector
    yield
    # Shutdown: flush any remaining data
    collector.clear()


app = FastAPI(title="Canary Analysis Service", lifespan=lifespan)


class ScorePayload(BaseModel):
    version: str  # "baseline" or "canary"
    score: float


class AnalysisConfig(BaseModel):
    p_threshold: float = 0.05
    effect_threshold: float = 0.2
    min_samples: int = 30


@app.post("/record")
async def record_score(payload: ScorePayload) -> dict[str, str]:
    if payload.version not in ("baseline", "canary"):
        raise HTTPException(status_code=400, detail="version must be 'baseline' or 'canary'")
    collector.record(payload.version, payload.score)
    return {"status": "recorded"}


@app.post("/analyze")
async def run_analysis(config: AnalysisConfig = AnalysisConfig()) -> dict[str, Any]:
    result = analyze_canary(
        # Snapshot the lists: concurrent /record calls may append mid-analysis
        baseline_scores=list(collector.baseline_scores),
        canary_scores=list(collector.canary_scores),
        p_threshold=config.p_threshold,
        effect_threshold=config.effect_threshold,
        min_samples=config.min_samples,
    )
    counts = collector.sample_counts()
    return {
        "decision": result.decision,
        "reason": result.reason,
        "stats": {
            "mann_whitney_pvalue": result.mann_whitney_pvalue,
            "ks_pvalue": result.ks_pvalue,
            "cohens_d": result.cohens_d,
            "bootstrap_ci_95": list(result.bootstrap_ci),
            "baseline_mean": result.baseline_mean,
            "canary_mean": result.canary_mean,
        },
        "sample_counts": counts,
    }


@app.get("/status")
async def status() -> dict[str, Any]:
    return collector.sample_counts()


@app.post("/reset")
async def reset() -> dict[str, str]:
    collector.clear()
    return {"status": "cleared"}

Run it:

uvicorn app:app --host 0.0.0.0 --port 8001

Test it by posting scores and triggering analysis:

# Record some baseline scores
for i in $(seq 1 50); do
  curl -s -X POST http://localhost:8001/record \
    -H "Content-Type: application/json" \
    -d "{\"version\": \"baseline\", \"score\": 0.$(shuf -i 82-88 -n 1)}"
done

# Record canary scores (slightly better)
for i in $(seq 1 50); do
  curl -s -X POST http://localhost:8001/record \
    -H "Content-Type: application/json" \
    -d "{\"version\": \"canary\", \"score\": 0.$(shuf -i 85-92 -n 1)}"
done

# Run analysis
curl -s -X POST http://localhost:8001/analyze | python3 -m json.tool

The response tells you exactly what to do: promote, rollback, or keep waiting.
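Closing the loop means polling /analyze on a schedule and acting on the decision. Here is a sketch; the endpoint matches the app above, but the traffic-shifting actions are hypothetical placeholders for whatever your deployment tooling exposes:

```python
# canary_controller.py -- illustrative sketch, not production tooling
import json
import urllib.request


def fetch_decision(base_url: str = "http://localhost:8001") -> str:
    # POST an empty JSON body so the default AnalysisConfig thresholds apply
    req = urllib.request.Request(
        f"{base_url}/analyze",
        data=b"{}",
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["decision"]


def act_on(decision: str) -> str:
    # Placeholder actions: wire these to your real traffic controller
    actions = {
        "promote": "shift 100% of traffic to canary",
        "rollback": "shift 100% of traffic to baseline",
        "wait": "keep the current split and re-check next window",
    }
    return actions.get(decision, "unknown decision: page a human")
```

Run `act_on(fetch_decision())` from a cron job or scheduler at the end of each analysis window.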

Tuning Thresholds for Your Use Case

The default p_threshold=0.05 and effect_threshold=0.2 are reasonable starting points, but they are not gospel. Think about what matters for your specific model.

For high-stakes models (fraud detection, medical diagnosis, safety-critical systems), tighten the thresholds. Use p_threshold=0.01 and effect_threshold=0.1. You want to catch even small degradations before they hit all traffic.

For recommendation models or systems where a small quality dip is tolerable, you can relax to p_threshold=0.05 and effect_threshold=0.3. This avoids rollbacks on noise while still catching real regressions.

The min_samples parameter matters more than you think. With fewer than 30 samples per group, statistical tests have very low power. You will miss real differences. For latency comparisons where variance is high, bump this to 100 or more.
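You can measure that power loss directly by simulation. This sketch (the `detection_rate` helper and its defaults are illustrative) counts how often Mann-Whitney detects a real 0.4-standard-deviation shift at a given sample size:

```python
import numpy as np
from scipy import stats


def detection_rate(n: int, shift_sd: float = 0.4, trials: int = 200) -> float:
    # Fraction of simulated canary windows where Mann-Whitney reaches
    # p < 0.05 for a genuine shift of `shift_sd` standard deviations
    rng = np.random.default_rng(7)
    hits = 0
    for _ in range(trials):
        base = rng.normal(0.85, 0.05, n)
        can = rng.normal(0.85 + shift_sd * 0.05, 0.05, n)
        _, p = stats.mannwhitneyu(base, can, alternative="two-sided")
        hits += p < 0.05
    return hits / trials


# At n=10 the test misses the shift most of the time; at n=100 it mostly catches it
print(f"n=10: {detection_rate(10):.2f}, n=100: {detection_rate(100):.2f}")
```

The same real difference goes from "mostly invisible" to "reliably detected" purely by collecting more samples.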

# Example: strict thresholds for a fraud detection canary
strict_result = analyze_canary(
    baseline_scores=collector.baseline_scores,
    canary_scores=collector.canary_scores,
    p_threshold=0.01,
    effect_threshold=0.1,
    min_samples=100,
)
print(f"Decision: {strict_result.decision}")
print(f"Reason: {strict_result.reason}")
print(f"Cohen's d: {strict_result.cohens_d:.4f}")
print(f"95% CI for mean diff: {strict_result.bootstrap_ci}")

A good pattern is to store these thresholds per-model in a config file or environment variable. Different models in the same system might need different sensitivity levels.
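One way to do that with environment variables, sketched below. The `MODELNAME_P_THRESHOLD`-style naming convention is an assumption for illustration, not something the app above reads:

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class ThresholdConfig:
    p_threshold: float
    effect_threshold: float
    min_samples: int


def load_thresholds(model_name: str) -> ThresholdConfig:
    # e.g. FRAUD_MODEL_P_THRESHOLD=0.01 for a model named "fraud-model";
    # anything unset falls back to the defaults from analyze_canary
    prefix = model_name.upper().replace("-", "_")
    return ThresholdConfig(
        p_threshold=float(os.environ.get(f"{prefix}_P_THRESHOLD", "0.05")),
        effect_threshold=float(os.environ.get(f"{prefix}_EFFECT_THRESHOLD", "0.2")),
        min_samples=int(os.environ.get(f"{prefix}_MIN_SAMPLES", "30")),
    )
```

Pass the loaded values straight into the AnalysisConfig payload when calling /analyze.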

Common Errors and Fixes

ValueError: sample size must be greater than zero

This happens when one of the score lists is empty. The analyze_canary function guards against this with min_samples, but if you call scipy directly without checking, you will hit it.

# Wrong: calling stats directly on possibly-empty data
_, p = stats.mannwhitneyu([], [1.0, 2.0])  # ValueError

# Right: check first
if len(baseline) >= 30 and len(canary) >= 30:
    _, p = stats.mannwhitneyu(baseline, canary, alternative="two-sided")

RuntimeWarning: divide by zero encountered in scalar divide

This occurs in cohens_d when both groups have zero variance (every value is identical). The fix is the if pooled_std == 0 guard in the function above. If you see this warning, your test data is likely synthetic with no noise. Real prediction scores always have some variance.

Mann-Whitney U gives p=1.0 even though means look different

You probably have too few samples. With 10 samples per group, Mann-Whitney has almost no power to detect differences under 0.5 standard deviations. Increase your sample size to at least 30, ideally 100+.

422 Unprocessable Entity from FastAPI

You are sending the wrong JSON shape to /record or /analyze. Check that version is a string ("baseline" or "canary") and score is a float. FastAPI uses Pydantic validation and will reject anything that does not match the schema.

# Wrong: missing quotes around version
curl -X POST http://localhost:8001/record -d '{"version": baseline, "score": 0.85}'

# Right
curl -X POST http://localhost:8001/record \
  -H "Content-Type: application/json" \
  -d '{"version": "baseline", "score": 0.85}'

Bootstrap CI is very wide

Wide confidence intervals mean high variance in your data, small sample sizes, or both. The fix is always more data. If you cannot collect more, reduce n_bootstrap to speed things up but do not use it to artificially narrow the interval. The width is telling you something real about your uncertainty.
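You can watch the width shrink with sample size, roughly as 1/sqrt(n). This self-contained demo (the `ci_width` helper and sample sizes are illustrative) reuses the bootstrap logic from statistical_tests.py:

```python
import numpy as np


def ci_width(n: int, n_bootstrap: int = 2000) -> float:
    # Width of the 95% bootstrap CI for the mean difference at sample size n
    rng = np.random.default_rng(3)
    baseline = rng.normal(0.85, 0.05, n)
    canary = rng.normal(0.87, 0.05, n)
    diffs = np.empty(n_bootstrap)
    for i in range(n_bootstrap):
        b = rng.choice(baseline, size=n, replace=True)
        c = rng.choice(canary, size=n, replace=True)
        diffs[i] = c.mean() - b.mean()
    lo, hi = np.percentile(diffs, [2.5, 97.5])
    return float(hi - lo)


# Quadrupling n roughly halves the interval width, twice over
print(f"n=25: {ci_width(25):.4f}, n=400: {ci_width(400):.4f}")
```

If the n=400 interval is still wide enough to straddle "better" and "worse", that is the data telling you the canary decision genuinely cannot be made yet.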