You trained a new model. It looks better on your test set. But will it actually perform better in production with real traffic? The only way to know is to A/B test it.

Here’s what we’re building: a FastAPI service that splits incoming prediction requests between two model versions, logs everything, and lets you run statistical tests to pick a winner.

Install the dependencies first:

pip install fastapi uvicorn scikit-learn numpy scipy pydantic

Train Two Model Versions

Before we build the API, we need two models to compare. We’ll train a baseline LogisticRegression (model A) and a variant with stronger regularization, C=0.1 (model B), on the Iris dataset.

# train_models.py
import pickle
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Model A: default LogisticRegression
model_a = LogisticRegression(max_iter=200, random_state=42)
model_a.fit(X_train, y_train)
print(f"Model A accuracy: {accuracy_score(y_test, model_a.predict(X_test)):.3f}")

# Model B: different regularization strength
model_b = LogisticRegression(C=0.1, max_iter=200, random_state=42)
model_b.fit(X_train, y_train)
print(f"Model B accuracy: {accuracy_score(y_test, model_b.predict(X_test)):.3f}")

with open("model_a.pkl", "wb") as f:
    pickle.dump(model_a, f)

with open("model_b.pkl", "wb") as f:
    pickle.dump(model_b, f)

print("Saved model_a.pkl and model_b.pkl")

Run it:

python train_models.py

You’ll get two pickle files on disk. Now we build the API around them.

Build the FastAPI A/B Testing Service

The core idea is simple: each request gets randomly assigned to model A or model B based on a weight you configure. We use FastAPI’s lifespan context manager to load both models at startup and clean up on shutdown.

# app.py
import pickle
import random
import time
from contextlib import asynccontextmanager
from typing import Any

import numpy as np
from fastapi import FastAPI
from pydantic import BaseModel

# --- Schemas ---

class PredictionRequest(BaseModel):
    features: list[float]

class PredictionResponse(BaseModel):
    prediction: int
    model_version: str
    latency_ms: float

class ABConfig(BaseModel):
    model_a_weight: float = 0.5
    model_b_weight: float = 0.5

# --- Traffic Splitter ---

class TrafficSplitter:
    def __init__(self, weight_a: float = 0.5):
        self.weight_a = weight_a

    def assign(self) -> str:
        return "model_a" if random.random() < self.weight_a else "model_b"

    def update_weights(self, weight_a: float) -> None:
        self.weight_a = weight_a

# --- Prediction Logger ---

prediction_log: list[dict[str, Any]] = []

def log_prediction(model_version: str, prediction: int, latency_ms: float, features: list[float]) -> None:
    prediction_log.append({
        "model_version": model_version,
        "prediction": prediction,
        "latency_ms": latency_ms,
        "features": features,
        "timestamp": time.time(),
    })

# --- App Setup with Lifespan ---

models: dict[str, Any] = {}
splitter = TrafficSplitter(weight_a=0.5)

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Load models on startup
    with open("model_a.pkl", "rb") as f:
        models["model_a"] = pickle.load(f)
    with open("model_b.pkl", "rb") as f:
        models["model_b"] = pickle.load(f)
    print(f"Loaded {len(models)} models")
    yield
    # Cleanup on shutdown
    models.clear()
    print("Models unloaded")

app = FastAPI(title="Model A/B Testing Framework", lifespan=lifespan)

# --- Endpoints ---

@app.post("/predict", response_model=PredictionResponse)
async def predict(request: PredictionRequest):
    model_version = splitter.assign()
    model = models[model_version]

    start = time.perf_counter()
    features_array = np.array(request.features).reshape(1, -1)
    prediction = int(model.predict(features_array)[0])
    latency_ms = (time.perf_counter() - start) * 1000

    log_prediction(model_version, prediction, latency_ms, request.features)

    return PredictionResponse(
        prediction=prediction,
        model_version=model_version,
        latency_ms=round(latency_ms, 3),
    )

@app.get("/logs")
async def get_logs():
    return {"total": len(prediction_log), "logs": prediction_log[-100:]}

@app.put("/config")
async def update_config(config: ABConfig):
    # model_b receives the remaining traffic, so only weight_a needs to be stored
    splitter.update_weights(config.model_a_weight)
    return {"message": f"Updated weights: A={config.model_a_weight}, B={config.model_b_weight}"}

@app.get("/stats")
async def get_stats():
    if not prediction_log:
        return {"message": "No predictions logged yet"}

    a_logs = [l for l in prediction_log if l["model_version"] == "model_a"]
    b_logs = [l for l in prediction_log if l["model_version"] == "model_b"]

    return {
        "model_a": {
            "count": len(a_logs),
            "avg_latency_ms": round(np.mean([l["latency_ms"] for l in a_logs]), 3) if a_logs else 0,
        },
        "model_b": {
            "count": len(b_logs),
            "avg_latency_ms": round(np.mean([l["latency_ms"] for l in b_logs]), 3) if b_logs else 0,
        },
    }

Start the server:

uvicorn app:app --host 0.0.0.0 --port 8000

Send a test prediction:

curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{"features": [5.1, 3.5, 1.4, 0.2]}'

The response tells you which model handled the request:

{"prediction": 0, "model_version": "model_a", "latency_ms": 0.142}

Analyze Results with Statistical Tests

After collecting enough predictions, you need to determine whether one model is actually better. If you’re comparing accuracy against known labels, use a chi-squared test on the correct/incorrect counts. If you’re comparing latency, use Welch’s t-test.

# analyze.py
import requests
import numpy as np
from scipy import stats

# Fetch logs from the running server
response = requests.get("http://localhost:8000/logs")
logs = response.json()["logs"]

a_latencies = [l["latency_ms"] for l in logs if l["model_version"] == "model_a"]
b_latencies = [l["latency_ms"] for l in logs if l["model_version"] == "model_b"]

print(f"Model A: {len(a_latencies)} requests, avg latency {np.mean(a_latencies):.3f} ms")
print(f"Model B: {len(b_latencies)} requests, avg latency {np.mean(b_latencies):.3f} ms")

# Welch's t-test for latency comparison
if len(a_latencies) >= 30 and len(b_latencies) >= 30:
    t_stat, p_value = stats.ttest_ind(a_latencies, b_latencies, equal_var=False)
    print(f"\nWelch's t-test: t={t_stat:.4f}, p={p_value:.4f}")
    if p_value < 0.05:
        winner = "Model A" if np.mean(a_latencies) < np.mean(b_latencies) else "Model B"
        print(f"Statistically significant difference (p < 0.05). {winner} is faster.")
    else:
        print("No statistically significant difference in latency.")
else:
    print(f"\nNeed at least 30 samples per model. A has {len(a_latencies)}, B has {len(b_latencies)}.")

For accuracy comparison, you need ground truth labels. Here’s how to run a chi-squared test when you have them:

# accuracy_test.py
import numpy as np
from scipy.stats import chi2_contingency

# Example: you logged predictions and later got ground truth labels
# These would come from your labeling pipeline in practice
model_a_correct = 142
model_a_incorrect = 18
model_b_correct = 137
model_b_incorrect = 23

contingency_table = np.array([
    [model_a_correct, model_a_incorrect],
    [model_b_correct, model_b_incorrect],
])

chi2, p_value, dof, expected = chi2_contingency(contingency_table)

print(f"Model A accuracy: {model_a_correct / (model_a_correct + model_a_incorrect):.3f}")
print(f"Model B accuracy: {model_b_correct / (model_b_correct + model_b_incorrect):.3f}")
print(f"Chi-squared: {chi2:.4f}, p-value: {p_value:.4f}, dof: {dof}")

if p_value < 0.05:
    a_acc = model_a_correct / (model_a_correct + model_a_incorrect)
    b_acc = model_b_correct / (model_b_correct + model_b_incorrect)
    winner = "Model A" if a_acc > b_acc else "Model B"
    print(f"Significant difference. {winner} has higher accuracy.")
else:
    print("No significant difference in accuracy between models.")

A few things to watch out for: don’t check significance too early. You need at least 30 samples per variant for the t-test to be meaningful, and for chi-squared you want at least 5 expected observations in every cell of the contingency table. Checking after every request inflates your false positive rate — set a sample size target upfront and evaluate once.
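To set that sample size target before the experiment, a standard power calculation gives the per-group n for the t-test. Here’s a minimal sketch using the normal-approximation formula; the effect size `delta` and standard deviation `sigma` are illustrative placeholders, not values measured in this tutorial:

```python
from scipy.stats import norm

def required_sample_size(delta: float, sigma: float, alpha: float = 0.05, power: float = 0.8) -> int:
    """Per-group n for a two-sided two-sample test, normal approximation."""
    z_alpha = norm.ppf(1 - alpha / 2)  # ~1.96 for alpha=0.05
    z_beta = norm.ppf(power)           # ~0.84 for power=0.8
    n = 2 * ((z_alpha + z_beta) * sigma / delta) ** 2
    return int(n) + 1  # round up to be conservative

# e.g. detect a 0.05 ms latency difference when latency stddev is 0.2 ms
print(required_sample_size(delta=0.05, sigma=0.2))
```

Collect at least that many samples per variant, then run the test once.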

Simulate Traffic for Testing

You don’t want to wait for real traffic to verify the framework works. Fire off a batch of requests with a simple script:

# simulate_traffic.py
import requests
import random

sample_features = [
    [5.1, 3.5, 1.4, 0.2],
    [7.0, 3.2, 4.7, 1.4],
    [6.3, 3.3, 6.0, 2.5],
    [4.9, 3.0, 1.4, 0.2],
    [6.4, 3.2, 4.5, 1.5],
    [5.8, 2.7, 5.1, 1.9],
]

url = "http://localhost:8000/predict"

for i in range(200):
    features = random.choice(sample_features)
    noise = [f + random.gauss(0, 0.1) for f in features]
    response = requests.post(url, json={"features": noise})
    result = response.json()
    if i % 50 == 0:
        print(f"Request {i}: {result['model_version']} -> prediction={result['prediction']}")

print("Done. Check /stats and /logs for results.")

Run it, then hit the stats endpoint:

python simulate_traffic.py
curl http://localhost:8000/stats

You should see roughly 50/50 traffic split between the two models. Adjust the weights by calling the config endpoint:

curl -X PUT http://localhost:8000/config \
  -H "Content-Type: application/json" \
  -d '{"model_a_weight": 0.9, "model_b_weight": 0.1}'

This shifts 90% of traffic to model A — useful when you’re nearly done testing and want to ramp down the challenger.
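If you want to verify that the realized split is consistent with the configured weights, a binomial test on the per-model counts does the job. A quick sketch — the counts here are made-up examples, not output from this tutorial; in practice you’d pull them from /stats:

```python
from scipy.stats import binomtest

# hypothetical counts from the /stats endpoint
count_a, count_b = 55, 45
target_weight_a = 0.5

result = binomtest(count_a, count_a + count_b, p=target_weight_a)
print(f"p-value: {result.pvalue:.3f}")
# a small p-value would suggest the splitter is not hitting its target weight
```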

Common Errors and Fixes

FileNotFoundError: [Errno 2] No such file or directory: 'model_a.pkl'

You started the server before training the models. Run python train_models.py first to create the pickle files, then start uvicorn.

ValueError: X has 3 features, but LogisticRegression is expecting 4 features as input

Your request payload has the wrong number of features. The Iris models expect exactly 4 float values. Check your features array length.
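To fail fast with a clearer error, you could enforce the length in the request schema itself. A sketch using a Pydantic v2 field_validator — the hardcoded length 4 is specific to the Iris models:

```python
from pydantic import BaseModel, field_validator

class PredictionRequest(BaseModel):
    features: list[float]

    @field_validator("features")
    @classmethod
    def check_feature_count(cls, v: list[float]) -> list[float]:
        # the Iris models expect exactly 4 features
        if len(v) != 4:
            raise ValueError(f"expected 4 features, got {len(v)}")
        return v

print(PredictionRequest(features=[5.1, 3.5, 1.4, 0.2]))
```

With this in place, a short payload is rejected at validation time with your message instead of surfacing as a scikit-learn ValueError mid-prediction.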

422 Unprocessable Entity from FastAPI

Pydantic validation failed. The most common cause is sending a string where a float is expected, or missing the features key entirely. Check your JSON payload matches the schema:

{"features": [5.1, 3.5, 1.4, 0.2]}

TypeError: Object of type int64 is not JSON serializable

NumPy integers don’t serialize to JSON automatically. The int() cast in the predict endpoint handles this. If you add new fields from numpy arrays, wrap them with int() or float() before returning.
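The failure and the fix in miniature, using the standard-library json module to show the same error:

```python
import json
import numpy as np

pred = np.int64(2)  # what model.predict()[0] returns

try:
    json.dumps({"prediction": pred})
except TypeError as e:
    print(f"raises: {e}")

# casting to a built-in int fixes it
print(json.dumps({"prediction": int(pred)}))
```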

Uneven traffic split despite 50/50 weights

With small sample sizes, randomness causes visible skew. At 100 requests you might see 55/45 — that’s normal. The split converges to your target weights as volume increases. If you need deterministic assignment (for user-sticky experiments), hash the user ID instead of using random.random().
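A deterministic splitter along those lines might look like this — a sketch, where `user_id` stands in for whatever stable identifier your requests carry:

```python
import hashlib

def assign_sticky(user_id: str, weight_a: float = 0.5) -> str:
    """Map a user ID to a stable bucket in [0, 1), then threshold against weight_a."""
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF
    return "model_a" if bucket < weight_a else "model_b"

# the same user always lands on the same model
print(assign_sticky("user-42"))
print(assign_sticky("user-42"))
```

Because the hash is deterministic, a given user sees consistent behavior for the whole experiment, while the population still splits roughly according to the weights.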