## The Quick Version
A canary deployment sends a small percentage of traffic (5-10%) to a new model version while the rest stays on the current version. You compare metrics between the two, and if the canary performs well, you gradually shift all traffic to it. If it degrades, you roll back instantly.
```bash
pip install fastapi uvicorn httpx numpy
```
```python
import random

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

# Simulate two model versions
def model_v1(text: str) -> dict:
    return {"version": "v1", "prediction": "positive", "confidence": 0.87}

def model_v2(text: str) -> dict:
    return {"version": "v2", "prediction": "positive", "confidence": 0.91}

# Canary configuration
CANARY_PERCENTAGE = 10  # 10% of traffic goes to v2

class PredictRequest(BaseModel):
    text: str

@app.post("/predict")
async def predict(req: PredictRequest):
    # Route based on random percentage
    if random.randint(1, 100) <= CANARY_PERCENTAGE:
        result = model_v2(req.text)
        result["canary"] = True
    else:
        result = model_v1(req.text)
        result["canary"] = False
    return result
```
```bash
uvicorn server:app --host 0.0.0.0 --port 8000
```
That’s the simplest canary. 10% of requests go to v2, 90% to v1. In production, you need metric tracking and automated rollback — which is what the rest of this guide covers.
## Tracking Metrics for Both Versions
You need to compare the canary against the baseline on the same metrics: latency, error rate, prediction distribution, and any business metrics you track.
```python
import time
from dataclasses import dataclass, field
from threading import Lock

@dataclass
class VersionMetrics:
    request_count: int = 0
    error_count: int = 0
    total_latency: float = 0.0
    predictions: list = field(default_factory=list)
    lock: Lock = field(default_factory=Lock)

    def record(self, latency: float, prediction: str, error: bool = False):
        with self.lock:
            self.request_count += 1
            self.total_latency += latency
            self.predictions.append(prediction)
            if error:
                self.error_count += 1

    def summary(self) -> dict:
        with self.lock:
            n = max(self.request_count, 1)
            return {
                "request_count": self.request_count,
                "error_rate": self.error_count / n,
                "avg_latency_ms": (self.total_latency / n) * 1000,
                "prediction_distribution": {
                    p: self.predictions.count(p) / len(self.predictions)
                    for p in set(self.predictions)
                } if self.predictions else {},
            }

metrics = {"v1": VersionMetrics(), "v2": VersionMetrics()}

@app.post("/predict")
async def predict_with_metrics(req: PredictRequest):
    is_canary = random.randint(1, 100) <= CANARY_PERCENTAGE
    version = "v2" if is_canary else "v1"
    model_fn = model_v2 if is_canary else model_v1

    start = time.time()
    try:
        result = model_fn(req.text)
        latency = time.time() - start
        metrics[version].record(latency, result["prediction"])
    except Exception:
        latency = time.time() - start
        metrics[version].record(latency, "error", error=True)
        raise

    result["version"] = version
    return result

@app.get("/metrics")
async def get_metrics():
    return {v: m.summary() for v, m in metrics.items()}
```
Hit /metrics to compare versions side by side. You’re looking for the canary to have equal or better latency, equal or lower error rate, and a similar prediction distribution (large shifts suggest a bug).
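To make that comparison concrete, here is a hypothetical checker you could run against the two `/metrics` summaries. The function name and thresholds are illustrative, not part of the server above:

```python
def find_regressions(baseline: dict, canary: dict,
                     max_error_delta: float = 0.02,
                     max_latency_delta_ms: float = 50.0,
                     max_dist_shift: float = 0.10) -> list[str]:
    """Return a list of reasons the canary looks worse than the baseline."""
    problems = []
    if canary["error_rate"] - baseline["error_rate"] > max_error_delta:
        problems.append("error rate regressed")
    if canary["avg_latency_ms"] - baseline["avg_latency_ms"] > max_latency_delta_ms:
        problems.append("latency regressed")
    base_dist = baseline["prediction_distribution"]
    canary_dist = canary["prediction_distribution"]
    for label in set(base_dist) | set(canary_dist):
        # Flag any label whose share moved by more than 10 points
        if abs(base_dist.get(label, 0.0) - canary_dist.get(label, 0.0)) > max_dist_shift:
            problems.append(f"prediction share of '{label}' shifted")
    return problems

v1 = {"error_rate": 0.01, "avg_latency_ms": 40.0,
      "prediction_distribution": {"positive": 0.60, "negative": 0.40}}
v2 = {"error_rate": 0.05, "avg_latency_ms": 42.0,
      "prediction_distribution": {"positive": 0.62, "negative": 0.38}}

print(find_regressions(v1, v2))  # ['error rate regressed']
```

An empty list means the canary passes on all three axes; anything else is a reason to hold or roll back.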
The canary controller checks metrics periodically and decides whether to increase traffic, hold, or roll back.
```python
import threading

class CanaryController:
    def __init__(
        self,
        metrics: dict,
        max_error_rate_increase: float = 0.02,  # allow 2% more errors than baseline
        max_latency_increase_ms: float = 50,    # allow 50ms more latency
        promotion_steps: list[int] | None = None,
    ):
        self.metrics = metrics
        self.max_error_rate_increase = max_error_rate_increase
        self.max_latency_increase_ms = max_latency_increase_ms
        self.promotion_steps = promotion_steps or [10, 25, 50, 75, 100]
        self.current_step = 0
        self.current_percentage = self.promotion_steps[0]
        self.status = "running"  # running, promoted, rolled_back

    def evaluate(self) -> str:
        """Check if canary is healthy and decide next action."""
        v1 = self.metrics["v1"].summary()
        v2 = self.metrics["v2"].summary()

        # Need minimum samples before deciding
        if v2["request_count"] < 100:
            return "waiting"

        # Check error rate
        error_increase = v2["error_rate"] - v1["error_rate"]
        if error_increase > self.max_error_rate_increase:
            self.status = "rolled_back"
            self.current_percentage = 0
            return f"ROLLBACK: error rate +{error_increase:.3f}"

        # Check latency
        latency_increase = v2["avg_latency_ms"] - v1["avg_latency_ms"]
        if latency_increase > self.max_latency_increase_ms:
            self.status = "rolled_back"
            self.current_percentage = 0
            return f"ROLLBACK: latency +{latency_increase:.1f}ms"

        # Canary is healthy — promote to next step
        self.current_step += 1
        if self.current_step >= len(self.promotion_steps):
            self.status = "promoted"
            self.current_percentage = 100
            return "PROMOTED: canary is now serving all traffic"
        self.current_percentage = self.promotion_steps[self.current_step]
        return f"PROMOTED to {self.current_percentage}%"

controller = CanaryController(metrics)

# Run evaluation every 60 seconds
def evaluation_loop():
    global CANARY_PERCENTAGE
    while controller.status == "running":
        result = controller.evaluate()
        print(f"Canary evaluation: {result}")
        CANARY_PERCENTAGE = controller.current_percentage
        time.sleep(60)

threading.Thread(target=evaluation_loop, daemon=True).start()
```
The promotion steps (10% → 25% → 50% → 75% → 100%) give you multiple checkpoints. If the canary degrades at 25% traffic, you catch it before it affects most users.
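The schedule also bounds how long a healthy rollout takes. A back-of-the-envelope sketch, under illustrative assumptions (a steady 50 requests/sec overall, the 60-second evaluation interval from the loop above, and the controller's 100-sample minimum per step):

```python
STEPS = [10, 25, 50, 75, 100]
QPS = 50                   # assumed overall traffic, requests/sec
EVAL_INTERVAL_S = 60
MIN_CANARY_SAMPLES = 100

total_s = 0.0
for pct in STEPS:
    canary_qps = QPS * pct / 100
    samples_wait = MIN_CANARY_SAMPLES / canary_qps  # time to gather enough canary data
    total_s += max(samples_wait, EVAL_INTERVAL_S)   # each step also waits for an eval tick

print(f"healthy rollout completes in about {total_s / 60:.0f} minutes")
```

At this traffic level the evaluation timer dominates every step; on a low-traffic service the sample requirement takes over, and the rollout stretches accordingly.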
## Canary with Kubernetes and Istio
For production Kubernetes deployments, use Istio’s traffic splitting instead of application-level routing:
```yaml
# canary-virtualservice.yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: ml-model-service
spec:
  hosts:
    - ml-model-service
  http:
    - route:
        - destination:
            host: ml-model-service
            subset: stable
          weight: 90
        - destination:
            host: ml-model-service
            subset: canary
          weight: 10
---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: ml-model-service
spec:
  host: ml-model-service
  subsets:
    - name: stable
      labels:
        version: v1
    - name: canary
      labels:
        version: v2
```
```bash
kubectl apply -f canary-virtualservice.yaml
```
To promote, change the weights. To roll back, set canary weight to 0:
```bash
# Promote to 50/50
kubectl patch virtualservice ml-model-service --type merge -p '
spec:
  http:
  - route:
    - destination:
        host: ml-model-service
        subset: stable
      weight: 50
    - destination:
        host: ml-model-service
        subset: canary
      weight: 50
'

# Full rollback
kubectl patch virtualservice ml-model-service --type merge -p '
spec:
  http:
  - route:
    - destination:
        host: ml-model-service
        subset: stable
      weight: 100
    - destination:
        host: ml-model-service
        subset: canary
      weight: 0
'
```
## Sticky Sessions for Consistent User Experience
Random routing means the same user might get different model versions on consecutive requests. For chat or multi-turn applications, use sticky sessions based on user ID:
```python
import hashlib

def route_by_user(user_id: str, canary_percentage: int = 10) -> str:
    """Deterministic routing based on user ID."""
    hash_val = int(hashlib.md5(user_id.encode()).hexdigest(), 16)
    bucket = hash_val % 100
    return "v2" if bucket < canary_percentage else "v1"

# Same user always gets the same version
print(route_by_user("user_123", canary_percentage=10))  # always v1 or always v2
print(route_by_user("user_456", canary_percentage=10))
```
This ensures user “user_123” always hits the same model version during the canary period. Because the hash is fixed, promotion only moves additional users onto the canary (existing canary users stay put), and a rollback to 0% moves every canary user back to the stable version at once.
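A useful property of this bucketing is that raising the canary percentage only ever adds users to the canary. A quick standalone check (`route_by_user` is restated here so the snippet runs on its own):

```python
import hashlib

def route_by_user(user_id: str, canary_percentage: int = 10) -> str:
    bucket = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 100
    return "v2" if bucket < canary_percentage else "v1"

users = [f"user_{i}" for i in range(1000)]
on_v2_at_10 = {u for u in users if route_by_user(u, 10) == "v2"}
on_v2_at_25 = {u for u in users if route_by_user(u, 25) == "v2"}

# Every user on the canary at 10% is still on it at 25% — no mid-rollout flapping
print(on_v2_at_10 <= on_v2_at_25)  # True
```

This holds because `bucket < 10` implies `bucket < 25`: promotion widens the bucket range, it never reshuffles it.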
## Common Errors and Fixes
**Canary shows better metrics than baseline but it’s a fluke**
With small sample sizes, random variation can look like real improvement. Wait for at least 1000 requests per version before making promotion decisions. For statistical rigor, use a two-sample t-test on latency distributions.
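For the t-test suggestion, here is a minimal stdlib sketch of Welch’s two-sample t statistic; the sample data and the rough |t| > 2 cutoff are illustrative:

```python
import math
from statistics import mean, variance

def welch_t(sample_a: list[float], sample_b: list[float]) -> float:
    """Welch's t statistic; tolerates unequal variances between the samples."""
    na, nb = len(sample_a), len(sample_b)
    se = math.sqrt(variance(sample_a) / na + variance(sample_b) / nb)
    return (mean(sample_a) - mean(sample_b)) / se

v1_latencies = [10.0, 11.0, 9.0, 10.5, 9.5] * 200  # ms, stand-in data
v2_latencies = [x + 2.0 for x in v1_latencies]      # canary ~2 ms slower on average

t = welch_t(v1_latencies, v2_latencies)
print(abs(t) > 2)  # True — at these sample sizes, |t| well above ~2 suggests a real difference
```

Latency distributions are heavy-tailed, so treat this as a sanity check on means rather than a full analysis.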
**Prediction distribution shifted but error rate is fine**
A model that predicts “positive” 80% of the time instead of 60% might not show errors in your metrics, but it’s likely broken. Track prediction distribution as a first-class metric and alert on significant shifts.
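One simple way to quantify such a shift is total variation distance between the two label distributions; the 0.1 alert threshold below is an illustrative choice, not a standard:

```python
def total_variation(p: dict[str, float], q: dict[str, float]) -> float:
    """Half the L1 distance between two distributions: 0 = identical, 1 = disjoint."""
    labels = set(p) | set(q)
    return 0.5 * sum(abs(p.get(label, 0.0) - q.get(label, 0.0)) for label in labels)

baseline = {"positive": 0.60, "negative": 0.40}
canary = {"positive": 0.80, "negative": 0.20}

shift = total_variation(baseline, canary)
print(f"TVD = {shift:.2f}")  # TVD = 0.20
print(shift > 0.1)           # True — large enough to alert on
```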
**Canary pod is slower because it’s cold**
New pods haven’t warmed up their caches, JIT compilation, or GPU kernels. Send warmup traffic to the canary before including it in the routing pool. For LLM serving, run a few dummy inference requests during pod startup.
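A minimal warmup sketch; `model_v2` and the dummy input are placeholders for your real model and representative requests:

```python
def warmup(model_fn, n_requests: int = 5) -> None:
    """Run dummy inferences so caches, JIT paths, and GPU kernels are hot
    before the pod joins the routing pool."""
    for _ in range(n_requests):
        model_fn("warmup input")

# e.g. in server startup code:
# @app.on_event("startup")
# async def warm_canary():
#     warmup(model_v2)
```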
**Traffic split isn’t exact**
Random routing with 10% canary means approximately 10%, not exactly. For 100 requests, you might see 7 or 13 go to canary. This is fine — the law of large numbers smooths it out over thousands of requests.
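You can see both effects directly by simulating the split (illustrative simulation with a fixed seed):

```python
import random

random.seed(0)

# 1,000 deployments of 100 requests each at a 10% split: counts vary widely
counts = [sum(random.randint(1, 100) <= 10 for _ in range(100)) for _ in range(1000)]
print(min(counts), max(counts))  # spread of several requests around 10

# 100,000 requests: the observed split is very close to 10%
big_n = sum(random.randint(1, 100) <= 10 for _ in range(100_000))
print(big_n / 100_000)
```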
**Rollback doesn’t happen fast enough**
Reduce the evaluation interval from 60 seconds to 15 seconds, and add an immediate rollback trigger on error rate spikes above 10%. The evaluation loop should also check health on every request if error rates are elevated.
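One way to sketch that immediate trigger is a per-request circuit breaker over a rolling window; the class name, window size, and threshold here are illustrative:

```python
from collections import deque

class ErrorSpikeGuard:
    """Trips when the error rate over the last `window` canary requests
    exceeds `threshold`; the caller then sets the canary weight to 0."""

    def __init__(self, window: int = 50, threshold: float = 0.10):
        self.results: deque[bool] = deque(maxlen=window)
        self.threshold = threshold
        self.tripped = False

    def record(self, error: bool) -> bool:
        self.results.append(error)
        # Only judge once the window is full, to avoid tripping on one early error
        if len(self.results) == self.results.maxlen:
            rate = sum(self.results) / len(self.results)
            if rate > self.threshold:
                self.tripped = True
        return self.tripped
```

Call `record` in the canary request path and zero the traffic weight the moment it returns `True`, instead of waiting for the next evaluation tick.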