## The Quick Version
A canary deployment sends a small percentage of traffic (5-10%) to a new model version while the rest stays on the current version. You compare metrics between the two, and if the canary performs well, you gradually shift all traffic to it. If it degrades, you roll back instantly.
```bash
pip install fastapi uvicorn httpx numpy
```
```python
import random

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

# Simulate two model versions
def model_v1(text: str) -> dict:
    return {"version": "v1", "prediction": "positive", "confidence": 0.87}

def model_v2(text: str) -> dict:
    return {"version": "v2", "prediction": "positive", "confidence": 0.91}

# Canary configuration
CANARY_PERCENTAGE = 10  # 10% of traffic goes to v2

class PredictRequest(BaseModel):
    text: str

@app.post("/predict")
async def predict(req: PredictRequest):
    # Route based on random percentage
    if random.randint(1, 100) <= CANARY_PERCENTAGE:
        result = model_v2(req.text)
        result["canary"] = True
    else:
        result = model_v1(req.text)
        result["canary"] = False
    return result
```
```bash
uvicorn server:app --host 0.0.0.0 --port 8000
```
That’s the simplest canary. 10% of requests go to v2, 90% to v1. In production, you need metric tracking and automated rollback — which is what the rest of this guide covers.
## Tracking Metrics for Both Versions
You need to compare the canary against the baseline on the same metrics: latency, error rate, prediction distribution, and any business metrics you track.
```python
import time
from dataclasses import dataclass, field
from threading import Lock

@dataclass
class VersionMetrics:
    request_count: int = 0
    error_count: int = 0
    total_latency: float = 0.0
    predictions: list = field(default_factory=list)
    lock: Lock = field(default_factory=Lock)

    def record(self, latency: float, prediction: str, error: bool = False):
        with self.lock:
            self.request_count += 1
            self.total_latency += latency
            self.predictions.append(prediction)
            if error:
                self.error_count += 1

    def summary(self) -> dict:
        with self.lock:
            n = max(self.request_count, 1)
            return {
                "request_count": self.request_count,
                "error_rate": self.error_count / n,
                "avg_latency_ms": (self.total_latency / n) * 1000,
                "prediction_distribution": {
                    p: self.predictions.count(p) / len(self.predictions)
                    for p in set(self.predictions)
                } if self.predictions else {},
            }

metrics = {"v1": VersionMetrics(), "v2": VersionMetrics()}

@app.post("/predict")
async def predict_with_metrics(req: PredictRequest):
    is_canary = random.randint(1, 100) <= CANARY_PERCENTAGE
    version = "v2" if is_canary else "v1"
    model_fn = model_v2 if is_canary else model_v1

    start = time.time()
    try:
        result = model_fn(req.text)
        latency = time.time() - start
        metrics[version].record(latency, result["prediction"])
    except Exception:
        latency = time.time() - start
        metrics[version].record(latency, "error", error=True)
        raise

    result["version"] = version
    return result

@app.get("/metrics")
async def get_metrics():
    return {v: m.summary() for v, m in metrics.items()}
```
Hit /metrics to compare versions side by side. You’re looking for the canary to have equal or better latency, equal or lower error rate, and a similar prediction distribution (large shifts suggest a bug).
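To make that comparison concrete, here is a hypothetical checker you could run against the two `/metrics` summaries. The function name and thresholds are illustrative, not part of the server above:

```python
def find_regressions(baseline: dict, canary: dict,
                     max_error_delta: float = 0.02,
                     max_latency_delta_ms: float = 50.0,
                     max_dist_shift: float = 0.10) -> list[str]:
    """Return a list of reasons the canary looks worse than the baseline."""
    problems = []
    if canary["error_rate"] - baseline["error_rate"] > max_error_delta:
        problems.append("error rate regressed")
    if canary["avg_latency_ms"] - baseline["avg_latency_ms"] > max_latency_delta_ms:
        problems.append("latency regressed")
    base_dist = baseline["prediction_distribution"]
    canary_dist = canary["prediction_distribution"]
    for label in set(base_dist) | set(canary_dist):
        # Flag any label whose share moved by more than 10 points
        if abs(base_dist.get(label, 0.0) - canary_dist.get(label, 0.0)) > max_dist_shift:
            problems.append(f"prediction share of '{label}' shifted")
    return problems

v1 = {"error_rate": 0.01, "avg_latency_ms": 40.0,
      "prediction_distribution": {"positive": 0.60, "negative": 0.40}}
v2 = {"error_rate": 0.05, "avg_latency_ms": 42.0,
      "prediction_distribution": {"positive": 0.62, "negative": 0.38}}

print(find_regressions(v1, v2))  # ['error rate regressed']
```

An empty list means the canary passes on all three axes; anything else is a reason to hold or roll back.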
The canary controller checks metrics periodically and decides whether to increase traffic, hold, or roll back.
```python
import threading

class CanaryController:
    def __init__(
        self,
        metrics: dict,
        max_error_rate_increase: float = 0.02,  # allow 2% more errors than baseline
        max_latency_increase_ms: float = 50,    # allow 50ms more latency
        promotion_steps: list[int] | None = None,
    ):
        self.metrics = metrics
        self.max_error_rate_increase = max_error_rate_increase
        self.max_latency_increase_ms = max_latency_increase_ms
        self.promotion_steps = promotion_steps or [10, 25, 50, 75, 100]
        self.current_step = 0
        self.current_percentage = self.promotion_steps[0]
        self.status = "running"  # running, promoted, rolled_back

    def evaluate(self) -> str:
        """Check if canary is healthy and decide next action."""
        v1 = self.metrics["v1"].summary()
        v2 = self.metrics["v2"].summary()

        # Need minimum samples before deciding
        if v2["request_count"] < 100:
            return "waiting"

        # Check error rate
        error_increase = v2["error_rate"] - v1["error_rate"]
        if error_increase > self.max_error_rate_increase:
            self.status = "rolled_back"
            self.current_percentage = 0
            return f"ROLLBACK: error rate +{error_increase:.3f}"

        # Check latency
        latency_increase = v2["avg_latency_ms"] - v1["avg_latency_ms"]
        if latency_increase > self.max_latency_increase_ms:
            self.status = "rolled_back"
            self.current_percentage = 0
            return f"ROLLBACK: latency +{latency_increase:.1f}ms"

        # Canary is healthy — promote to next step
        self.current_step += 1
        if self.current_step >= len(self.promotion_steps):
            self.status = "promoted"
            self.current_percentage = 100
            return "PROMOTED: canary is now serving all traffic"
        self.current_percentage = self.promotion_steps[self.current_step]
        return f"PROMOTED to {self.current_percentage}%"

controller = CanaryController(metrics)

# Run evaluation every 60 seconds
def evaluation_loop():
    global CANARY_PERCENTAGE
    while controller.status == "running":
        result = controller.evaluate()
        print(f"Canary evaluation: {result}")
        CANARY_PERCENTAGE = controller.current_percentage
        time.sleep(60)

threading.Thread(target=evaluation_loop, daemon=True).start()
```
The promotion steps (10% → 25% → 50% → 75% → 100%) give you multiple checkpoints. If the canary degrades at 25% traffic, you catch it before it affects most users.
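The schedule also bounds how long a healthy rollout takes. A back-of-the-envelope sketch, under illustrative assumptions (a steady 50 requests/sec overall, the 60-second evaluation interval from the loop above, and the controller's 100-sample minimum per step):

```python
STEPS = [10, 25, 50, 75, 100]
QPS = 50                   # assumed overall traffic, requests/sec
EVAL_INTERVAL_S = 60
MIN_CANARY_SAMPLES = 100

total_s = 0.0
for pct in STEPS:
    canary_qps = QPS * pct / 100
    samples_wait = MIN_CANARY_SAMPLES / canary_qps  # time to gather enough canary data
    total_s += max(samples_wait, EVAL_INTERVAL_S)   # each step also waits for an eval tick

print(f"healthy rollout completes in about {total_s / 60:.0f} minutes")
```

At this traffic level the evaluation timer dominates every step; on a low-traffic service the sample requirement takes over, and the rollout stretches accordingly.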
## Canary with Kubernetes and Istio
For production Kubernetes deployments, use Istio’s traffic splitting instead of application-level routing:
```yaml
# canary-virtualservice.yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: ml-model-service
spec:
  hosts:
    - ml-model-service
  http:
    - route:
        - destination:
            host: ml-model-service
            subset: stable
          weight: 90
        - destination:
            host: ml-model-service
            subset: canary
          weight: 10
---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: ml-model-service
spec:
  host: ml-model-service
  subsets:
    - name: stable
      labels:
        version: v1
    - name: canary
      labels:
        version: v2
```
```bash
kubectl apply -f canary-virtualservice.yaml
```
To promote, change the weights. To roll back, set canary weight to 0:
```bash
# Promote to 50/50
kubectl patch virtualservice ml-model-service --type merge -p '
spec:
  http:
  - route:
    - destination:
        host: ml-model-service
        subset: stable
      weight: 50
    - destination:
        host: ml-model-service
        subset: canary
      weight: 50
'

# Full rollback
kubectl patch virtualservice ml-model-service --type merge -p '
spec:
  http:
  - route:
    - destination:
        host: ml-model-service
        subset: stable
      weight: 100
    - destination:
        host: ml-model-service
        subset: canary
      weight: 0
'
```
## Sticky Sessions for Consistent User Experience
Random routing means the same user might get different model versions on consecutive requests. For chat or multi-turn applications, use sticky sessions based on user ID:
```python
import hashlib

def route_by_user(user_id: str, canary_percentage: int = 10) -> str:
    """Deterministic routing based on user ID."""
    hash_val = int(hashlib.md5(user_id.encode()).hexdigest(), 16)
    bucket = hash_val % 100
    return "v2" if bucket < canary_percentage else "v1"

# Same user always gets the same version
print(route_by_user("user_123", canary_percentage=10))  # always v1 or always v2
print(route_by_user("user_456", canary_percentage=10))
```
This ensures user “user_123” always hits the same model version during the canary period. Because the hash is fixed, promotion only moves additional users onto the canary (existing canary users stay put), and a rollback to 0% moves every canary user back to the stable version at once.
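A useful property of this bucketing is that raising the canary percentage only ever adds users to the canary. A quick standalone check (`route_by_user` is restated here so the snippet runs on its own):

```python
import hashlib

def route_by_user(user_id: str, canary_percentage: int = 10) -> str:
    bucket = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 100
    return "v2" if bucket < canary_percentage else "v1"

users = [f"user_{i}" for i in range(1000)]
on_v2_at_10 = {u for u in users if route_by_user(u, 10) == "v2"}
on_v2_at_25 = {u for u in users if route_by_user(u, 25) == "v2"}

# Every user on the canary at 10% is still on it at 25% — no mid-rollout flapping
print(on_v2_at_10 <= on_v2_at_25)  # True
```

This holds because `bucket < 10` implies `bucket < 25`: promotion widens the bucket range, it never reshuffles it.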
## Common Errors and Fixes
**Canary shows better metrics than baseline but it’s a fluke**
With small sample sizes, random variation can look like real improvement. Wait for at least 1000 requests per version before making promotion decisions. For statistical rigor, use a two-sample t-test on latency distributions.
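For the t-test suggestion, here is a minimal stdlib sketch of Welch’s two-sample t statistic; the sample data and the rough |t| > 2 cutoff are illustrative:

```python
import math
from statistics import mean, variance

def welch_t(sample_a: list[float], sample_b: list[float]) -> float:
    """Welch's t statistic; tolerates unequal variances between the samples."""
    na, nb = len(sample_a), len(sample_b)
    se = math.sqrt(variance(sample_a) / na + variance(sample_b) / nb)
    return (mean(sample_a) - mean(sample_b)) / se

v1_latencies = [10.0, 11.0, 9.0, 10.5, 9.5] * 200  # ms, stand-in data
v2_latencies = [x + 2.0 for x in v1_latencies]      # canary ~2 ms slower on average

t = welch_t(v1_latencies, v2_latencies)
print(abs(t) > 2)  # True — at these sample sizes, |t| well above ~2 suggests a real difference
```

Latency distributions are heavy-tailed, so treat this as a sanity check on means rather than a full analysis.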
**Prediction distribution shifted but error rate is fine**
A model that predicts “positive” 80% of the time instead of 60% might not show errors in your metrics, but it’s likely broken. Track prediction distribution as a first-class metric and alert on significant shifts.
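One simple way to quantify such a shift is total variation distance between the two label distributions; the 0.1 alert threshold below is an illustrative choice, not a standard:

```python
def total_variation(p: dict[str, float], q: dict[str, float]) -> float:
    """Half the L1 distance between two distributions: 0 = identical, 1 = disjoint."""
    labels = set(p) | set(q)
    return 0.5 * sum(abs(p.get(label, 0.0) - q.get(label, 0.0)) for label in labels)

baseline = {"positive": 0.60, "negative": 0.40}
canary = {"positive": 0.80, "negative": 0.20}

shift = total_variation(baseline, canary)
print(f"TVD = {shift:.2f}")  # TVD = 0.20
print(shift > 0.1)           # True — large enough to alert on
```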
**Canary pod is slower because it’s cold**
New pods haven’t warmed up their caches, JIT compilation, or GPU kernels. Send warmup traffic to the canary before including it in the routing pool. For LLM serving, run a few dummy inference requests during pod startup.
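A minimal warmup sketch; `model_v2` and the dummy input are placeholders for your real model and representative requests:

```python
def warmup(model_fn, n_requests: int = 5) -> None:
    """Run dummy inferences so caches, JIT paths, and GPU kernels are hot
    before the pod joins the routing pool."""
    for _ in range(n_requests):
        model_fn("warmup input")

# e.g. in server startup code:
# @app.on_event("startup")
# async def warm_canary():
#     warmup(model_v2)
```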
**Traffic split isn’t exact**
Random routing with 10% canary means approximately 10%, not exactly. For 100 requests, you might see 7 or 13 go to canary. This is fine — the law of large numbers smooths it out over thousands of requests.
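You can see both effects directly by simulating the split (illustrative simulation with a fixed seed):

```python
import random

random.seed(0)

# 1,000 deployments of 100 requests each at a 10% split: counts vary widely
counts = [sum(random.randint(1, 100) <= 10 for _ in range(100)) for _ in range(1000)]
print(min(counts), max(counts))  # spread of several requests around 10

# 100,000 requests: the observed split is very close to 10%
big_n = sum(random.randint(1, 100) <= 10 for _ in range(100_000))
print(big_n / 100_000)
```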
**Rollback doesn’t happen fast enough**
Reduce the evaluation interval from 60 seconds to 15 seconds, and add an immediate rollback trigger on error rate spikes above 10%. The evaluation loop should also check health on every request if error rates are elevated.
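One way to sketch that immediate trigger is a per-request circuit breaker over a rolling window; the class name, window size, and threshold here are illustrative:

```python
from collections import deque

class ErrorSpikeGuard:
    """Trips when the error rate over the last `window` canary requests
    exceeds `threshold`; the caller then sets the canary weight to 0."""

    def __init__(self, window: int = 50, threshold: float = 0.10):
        self.results: deque[bool] = deque(maxlen=window)
        self.threshold = threshold
        self.tripped = False

    def record(self, error: bool) -> bool:
        self.results.append(error)
        # Only judge once the window is full, to avoid tripping on one early error
        if len(self.results) == self.results.maxlen:
            rate = sum(self.results) / len(self.results)
            if rate > self.threshold:
                self.tripped = True
        return self.tripped
```

Call `record` in the canary request path and zero the traffic weight the moment it returns `True`, instead of waiting for the next evaluation tick.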