Your model worked great three months ago. Now accuracy is quietly tanking and nobody noticed until a customer complained. That’s drift – the slow poison of production ML. The training data no longer represents what the model sees in the real world, and predictions degrade without any code change or deployment event triggering an alert.

Detecting drift early is the difference between a quick retrain and a weeks-long fire drill. Here’s how to set it up properly.

Types of Drift You Need to Watch

There are three distinct failure modes, and they require different detection strategies:

Data drift (also called covariate shift) – the input feature distributions change. Maybe your e-commerce model trained on US traffic starts getting requests from a new market. The model’s inputs look different from training data, even if the relationship between inputs and outputs hasn’t changed.

Concept drift – the relationship between inputs and outputs changes. Your fraud model learned that transactions over $500 from new accounts are suspicious. Then the fraud pattern shifts to small recurring charges. The features look the same, but the correct labels have changed.

Prediction drift – the model’s output distribution shifts. Even without ground truth labels, you can monitor whether the model is predicting class A 80% of the time when it used to predict it 40% of the time. This is the cheapest signal to monitor because you don’t need labels.

My recommendation: monitor all three. Data drift and prediction drift are free (no labels needed). Concept drift requires delayed ground truth, but it’s the one that actually tells you the model is wrong.
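To make the label-free option concrete, here is a minimal prediction-drift check that compares the predicted-class mix between a reference window and a current window using a chi-squared test. The synthetic prediction arrays and the 0.05 threshold are illustrative, not from any particular setup:

```python
import numpy as np
from scipy.stats import chisquare

# Illustrative predicted-class arrays; in practice, load these from
# your prediction logs for the reference and current windows.
rng = np.random.default_rng(42)
ref_preds = rng.choice(["A", "B"], size=10_000, p=[0.4, 0.6])
cur_preds = rng.choice(["A", "B"], size=2_000, p=[0.8, 0.2])

classes = ["A", "B"]
ref_counts = np.array([(ref_preds == c).sum() for c in classes])
cur_counts = np.array([(cur_preds == c).sum() for c in classes])

# Expected counts under "no drift": reference proportions scaled
# to the size of the current window.
expected = ref_counts / ref_counts.sum() * cur_counts.sum()

stat, p_value = chisquare(f_obs=cur_counts, f_exp=expected)
if p_value < 0.05:
    print(f"Prediction drift detected (p={p_value:.2e})")
```

No ground truth appears anywhere in this check, which is exactly why it can run the moment predictions are logged.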

Quick Drift Detection with Evidently AI

Evidently is one of the most widely used open-source libraries for drift detection. It handles the statistical testing for you and generates reports you can plug into dashboards or CI pipelines.

Install it and run your first drift report in a couple dozen lines:

import pandas as pd
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset, TargetDriftPreset

# reference = your training data, current = recent production data
reference = pd.read_csv("data/training_sample.csv")
current = pd.read_csv("data/production_last_7_days.csv")

# Run data drift detection across all features
drift_report = Report(metrics=[
    DataDriftPreset(),       # checks every feature column
    TargetDriftPreset(),     # checks prediction/target distribution
])

drift_report.run(reference_data=reference, current_data=current)

# Get results as a dictionary for programmatic access
results = drift_report.as_dict()

drift_detected = results["metrics"][0]["result"]["dataset_drift"]
drifted_features = results["metrics"][0]["result"]["number_of_drifted_columns"]

print(f"Dataset drift detected: {drift_detected}")
print(f"Drifted features: {drifted_features}/{results['metrics'][0]['result']['number_of_columns']}")

# Save HTML report for visual inspection
drift_report.save_html("drift_report.html")

Evidently automatically picks a statistical test based on feature type and sample size. For smaller numerical samples it defaults to the Kolmogorov-Smirnov test (switching to a distance-based metric such as Wasserstein on larger samples, where p-value tests become oversensitive); for categorical features it uses chi-squared. You can override these, but the defaults are solid.

Statistical Tests Under the Hood

Understanding the tests helps you tune thresholds and debug false positives.

Kolmogorov-Smirnov (KS) test – compares the cumulative distribution functions of two samples. The test statistic is the maximum distance between the two CDFs. A p-value below your threshold (typically 0.05) means the distributions are statistically different. KS works well for continuous features but struggles with heavy ties in discrete data.
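Since Evidently wraps these tests, it helps to see one in the raw. Here is a two-sample KS check with SciPy; the synthetic score arrays are illustrative:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
reference = rng.normal(0.70, 0.10, size=10_000)  # training-time scores
current = rng.normal(0.65, 0.15, size=5_000)     # shifted production scores

# Two-sample KS: the statistic is the max distance between the empirical CDFs
result = ks_2samp(reference, current)
print(f"KS statistic: {result.statistic:.4f}, p-value: {result.pvalue:.2e}")

if result.pvalue < 0.05:
    print("Distributions differ at the 0.05 level")
```

One caveat: at large sample sizes the KS test flags even tiny, practically irrelevant shifts, which is one reason magnitude metrics like PSI are popular for production monitoring.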

Population Stability Index (PSI) – measures how much a distribution has shifted. Unlike KS, PSI gives you a single magnitude number. The standard thresholds: PSI < 0.1 means no significant shift, 0.1-0.2 means moderate drift worth investigating, and PSI > 0.2 means significant drift requiring action.

PSI is my preferred metric for production monitoring because it’s intuitive, directional, and the thresholds are well-established across industry. Here’s how to compute it manually:

import numpy as np

def compute_psi(reference: np.ndarray, current: np.ndarray, bins: int = 10) -> float:
    """Compute Population Stability Index between two distributions."""
    # Create bins from reference distribution
    breakpoints = np.quantile(reference, np.linspace(0, 1, bins + 1))
    breakpoints = np.unique(breakpoints)  # handle duplicate quantiles

    # Count proportions in each bin
    ref_counts = np.histogram(reference, bins=breakpoints)[0]
    cur_counts = np.histogram(current, bins=breakpoints)[0]

    # Convert to proportions, add small epsilon to avoid log(0)
    eps = 1e-6
    ref_pct = ref_counts / ref_counts.sum() + eps
    cur_pct = cur_counts / cur_counts.sum() + eps

    psi = np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct))
    return psi

# Example usage
ref_scores = np.random.normal(0.7, 0.1, size=10000)
prod_scores = np.random.normal(0.65, 0.15, size=5000)  # shifted distribution

psi_value = compute_psi(ref_scores, prod_scores)
print(f"PSI: {psi_value:.4f}")

if psi_value < 0.1:
    print("No significant drift")
elif psi_value < 0.2:
    print("Moderate drift -- investigate")
else:
    print("Significant drift -- retrain or roll back")

Jensen-Shannon divergence – a symmetric, bounded version of KL divergence. With a base-2 logarithm it ranges from 0 (identical) to 1 (completely different). Use this when you want a clean 0-1 metric for dashboards.
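Here is a sketch of computing JS divergence on continuous data, which first requires binning both samples on a shared grid. The bin count and the synthetic arrays are illustrative:

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

rng = np.random.default_rng(1)
reference = rng.normal(0.70, 0.10, size=10_000)
current = rng.normal(0.65, 0.15, size=5_000)

# Bin both samples on a shared grid, then compare the histograms
bins = np.histogram_bin_edges(np.concatenate([reference, current]), bins=20)
ref_hist, _ = np.histogram(reference, bins=bins)
cur_hist, _ = np.histogram(current, bins=bins)

# scipy's jensenshannon normalizes the inputs and returns the JS *distance*
# (square root of the divergence); base=2 bounds the divergence in [0, 1]
js_distance = jensenshannon(ref_hist, cur_hist, base=2)
js_divergence = js_distance ** 2
print(f"JS divergence: {js_divergence:.4f}")
```

Watch the distance-vs-divergence distinction: squaring is easy to forget, and the two are sometimes conflated in dashboards.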

Setting Up a Drift Monitoring Pipeline

A one-off report is useful for debugging. For production, you need automated monitoring that runs on a schedule and alerts you when drift crosses thresholds.

Here’s a monitoring pipeline using Evidently’s test suite approach, which gives you pass/fail verdicts you can wire into alerting:

from evidently.test_suite import TestSuite
from evidently.tests import (
    TestShareOfDriftedColumns,
    TestColumnDrift,
)
import json
import pandas as pd
import requests
from datetime import datetime

def run_drift_checks(reference_path: str, current_path: str) -> dict:
    """Run drift tests and return structured results."""
    reference = pd.read_csv(reference_path)
    current = pd.read_csv(current_path)

    suite = TestSuite(tests=[
        # Fail if more than 30% of features drift
        TestShareOfDriftedColumns(lt=0.3),
        # Explicitly monitor critical features
        TestColumnDrift(column_name="user_age"),
        TestColumnDrift(column_name="transaction_amount"),
        TestColumnDrift(column_name="session_duration"),
    ])

    suite.run(reference_data=reference, current_data=current)
    result = suite.as_dict()

    return {
        "timestamp": datetime.utcnow().isoformat(),
        "overall_passed": result["summary"]["all_passed"],
        "total_tests": result["summary"]["total"],
        "failed_tests": result["summary"]["failed"],
        "tests": [
            {
                "name": t["name"],
                "status": t["status"],
                "description": t["description"],
            }
            for t in result["tests"]
        ],
    }

def send_alert(results: dict, webhook_url: str):
    """Send Slack alert when drift tests fail."""
    if results["overall_passed"]:
        return

    failed = [t for t in results["tests"] if t["status"] == "FAIL"]
    message = {
        "text": (
            f":warning: Drift detected at {results['timestamp']}\n"
            f"Failed {results['failed_tests']}/{results['total_tests']} tests:\n"
            + "\n".join(f"  - {t['name']}: {t['description']}" for t in failed)
        )
    }
    requests.post(webhook_url, json=message)

# Run as a scheduled job (cron, Airflow, Prefect, etc.)
results = run_drift_checks("data/reference.csv", "data/production_latest.csv")
send_alert(results, webhook_url="https://hooks.slack.com/services/YOUR/WEBHOOK/URL")

# Log results for historical tracking
with open(f"drift_logs/{results['timestamp']}.json", "w") as f:
    json.dump(results, f, indent=2)

Schedule this with cron, Airflow, or Prefect. For most use cases, daily checks are sufficient. If you’re processing high-volume real-time traffic, run it hourly against a sliding window.
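If you go the plain cron route, the entry might look like this (the paths and the drift_check.py script name are hypothetical):

```shell
# Hypothetical crontab entry: run the drift job daily at 06:00
0 6 * * * cd /opt/ml-monitoring && /usr/bin/python3 drift_check.py >> logs/drift.log 2>&1
```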

Alerting Strategy That Actually Works

Don’t alert on every statistical test failure. You’ll drown in noise and start ignoring alerts within a week. Instead, build a tiered system:

Tier 1 – Log only: A single feature drifts with p-value between 0.01 and 0.05. This is informational. Log it, put it on a dashboard, move on.

Tier 2 – Slack notification: More than 30% of features drift, or a critical feature (one with high feature importance) drifts with p-value < 0.01. Someone should look at it within a day.

Tier 3 – PagerDuty / on-call alert: Prediction drift exceeds PSI 0.25 AND performance metrics (accuracy, precision, recall) have degraded against a holdout set. This means the model is actively producing worse results. Someone needs to act now.

The key insight: drift alone doesn’t mean the model is broken. Seasonal patterns, marketing campaigns, and product launches all cause legitimate distribution shifts. Only escalate when drift correlates with performance degradation.
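One way to sketch that tiering in code. The function and signal names are hypothetical; the thresholds follow the tiers above:

```python
def alert_tier(n_drifted: int, n_features: int,
               critical_p: float, prediction_psi: float,
               perf_degraded: bool) -> int:
    """Map drift signals to an alert tier (0 = no alert)."""
    # Tier 3: prediction drift AND measured performance degradation -> page someone
    if prediction_psi > 0.25 and perf_degraded:
        return 3
    # Tier 2: widespread drift, or a critical feature strongly drifted
    if n_drifted / n_features > 0.3 or critical_p < 0.01:
        return 2
    # Tier 1: a single borderline drift result -> log only
    if 0.01 <= critical_p < 0.05 or n_drifted > 0:
        return 1
    return 0

# One borderline feature -> log only; widespread drift -> Slack;
# drift plus degradation -> page
print(alert_tier(1, 10, 0.03, 0.05, False))  # 1
print(alert_tier(4, 10, 0.50, 0.10, False))  # 2
print(alert_tier(1, 10, 0.50, 0.30, True))   # 3
```

Keeping the routing in one pure function makes the escalation policy testable and easy to review when thresholds inevitably get tuned.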

Retraining Triggers

Detecting drift is half the problem. The other half is deciding when to retrain. Here are the three strategies, ranked by preference:

Performance-based triggers (best): Retrain when your evaluation metric drops below a threshold on a labeled holdout set or delayed ground truth. This is the gold standard because it directly measures what you care about. The downside is you need labels, which might arrive with a delay.

Drift-based triggers (good default): Retrain when PSI exceeds 0.2 across critical features for three consecutive monitoring windows. The consecutive requirement prevents retraining on transient spikes. This works when you don’t have fast access to ground truth labels.

Calendar-based triggers (last resort): Retrain weekly or monthly regardless of drift signals. Simple and predictable, but wasteful if the model hasn’t degraded and dangerous if drift happens between scheduled retrains.

My recommendation: use performance-based triggers as your primary signal and drift-based triggers as an early warning system. If you can’t get labels faster than monthly, drift monitoring becomes your primary defense.
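The "three consecutive windows" rule from the drift-based trigger can be sketched as a small stateful check. The class name is made up, the thresholds follow the text above, and the PSI sequence is illustrative:

```python
from collections import deque

class DriftTrigger:
    """Fire a retrain signal after N consecutive windows above a PSI threshold."""

    def __init__(self, psi_threshold: float = 0.2, consecutive: int = 3):
        self.psi_threshold = psi_threshold
        self.history = deque(maxlen=consecutive)

    def update(self, psi: float) -> bool:
        self.history.append(psi > self.psi_threshold)
        # Trigger only when the window is full and every entry breached
        return len(self.history) == self.history.maxlen and all(self.history)

trigger = DriftTrigger()
for psi in [0.25, 0.15, 0.22, 0.24, 0.31]:
    if trigger.update(psi):
        # Fires only once the last three windows all exceed 0.2
        print(f"Retrain triggered at PSI={psi}")
```

The deque's maxlen handles the sliding window for free, so a single transient spike (like the 0.25 at the start) can never trigger a retrain on its own.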

Common Errors

ValueError: column not found in reference data – Your production data schema doesn’t match the reference data. New features were added or column names changed. Fix: validate schemas match before running drift detection. Use set(reference.columns) == set(current.columns) as a pre-check.

PSI returns inf or nan – A bin ends up with zero samples in one of the distributions, so its proportion is zero and the log term blows up. Fix: add an epsilon value (like 1e-6) to proportions before computing PSI, or use fewer bins so each bin has adequate samples.

TypeError: Cannot compare types 'float64' and 'object' – Mixed types in a column, often from nulls being read as strings. Fix: enforce dtypes before running tests with df[col] = pd.to_numeric(df[col], errors='coerce').

Evidently test suite passes but model performance is degraded – You’re testing the wrong features. Drift in low-importance features won’t affect predictions. Fix: focus drift monitoring on your top features by importance score. Use TestColumnDrift on specific columns rather than relying on the blanket DataDriftPreset.

False positive drift alerts after deploying a new feature pipeline – Legitimate schema changes trigger drift detection. Fix: update your reference dataset after intentional pipeline changes. Version your reference data alongside your model artifacts.