Your model worked great three months ago. Now accuracy is quietly tanking and nobody noticed until a customer complained. That’s drift – the slow poison of production ML. The training data no longer represents what the model sees in the real world, and predictions degrade without any code change or deployment event triggering an alert.
Detecting drift early is the difference between a quick retrain and a weeks-long fire drill. Here’s how to set it up properly.
Types of Drift You Need to Watch
There are three distinct failure modes, and they require different detection strategies:
Data drift (also called covariate shift) – the input feature distributions change. Maybe your e-commerce model trained on US traffic starts getting requests from a new market. The model’s inputs look different from training data, even if the relationship between inputs and outputs hasn’t changed.
Concept drift – the relationship between inputs and outputs changes. Your fraud model learned that transactions over $500 from new accounts are suspicious. Then the fraud pattern shifts to small recurring charges. The features look the same, but the correct labels have changed.
Prediction drift – the model’s output distribution shifts. Even without ground truth labels, you can monitor whether the model is predicting class A 80% of the time when it used to predict it 40% of the time. This is the cheapest signal to monitor because you don’t need labels.
My recommendation: monitor all three. Data drift and prediction drift are free (no labels needed). Concept drift requires delayed ground truth, but it’s the one that actually tells you the model is wrong.
Quick Drift Detection with Evidently AI
Evidently is the best open-source library for drift detection. It handles the statistical testing for you and generates reports you can plug into dashboards or CI pipelines.
Install it and run your first drift report in under 20 lines:
Evidently automatically picks a statistical test based on feature type and sample size. For smaller samples (roughly a thousand rows or fewer) it defaults to the Kolmogorov-Smirnov test for numerical features and chi-squared for categorical ones; for larger samples it switches to distance-based metrics such as Wasserstein distance and Jensen-Shannon divergence. You can override these, but the defaults are solid.
Statistical Tests Under the Hood
Understanding the tests helps you tune thresholds and debug false positives.
Kolmogorov-Smirnov (KS) test – compares the cumulative distribution functions of two samples. The test statistic is the maximum distance between the two CDFs. A p-value below your threshold (typically 0.05) means the distributions are statistically different. KS works well for continuous features but struggles with heavy ties in discrete data.
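A two-sample KS check is a one-liner with SciPy's `ks_2samp` (SciPy is an assumption here; any stats library with a two-sample KS test works the same way):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
reference = rng.normal(0, 1, 2000)    # training-time feature values
drifted = rng.normal(0.5, 1, 2000)    # production values with a mean shift

# The statistic is the max vertical gap between the two empirical CDFs
stat, p_value = ks_2samp(reference, drifted)
if p_value < 0.05:
    print(f"drift detected: KS statistic={stat:.3f}, p={p_value:.2e}")
```

With a half-standard-deviation mean shift and 2,000 samples per side, the p-value comes out far below 0.05, which is exactly the regime where KS shines.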
Population Stability Index (PSI) – measures how much a distribution has shifted. Unlike KS, PSI gives you a single magnitude number. The standard thresholds: PSI < 0.1 means no significant shift, 0.1-0.2 means moderate drift worth investigating, and PSI > 0.2 means significant drift requiring action.
PSI is my preferred metric for production monitoring because it’s intuitive, gives you a single magnitude per feature, and the thresholds are well-established across industry. Here’s how to compute it manually:
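A minimal NumPy implementation, binning on the reference distribution (the `psi` helper name is mine):

```python
import numpy as np

def psi(reference, current, bins=10):
    """Population Stability Index between two samples.

    Bins are derived from the reference distribution; values in `current`
    that fall outside the reference range are simply not counted.
    Thresholds: < 0.1 no shift, 0.1-0.2 moderate, > 0.2 significant.
    """
    edges = np.histogram_bin_edges(reference, bins=bins)
    ref_counts, _ = np.histogram(reference, bins=edges)
    cur_counts, _ = np.histogram(current, bins=edges)
    eps = 1e-6  # guards against zero bins -> division by zero / log(0)
    ref_pct = ref_counts / ref_counts.sum() + eps
    cur_pct = cur_counts / cur_counts.sum() + eps
    return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))
```

A one-standard-deviation mean shift on a unit normal pushes PSI well past the 0.2 action threshold, while a fresh sample from the same distribution stays near zero.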
Jensen-Shannon divergence – a symmetric, bounded version of KL divergence. With a base-2 logarithm it ranges from 0 (identical) to 1 (completely different). Use this when you want a clean 0-1 metric for dashboards.
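SciPy ships `scipy.spatial.distance.jensenshannon`, but note it returns the JS *distance* — the square root of the divergence — so square it if you want the divergence itself. A small helper (the `js_divergence` name is mine; SciPy availability is assumed):

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def js_divergence(reference, current, bins=20):
    """Jensen-Shannon divergence between two samples, binned on shared edges."""
    # Shared edges so both histograms describe the same support
    edges = np.histogram_bin_edges(np.concatenate([reference, current]), bins=bins)
    p, _ = np.histogram(reference, bins=edges)
    q, _ = np.histogram(current, bins=edges)
    # scipy normalizes p and q internally; base=2 keeps the result in [0, 1];
    # square the returned distance to get the divergence
    return jensenshannon(p, q, base=2) ** 2
```

`js_divergence(reference_sample, current_sample)` then plots cleanly on a fixed 0-1 dashboard axis, which is the whole appeal.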
Setting Up a Drift Monitoring Pipeline
A one-off report is useful for debugging. For production, you need automated monitoring that runs on a schedule and alerts you when drift crosses thresholds.
Here’s a monitoring pipeline using Evidently’s test suite approach, which gives you pass/fail verdicts you can wire into alerting:
Schedule this with cron, Airflow, or Prefect. For most use cases, daily checks are sufficient. If you’re processing high-volume real-time traffic, run it hourly against a sliding window.
Alerting Strategy That Actually Works
Don’t alert on every statistical test failure. You’ll drown in noise and start ignoring alerts within a week. Instead, build a tiered system:
Tier 1 – Log only: A single feature drifts with p-value between 0.01 and 0.05. This is informational. Log it, put it on a dashboard, move on.
Tier 2 – Slack notification: More than 30% of features drift, or a critical feature (one with high feature importance) drifts with p-value < 0.01. Someone should look at it within a day.
Tier 3 – PagerDuty / on-call alert: Prediction drift exceeds PSI 0.25 AND performance metrics (accuracy, precision, recall) have degraded against a holdout set. This means the model is actively producing worse results. Someone needs to act now.
The key insight: drift alone doesn’t mean the model is broken. Seasonal patterns, marketing campaigns, and product launches all cause legitimate distribution shifts. Only escalate when drift correlates with performance degradation.
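The tiers above can be sketched as a small routing function (names and return values are illustrative; the thresholds are the ones from the tiers as described):

```python
def route_alert(feature_pvalues, prediction_psi, performance_degraded,
                critical_features, drift_share_threshold=0.3):
    """Map drift results to an alert tier.

    feature_pvalues: dict of feature name -> drift-test p-value.
    critical_features: set of high-importance feature names.
    """
    drifted = {f: p for f, p in feature_pvalues.items() if p < 0.05}
    drift_share = len(drifted) / max(len(feature_pvalues), 1)
    critical_hit = any(p < 0.01 for f, p in drifted.items()
                       if f in critical_features)

    if prediction_psi > 0.25 and performance_degraded:
        return "tier3-page"    # model is actively producing worse results
    if drift_share > drift_share_threshold or critical_hit:
        return "tier2-slack"   # someone should look within a day
    if drifted:
        return "tier1-log"     # informational: dashboard only
    return "ok"
```

The ordering matters: check the pager condition first so a genuinely degraded model never gets downgraded to a Slack ping by the cheaper rules.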
Retraining Triggers
Detecting drift is half the problem. The other half is deciding when to retrain. Here are the three strategies, ranked by preference:
Performance-based triggers (best): Retrain when your evaluation metric drops below a threshold on a labeled holdout set or delayed ground truth. This is the gold standard because it directly measures what you care about. The downside is you need labels, which might arrive with a delay.
Drift-based triggers (good default): Retrain when PSI exceeds 0.2 across critical features for three consecutive monitoring windows. The consecutive requirement prevents retraining on transient spikes. This works when you don’t have fast access to ground truth labels.
Calendar-based triggers (last resort): Retrain weekly or monthly regardless of drift signals. Simple and predictable, but wasteful if the model hasn’t degraded and dangerous if drift happens between scheduled retrains.
My recommendation: use performance-based triggers as your primary signal and drift-based triggers as an early warning system. If you can’t get labels faster than monthly, drift monitoring becomes your primary defense.
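The “three consecutive windows” rule from the drift-based trigger can be sketched as a tiny stateful helper (illustrative, not a library API):

```python
from collections import deque

class DriftRetrainTrigger:
    """Fire a retrain only after PSI exceeds a threshold for N consecutive windows.

    A single spike resets nothing explicitly -- the fixed-length deque simply
    ages it out, so only an unbroken run of breaches trips the trigger.
    """
    def __init__(self, psi_threshold=0.2, consecutive_windows=3):
        self.psi_threshold = psi_threshold
        self.history = deque(maxlen=consecutive_windows)

    def observe(self, psi_value: float) -> bool:
        """Record one monitoring window's PSI; return True if retrain should fire."""
        self.history.append(psi_value)
        return (len(self.history) == self.history.maxlen
                and all(v > self.psi_threshold for v in self.history))
```

Call `observe()` once per monitoring window from your scheduled job; a transient spike followed by a quiet window never fires, which is exactly the point of the consecutive requirement.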
Common Errors
ValueError: column not found in reference data – Your production data schema doesn’t match the reference data. New features were added or column names changed. Fix: validate schemas match before running drift detection. Use set(reference.columns) == set(current.columns) as a pre-check.
PSI returns inf or nan – A bin in your reference data has zero samples, causing division by zero in the log calculation. Fix: add an epsilon value (like 1e-6) to proportions before computing PSI, or use fewer bins so each bin has adequate samples.
TypeError: Cannot compare types 'float64' and 'object' – Mixed types in a column, often from nulls being read as strings. Fix: enforce dtypes before running tests with df[col] = pd.to_numeric(df[col], errors='coerce').
Evidently test suite passes but model performance is degraded – You’re testing the wrong features. Drift in low-importance features won’t affect predictions. Fix: focus drift monitoring on your top features by importance score. Use TestColumnDrift on specific columns rather than relying on the blanket DataDriftPreset.
False positive drift alerts after deploying a new feature pipeline – Legitimate schema changes trigger drift detection. Fix: update your reference dataset after intentional pipeline changes. Version your reference data alongside your model artifacts.
Related Guides
- How to A/B Test LLM Prompts and Models in Production
- How to Build a Model Drift Alert Pipeline with Evidently and FastAPI
- How to Build a Model Rollback Pipeline with Health Checks
- How to Build a Model Compression Pipeline with Pruning and Quantization
- How to Build a Model Configuration Management Pipeline with Hydra
- How to Serve LLMs in Production with SGLang
- How to Build a Model Dependency Scanner and Vulnerability Checker
- How to Build a Model Feature Store Pipeline with Redis and FastAPI
- How to Route LLM Traffic by Cost and Complexity Using Intelligent Model Routing
- How to Build a Model Performance Alerting Pipeline with Webhooks