Models degrade in production. Features shift, upstream data pipelines break, and prediction quality drops before anyone notices. Evidently AI gives you a straightforward way to detect these problems with Reports (visual analysis) and Test Suites (pass/fail checks you can wire into CI/CD).
Here’s how to get monitoring running in under 10 minutes.
## Install Evidently and Generate Your First Report
If you don't have Evidently yet, `pip install evidently` gets you the latest release. Then create a data quality report with realistic synthetic data right away:
```python
import pandas as pd
import numpy as np
from evidently.report import Report
from evidently.metric_preset import DataQualityPreset

# Simulate reference (training) and current (production) data
np.random.seed(42)
reference = pd.DataFrame({
    "age": np.random.normal(35, 10, 1000).astype(int),
    "income": np.random.normal(55000, 15000, 1000),
    "credit_score": np.random.normal(700, 50, 1000).astype(int),
    "approved": np.random.choice([0, 1], 1000, p=[0.3, 0.7]),
})
current = pd.DataFrame({
    "age": np.random.normal(38, 12, 500).astype(int),
    "income": np.random.normal(52000, 18000, 500),
    "credit_score": np.random.normal(680, 60, 500).astype(int),
    "approved": np.random.choice([0, 1], 500, p=[0.4, 0.6]),
})

report = Report(metrics=[DataQualityPreset()])
report.run(reference_data=reference, current_data=current)
report.save_html("data_quality_report.html")
```
Open data_quality_report.html in your browser. You get missing values, duplicates, feature distributions, and correlations – all in one shot.
## Create Drift Detection Reports
Drift detection is the core reason most teams adopt Evidently. You can check for data drift, target drift, or both at once.
```python
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset, TargetDriftPreset

drift_report = Report(metrics=[
    DataDriftPreset(),
    TargetDriftPreset(),
])
drift_report.run(reference_data=reference, current_data=current)
drift_report.save_html("drift_report.html")

# Programmatic access to results
results = drift_report.as_dict()
dataset_drift = results["metrics"][0]["result"]["dataset_drift"]
print(f"Dataset drift detected: {dataset_drift}")
```
DataDriftPreset runs a statistical test on every feature column, picking a default based on column type and dataset size: for small samples, Kolmogorov-Smirnov for numerical features and chi-squared for categorical; for larger samples, Wasserstein distance and Jensen-Shannon divergence. You can override this:
```python
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

drift_report = Report(metrics=[
    DataDriftPreset(
        columns=["age", "income", "credit_score"],
        stattest="psi",
        stattest_threshold=0.1,
    )
])
drift_report.run(reference_data=reference, current_data=current)
```
The stattest parameter accepts "psi" (Population Stability Index), "ks" (Kolmogorov-Smirnov), "wasserstein", "jensenshannon", and several others. PSI is a good default for production monitoring because it’s well-understood by risk teams and has clear threshold conventions (under 0.1 is stable, 0.1-0.2 is moderate, above 0.2 is significant).
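To see where those PSI bands come from, here is a minimal hand-rolled PSI over bins taken from the reference distribution — a sketch for intuition, not Evidently's internal implementation:

```python
import numpy as np

def psi(reference, current, bins=10):
    """Population Stability Index over bins derived from the reference.

    Outer edges are widened to +/-inf so out-of-range current values
    still land in a bin; a small epsilon avoids log(0) on empty bins.
    """
    edges = np.histogram_bin_edges(reference, bins=bins)
    edges[0], edges[-1] = -np.inf, np.inf
    ref_pct = np.histogram(reference, bins=edges)[0] / len(reference)
    cur_pct = np.histogram(current, bins=edges)[0] / len(current)
    eps = 1e-6
    ref_pct = np.clip(ref_pct, eps, None)
    cur_pct = np.clip(cur_pct, eps, None)
    return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))

rng = np.random.default_rng(0)
stable = psi(rng.normal(0, 1, 5000), rng.normal(0, 1, 5000))
shifted = psi(rng.normal(0, 1, 5000), rng.normal(0.5, 1, 5000))
print(f"stable: {stable:.3f}, shifted: {shifted:.3f}")
```

An identical distribution lands near zero (well under the 0.1 "stable" line), while a half-standard-deviation mean shift pushes PSI past the alert bands.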
## Build Custom Metric Presets for Classification and Regression
For classification models, combine quality metrics with drift detection:
```python
from evidently import ColumnMapping
from evidently.report import Report
from evidently.metric_preset import ClassificationPreset, DataDriftPreset

# Add a prediction column. In production this comes from your model;
# here we reuse the labels so the demo runs end to end.
reference["prediction"] = reference["approved"]
current["prediction"] = current["approved"]

column_mapping = ColumnMapping(
    target="approved",
    prediction="prediction",
)

clf_report = Report(metrics=[
    ClassificationPreset(),
    DataDriftPreset(),
])
clf_report.run(
    reference_data=reference,
    current_data=current,
    column_mapping=column_mapping,
)
clf_report.save_html("classification_report.html")
```
ClassificationPreset gives you accuracy, precision, recall, F1, confusion matrix, and PR/ROC curves in one pass. For regression, swap in RegressionPreset:
```python
from evidently.metric_preset import RegressionPreset

# For a regression use case. reg_reference / reg_current are your own
# DataFrames containing both the target and a predicted_income column.
reg_column_mapping = ColumnMapping(
    target="income",
    prediction="predicted_income",
)

reg_report = Report(metrics=[RegressionPreset()])
reg_report.run(
    reference_data=reg_reference,
    current_data=reg_current,
    column_mapping=reg_column_mapping,
)
```
This produces MAE, RMSE, MAPE, error distributions, and residual plots.
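The headline numbers reduce to a few NumPy one-liners — a sketch of what the preset computes, not Evidently's implementation:

```python
import numpy as np

y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 330.0, 380.0])

errors = y_pred - y_true
mae = np.mean(np.abs(errors))                    # mean absolute error
rmse = np.sqrt(np.mean(errors ** 2))             # root mean squared error
mape = np.mean(np.abs(errors / y_true)) * 100    # assumes no zero targets

print(f"MAE={mae:.1f} RMSE={rmse:.1f} MAPE={mape:.1f}%")
```

RMSE penalizes large errors more than MAE, so a growing RMSE/MAE gap is itself a signal that outlier errors are increasing.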
## Set Up Test Suites with Pass/Fail Thresholds
Reports are for exploration. Test Suites are for automation – they return pass or fail, which makes them perfect for CI/CD gates.
```python
from evidently.test_suite import TestSuite
from evidently.test_preset import DataStabilityTestPreset, DataQualityTestPreset
from evidently.tests import (
    TestShareOfDriftedColumns,
    TestColumnDrift,
    TestShareOfMissingValues,
)

suite = TestSuite(tests=[
    DataStabilityTestPreset(),
    DataQualityTestPreset(),
    TestShareOfDriftedColumns(lt=0.3),
    TestColumnDrift(column_name="credit_score"),
    TestShareOfMissingValues(lt=0.05),
])
suite.run(reference_data=reference, current_data=current)
suite.save_html("test_results.html")

# Check if all tests passed
test_results = suite.as_dict()
all_passed = all(
    t["status"] == "SUCCESS"
    for t in test_results["tests"]
)
print(f"All tests passed: {all_passed}")
DataStabilityTestPreset checks for new columns, missing columns, changed types, and out-of-range values. DataQualityTestPreset validates missing values, duplicates, and constant columns. The custom tests let you set explicit thresholds – lt=0.3 means fewer than 30% of columns can show drift.
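When something fails, you usually want the names of the failing tests, not just a boolean. You can filter the `as_dict()` output for that — sketched here against a mock dict with the status field shown above, since the exact result shape can vary between Evidently versions:

```python
def failed_tests(results: dict) -> list[str]:
    """Names of tests that hard-failed; WARNING-level results pass."""
    return [
        t["name"]
        for t in results["tests"]
        if t["status"] in ("FAIL", "ERROR")
    ]

# Mock of the as_dict() shape used above
mock = {"tests": [
    {"name": "Share of Drifted Columns", "status": "SUCCESS"},
    {"name": "Drift per Column", "status": "FAIL"},
    {"name": "Share of Missing Values", "status": "WARNING"},
]}
print(failed_tests(mock))
```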
You can also set test criticality so non-critical failures produce warnings instead of hard fails:
```python
from evidently.tests import TestColumnShareOfMissingValues

suite = TestSuite(tests=[
    TestColumnShareOfMissingValues(
        column_name="credit_score",
        lt=0.01,
        is_critical=True,
    ),
    TestColumnShareOfMissingValues(
        column_name="age",
        lt=0.05,
        is_critical=False,  # warning only
    ),
])
```
## Integrate Evidently into CI/CD Pipelines
Wire the test suite into a GitHub Actions workflow to block deployments when data quality drops:
```yaml
# .github/workflows/model-monitoring.yml
name: Model Quality Gate

on:
  schedule:
    - cron: "0 6 * * *"  # daily at 6 AM UTC
  workflow_dispatch:

jobs:
  monitoring:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - name: Install dependencies
        run: pip install evidently pandas scikit-learn
      - name: Run monitoring checks
        run: python scripts/run_monitoring.py
      - name: Upload reports
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: monitoring-reports
          path: reports/
```
The run_monitoring.py script loads fresh production data, runs your test suite, and exits with a non-zero code on failure:
```python
import sys
from pathlib import Path

import pandas as pd
from evidently.test_suite import TestSuite
from evidently.test_preset import DataStabilityTestPreset
from evidently.tests import TestShareOfDriftedColumns

reference = pd.read_parquet("data/reference.parquet")
current = pd.read_parquet("data/current_batch.parquet")

suite = TestSuite(tests=[
    DataStabilityTestPreset(),
    TestShareOfDriftedColumns(lt=0.3),
])
suite.run(reference_data=reference, current_data=current)

Path("reports").mkdir(exist_ok=True)  # the workflow uploads this directory
suite.save_html("reports/monitoring_report.html")

results = suite.as_dict()
failed = any(t["status"] in ("FAIL", "ERROR") for t in results["tests"])
if failed:
    print("MONITORING FAILED: Data quality issues detected")
    sys.exit(1)
print("All monitoring checks passed")
```
This is the pattern that catches drift before it hits users. Schedule it daily or trigger it whenever new data lands.
## Run Evidently as a Monitoring Service
For ongoing monitoring with a visual dashboard, Evidently provides a self-hosted UI. You create a Workspace, add projects, and send report snapshots over time.
Set up the workspace and project:
```python
from evidently.ui.workspace import Workspace

# Create a local workspace
ws = Workspace.create("my_monitoring")

# Create a project for your model
project = ws.create_project("Credit Approval Model")
project.description = "Production monitoring for the credit approval classifier"
project.save()
```
Then, in your batch monitoring script, generate and save snapshots:
```python
from datetime import datetime

from evidently.report import Report
from evidently.metric_preset import DataDriftPreset, DataQualityPreset

report = Report(
    metrics=[
        DataDriftPreset(),
        DataQualityPreset(),
    ],
    timestamp=datetime.now(),  # shown on the dashboard's time axis
)
report.run(reference_data=reference, current_data=current)

# Save the snapshot to the workspace
ws.add_report(project.id, report)
```
Launch the dashboard:
```shell
evidently ui --workspace ./my_monitoring
```
Open http://localhost:8000 and you get a time-series dashboard showing how your metrics evolve across snapshots. Each project gets its own page with panels you can customize.
For a quick demo without any setup:
```shell
evidently ui --demo-projects all
```
This spins up the UI with sample projects so you can explore the interface before wiring in your own data.
## Common Errors and Fixes
ValueError: reference_data should be provided – Drift detection requires reference data. If you only have current data, use DataQualityPreset instead, which works without a reference:
```python
report = Report(metrics=[DataQualityPreset()])
report.run(reference_data=None, current_data=current)
```
KeyError: column not found – Your column mapping references a column that doesn’t exist in the DataFrame. Double-check column names with df.columns.tolist() before running reports.
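A quick pre-flight check catches this before Evidently does — a hypothetical helper, not part of the library:

```python
import pandas as pd

def missing_mapping_columns(df: pd.DataFrame, required: set[str]) -> set[str]:
    """Return the mapping columns that are absent from the DataFrame."""
    return required - set(df.columns)

df = pd.DataFrame({"age": [30, 45], "approved": [1, 0]})
missing = missing_mapping_columns(df, {"approved", "prediction"})
print(missing)  # the prediction column isn't in the frame
```

Run it on both reference and current data, since the mapping must resolve against each.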
TypeError: Cannot compare types – This happens when a column has mixed types (strings and numbers). Clean your data before feeding it to Evidently:
```python
df["age"] = pd.to_numeric(df["age"], errors="coerce")
```
Test suite reports all FAIL with no reference – DataStabilityTestPreset relies on reference data to set baselines. Without it, every range check fails because there are no expected bounds. Always provide reference data for stability tests.
HTML report won’t render – If save_html() produces a blank page, you’re likely running in a headless environment. The HTML is self-contained and doesn’t need a server – just open it in any browser. If you need programmatic results, use as_dict() or json() instead.
Import errors after upgrading – Evidently v0.5+ restructured some internals. If you see ImportError: cannot import name 'Dashboard', you’re mixing the old Dashboard API (pre-0.2) with the current Report API. Stick with from evidently.report import Report and from evidently.metric_preset import DataDriftPreset.