Why a Model CI Pipeline Matters

Every ML team hits the same wall: someone trains a model locally, drops a pickle file in Slack, and nobody can reproduce the results two weeks later. A model CI pipeline fixes this. You define your train and evaluate steps in a dvc.yaml file, and GitHub Actions runs dvc repro on every pull request. CML (Continuous Machine Learning) posts the metrics diff right on the PR so reviewers see exactly how accuracy changed.

Here’s the project structure you’ll end up with:

```
my-ml-project/
├── .github/workflows/model-ci.yml
├── data/
│   └── train.csv          # tracked by DVC
├── src/
│   ├── train.py
│   └── evaluate.py
├── dvc.yaml
├── dvc.lock
├── params.yaml
└── requirements.txt
```

Set Up DVC to Track Data and Models

Start by initializing DVC in your Git repo and configuring a remote. If you already have DVC set up, skip to the next section.

```bash
pip install dvc dvc-s3
cd my-ml-project
dvc init

# Point DVC to your S3 bucket (or GCS, Azure, etc.)
dvc remote add -d storage s3://my-ml-bucket/dvc-store

# Track your training data
dvc add data/train.csv

# Commit the pointer files (dvc.lock doesn't exist yet; it's created by dvc repro later)
git add data/train.csv.dvc data/.gitignore .dvc/
git commit -m "track training data with dvc"
dvc push
```

DVC creates a .dvc pointer file for train.csv and stores the actual data in S3. Git never sees the raw data – just the hash reference.
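For reference, the generated data/train.csv.dvc pointer is just a few lines of YAML along these lines (the hash and size below are placeholders, not real values):

```yaml
# data/train.csv.dvc (illustrative; hash and size are placeholders)
outs:
- md5: 1a2b3c4d...
  size: 1048576
  hash: md5
  path: train.csv
```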

Define a DVC Pipeline

The dvc.yaml file declares your pipeline stages. Each stage specifies its dependencies, command, outputs, and metrics. DVC uses this to figure out what needs rerunning when inputs change.

```yaml
# dvc.yaml
stages:
  train:
    cmd: python src/train.py
    deps:
      - src/train.py
      - data/train.csv
    params:
      - train.n_estimators
      - train.max_depth
      - train.test_size
    outs:
      - models/model.pkl
    plots:
      - results/training_loss.csv:
          x: epoch
          y: loss

  evaluate:
    cmd: python src/evaluate.py
    deps:
      - src/evaluate.py
      - models/model.pkl
      - data/train.csv
    metrics:
      - results/metrics.json:
          cache: false
    plots:
      - results/confusion_matrix.csv:
          x: predicted
          y: actual
```

Note that params.yaml is not listed under deps; the params section already tracks the individual keys, and adding the whole file as a dependency would force a rerun on any edit to it, tracked key or not.

Your params.yaml holds the hyperparameters:

```yaml
# params.yaml
train:
  n_estimators: 200
  max_depth: 10
  test_size: 0.2
```

And here’s a minimal src/train.py that reads those params:

```python
# src/train.py
import os
import pickle

import pandas as pd
import yaml
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Load params
with open("params.yaml") as f:
    params = yaml.safe_load(f)["train"]

# Load data
df = pd.read_csv("data/train.csv")
X = df.drop(columns=["target"])
y = df["target"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=params["test_size"], random_state=42
)

# Train
clf = RandomForestClassifier(
    n_estimators=params["n_estimators"],
    max_depth=params["max_depth"],
    random_state=42,
)
clf.fit(X_train, y_train)

# Save model
os.makedirs("models", exist_ok=True)
with open("models/model.pkl", "wb") as f:
    pickle.dump(clf, f)

# Save training loss placeholder (sklearn doesn't expose epoch losses)
os.makedirs("results", exist_ok=True)
pd.DataFrame({"epoch": [1], "loss": [1 - clf.score(X_train, y_train)]}).to_csv(
    "results/training_loss.csv", index=False
)
```

The evaluate script loads the model and writes metrics:

```python
# src/evaluate.py
import json
import os
import pickle

import pandas as pd
import yaml
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score
from sklearn.model_selection import train_test_split

with open("params.yaml") as f:
    params = yaml.safe_load(f)["train"]

df = pd.read_csv("data/train.csv")
X = df.drop(columns=["target"])
y = df["target"]

# Recreate the same split as train.py (same test_size and random_state)
_, X_test, _, y_test = train_test_split(
    X, y, test_size=params["test_size"], random_state=42
)

with open("models/model.pkl", "rb") as f:
    clf = pickle.load(f)

preds = clf.predict(X_test)

os.makedirs("results", exist_ok=True)

# Write metrics
metrics = {
    "accuracy": round(accuracy_score(y_test, preds), 4),
    "f1_score": round(f1_score(y_test, preds, average="weighted"), 4),
}
with open("results/metrics.json", "w") as f:
    json.dump(metrics, f, indent=2)

# Write confusion matrix in long format for plotting
labels = sorted(y.unique())
cm = confusion_matrix(y_test, preds, labels=labels)
rows = []
for i, actual_label in enumerate(labels):
    for j, pred_label in enumerate(labels):
        rows.append({"actual": actual_label, "predicted": pred_label, "count": int(cm[i][j])})
pd.DataFrame(rows).to_csv("results/confusion_matrix.csv", index=False)
```

Run the pipeline locally to make sure it works:

```bash
dvc repro
```

DVC executes train then evaluate, caches the outputs, and writes dvc.lock with the exact hashes. Commit everything:

```bash
git add dvc.yaml dvc.lock params.yaml results/metrics.json
git commit -m "add dvc pipeline"
```
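The generated dvc.lock pins each stage's command, dependencies, params, and outputs by hash. A trimmed, illustrative excerpt (the hashes and sizes are placeholders):

```yaml
# dvc.lock (illustrative excerpt; hashes and sizes are placeholders)
schema: '2.0'
stages:
  train:
    cmd: python src/train.py
    deps:
    - path: data/train.csv
      hash: md5
      md5: 1a2b3c4d...
      size: 1048576
    params:
      params.yaml:
        train.max_depth: 10
        train.n_estimators: 200
        train.test_size: 0.2
    outs:
    - path: models/model.pkl
      hash: md5
      md5: 5e6f7a8b...
      size: 2097152
```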

GitHub Actions Workflow with CML

This is where it all comes together. The workflow runs dvc repro on every PR, compares metrics against the base branch, and posts a report as a PR comment using CML.

Create .github/workflows/model-ci.yml:

```yaml
# .github/workflows/model-ci.yml
name: Model CI

on:
  pull_request:
    branches: [main]

permissions:
  contents: read
  pull-requests: write

jobs:
  train-evaluate:
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0  # needed to diff metrics against the base branch

      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
          cache: pip

      - name: Install dependencies
        run: |
          pip install -r requirements.txt
          pip install dvc dvc-s3

      - uses: iterative/setup-cml@v2

      - name: Pull DVC data
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: dvc pull

      - name: Reproduce pipeline
        run: dvc repro

      - name: Generate CML report
        env:
          REPO_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: |
          echo "## Model CI Report" >> report.md
          echo "" >> report.md

          # Metrics comparison table against the base branch
          echo "### Metrics (this branch vs main)" >> report.md
          dvc metrics diff origin/main --md >> report.md || echo "No previous metrics to compare." >> report.md
          echo "" >> report.md

          # Parameter changes against the base branch
          echo "### Parameter Changes" >> report.md
          dvc params diff origin/main --md >> report.md || echo "No parameter changes." >> report.md
          echo "" >> report.md

          # Attach plot data files (CSV assets render as links, not inline images)
          echo "### Confusion Matrix" >> report.md
          cml asset publish results/confusion_matrix.csv --md >> report.md || true
          echo "" >> report.md

          echo "### Training Loss" >> report.md
          cml asset publish results/training_loss.csv --md >> report.md || true
          echo "" >> report.md

          # Post the report as a PR comment
          cml comment create report.md
```

A few things to note about this workflow:

  • fetch-depth: 0 is required so DVC can compare metrics between the PR branch and main. Without it, dvc metrics diff has nothing to diff against.
  • dvc pull fetches the actual data files from your remote storage. You need the AWS credentials (or equivalent for GCS/Azure) stored as GitHub Secrets.
  • dvc repro reruns only the stages whose dependencies changed. Edit only src/evaluate.py, for example, and DVC restores the cached model and reruns evaluate alone, skipping train entirely.
  • cml comment create posts the entire report.md as a comment on the PR. Reviewers see accuracy, F1, and parameter diffs without leaving GitHub.
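If you also want the job to fail outright when quality regresses, not just surface the diff, a small gate script can read results/metrics.json after the repro step and return a non-zero status. This is an optional extension, not part of the workflow above; the check_metrics.py name and the 0.90 floor are illustrative choices:

```python
# check_metrics.py -- optional CI gate (file name and 0.90 floor are illustrative)
import json


def check_metrics(path="results/metrics.json", min_accuracy=0.90):
    """Return 0 if accuracy meets the floor, 1 otherwise."""
    with open(path) as f:
        metrics = json.load(f)
    accuracy = metrics["accuracy"]
    if accuracy < min_accuracy:
        print(f"FAIL: accuracy {accuracy} is below the floor {min_accuracy}")
        return 1
    print(f"OK: accuracy {accuracy}")
    return 0
```

Run it in a workflow step after dvc repro with a small entry point (e.g. `raise SystemExit(check_metrics())`) so a failing check fails the run.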

Comparing Metrics Between Branches

The real power shows up when you open a PR that changes hyperparameters. Say you create a branch and bump n_estimators from 200 to 500:

```bash
git checkout -b experiment/more-trees
```

Edit params.yaml:

```yaml
# params.yaml
train:
  n_estimators: 500
  max_depth: 10
  test_size: 0.2
```

Push the branch, open a PR, and the workflow fires. The dvc metrics diff output in your PR comment looks like:

```
| Path                 | Metric   | origin/main | workspace | Change |
|----------------------|----------|-------------|-----------|--------|
| results/metrics.json | accuracy | 0.9185      | 0.9312    | 0.0127 |
| results/metrics.json | f1_score | 0.9170      | 0.9298    | 0.0128 |
```

Reviewers see at a glance that accuracy improved by 1.27 percentage points, and the parameter diff shows exactly what changed. No more guessing which knobs were turned.
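The accompanying parameter diff in the comment would look roughly like this for the change above (column headers vary slightly by DVC version):

```
| Path        | Param              | origin/main | workspace |
|-------------|--------------------|-------------|-----------|
| params.yaml | train.n_estimators | 200         | 500       |
```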

You can also run dvc plots diff locally to generate HTML comparison plots:

```bash
dvc plots diff main --open
```

This opens a browser with side-by-side plots for the confusion matrix and training loss between your branch and main.

Common Errors and Fixes

ERROR: failed to reproduce 'train': data/train.csv not found

You forgot to run dvc pull before dvc repro. The data files aren’t in your working directory – they’re in remote storage. Add dvc pull to your workflow before the reproduce step, and make sure your cloud credentials are set.

dvc metrics diff returns empty output

This happens when fetch-depth is set to 1 (the default for actions/checkout). DVC needs the full Git history to find the metrics file on the base branch. Set fetch-depth: 0 in your checkout step.

cml comment create fails with 403

CML needs write access to pull requests. Add these permissions to your workflow:

```yaml
permissions:
  contents: read
  pull-requests: write
```

Also make sure you’re passing REPO_TOKEN: ${{ secrets.GITHUB_TOKEN }} to the step.

ERROR: failed to push data to remote

Check that your S3 bucket name and credentials are correct. Run dvc remote list to verify the remote URL. A common mistake is using a bucket name that doesn’t exist or using credentials that only have read access. Your IAM user needs s3:PutObject and s3:GetObject permissions on the bucket.
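A minimal IAM policy sketch for that access follows; the bucket name matches the earlier example, and DVC typically also needs s3:ListBucket to see what already exists in the remote:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject"],
      "Resource": "arn:aws:s3:::my-ml-bucket/dvc-store/*"
    },
    {
      "Effect": "Allow",
      "Action": "s3:ListBucket",
      "Resource": "arn:aws:s3:::my-ml-bucket"
    }
  ]
}
```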

Pipeline runs every stage even when nothing changed

DVC decides whether a stage is up to date by comparing the MD5 hashes recorded in dvc.lock against the current files; timestamps are only a shortcut for deciding when to rehash. The usual cause of spurious full reruns is a missing or uncommitted dvc.lock: without it, DVC has no record of the last run. Commit dvc.lock to Git and avoid modifying tracked files outside of dvc repro.
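The content-addressed idea is easy to demonstrate: a digest depends only on the bytes, not on the timestamp. A standalone sketch of the principle (not DVC's actual code):

```python
# Content hashing ignores timestamps: identical bytes always yield the
# same digest, which is the signal DVC ultimately trusts for "changed".
import hashlib


def file_md5(path):
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()
```

Bumping a file's mtime with os.utime leaves its digest unchanged, so timestamp drift alone never changes the hash recorded for a dependency.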