## Why a Model CI Pipeline Matters
Every ML team hits the same wall: someone trains a model locally, drops a pickle file in Slack, and nobody can reproduce the results two weeks later. A model CI pipeline fixes this. You define your train and evaluate steps in a dvc.yaml file, and GitHub Actions runs dvc repro on every pull request. CML (Continuous Machine Learning) posts the metrics diff right on the PR so reviewers see exactly how accuracy changed.
Here’s the project structure you’ll end up with:
```
my-ml-project/
├── .github/workflows/model-ci.yml
├── data/
│   └── train.csv          # tracked by DVC
├── src/
│   ├── train.py
│   └── evaluate.py
├── dvc.yaml
├── dvc.lock
├── params.yaml
└── requirements.txt
```
## Set Up DVC to Track Data and Models
Start by initializing DVC in your Git repo and configuring a remote. If you already have DVC set up, skip to the next section.
```bash
pip install dvc dvc-s3
cd my-ml-project
dvc init

# Point DVC to your S3 bucket (or GCS, Azure, etc.)
dvc remote add -d storage s3://my-ml-bucket/dvc-store

# Track your training data
dvc add data/train.csv

# Commit the pointer files (dvc.lock doesn't exist yet -- it's written by dvc repro)
git add data/train.csv.dvc data/.gitignore .dvc
git commit -m "track training data with dvc"
dvc push
```
DVC creates a .dvc pointer file for train.csv and stores the actual data in S3. Git never sees the raw data – just the hash reference.
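For reference, the generated pointer file is just a few lines of YAML. The exact fields vary slightly by DVC version; the hash and size below are placeholders:

```yaml
# data/train.csv.dvc (illustrative -- your hash and size will differ)
outs:
  - md5: 1a2b3c4d5e6f7a8b9c0d1e2f3a4b5c6d
    size: 1048576
    path: train.csv
```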
## Define a DVC Pipeline
The dvc.yaml file declares your pipeline stages. Each stage specifies its dependencies, command, outputs, and metrics. DVC uses this to figure out what needs rerunning when inputs change.
```yaml
# dvc.yaml
stages:
  train:
    cmd: python src/train.py
    deps:
      - src/train.py
      - data/train.csv
    params:
      - train.n_estimators
      - train.max_depth
      - train.test_size
    outs:
      - models/model.pkl
    plots:
      - results/training_loss.csv:
          x: epoch
          y: loss
  evaluate:
    cmd: python src/evaluate.py
    deps:
      - src/evaluate.py
      - models/model.pkl
      - data/train.csv
    metrics:
      - results/metrics.json:
          cache: false
    plots:
      - results/confusion_matrix.csv:
          x: predicted
          y: actual
```

Note that params.yaml is tracked through the params section, so it doesn't need to be listed again under deps -- the params entries give DVC granular, per-key change detection.
Your params.yaml holds the hyperparameters:
```yaml
# params.yaml
train:
  n_estimators: 200
  max_depth: 10
  test_size: 0.2
```
And here’s a minimal src/train.py that reads those params:
```python
# src/train.py
import os
import pickle

import pandas as pd
import yaml
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Load params
with open("params.yaml") as f:
    params = yaml.safe_load(f)["train"]

# Load data
df = pd.read_csv("data/train.csv")
X = df.drop(columns=["target"])
y = df["target"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=params["test_size"], random_state=42
)

# Train
clf = RandomForestClassifier(
    n_estimators=params["n_estimators"],
    max_depth=params["max_depth"],
    random_state=42,
)
clf.fit(X_train, y_train)

# Save model
os.makedirs("models", exist_ok=True)
with open("models/model.pkl", "wb") as f:
    pickle.dump(clf, f)

# Save training loss placeholder (sklearn doesn't expose per-epoch losses)
os.makedirs("results", exist_ok=True)
pd.DataFrame({"epoch": [1], "loss": [1 - clf.score(X_train, y_train)]}).to_csv(
    "results/training_loss.csv", index=False
)
```
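As a quick sanity check that the pickled model round-trips cleanly, here's an illustrative snippet using a synthetic dataset rather than data/train.csv:

```python
import pickle

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Train a tiny model on synthetic data, then verify the save/load round-trip
X, y = make_classification(n_samples=100, n_features=5, random_state=42)
clf = RandomForestClassifier(n_estimators=10, random_state=42).fit(X, y)

blob = pickle.dumps(clf)
restored = pickle.loads(blob)

# The restored model predicts identically to the original
assert (restored.predict(X) == clf.predict(X)).all()
```

One caveat worth knowing: pickle files are only reliably loadable with the same scikit-learn version that wrote them, which is another reason to pin versions in requirements.txt.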
The evaluate script loads the model and writes metrics:
```python
# src/evaluate.py
import json
import os
import pickle

import pandas as pd
import yaml
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score
from sklearn.model_selection import train_test_split

with open("params.yaml") as f:
    params = yaml.safe_load(f)["train"]

# Recreate the same train/test split as train.py (same seed and test_size)
df = pd.read_csv("data/train.csv")
X = df.drop(columns=["target"])
y = df["target"]
_, X_test, _, y_test = train_test_split(
    X, y, test_size=params["test_size"], random_state=42
)

with open("models/model.pkl", "rb") as f:
    clf = pickle.load(f)

preds = clf.predict(X_test)
os.makedirs("results", exist_ok=True)

# Write metrics
metrics = {
    "accuracy": round(accuracy_score(y_test, preds), 4),
    "f1_score": round(f1_score(y_test, preds, average="weighted"), 4),
}
with open("results/metrics.json", "w") as f:
    json.dump(metrics, f, indent=2)

# Write confusion matrix in long format for plotting
labels = sorted(y.unique())
cm = confusion_matrix(y_test, preds, labels=labels)
rows = []
for i, actual_label in enumerate(labels):
    for j, pred_label in enumerate(labels):
        rows.append(
            {"actual": actual_label, "predicted": pred_label, "count": int(cm[i][j])}
        )
pd.DataFrame(rows).to_csv("results/confusion_matrix.csv", index=False)
```
Run `dvc repro` locally to make sure the pipeline works. DVC executes `train`, then `evaluate`, caches the outputs, and writes dvc.lock with the exact hashes. Commit everything:
```bash
dvc repro
git add dvc.yaml dvc.lock params.yaml results/metrics.json
git commit -m "add dvc pipeline"
```
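For reference, dvc.lock records a content hash for every dependency and output of each stage. A trimmed, illustrative fragment (hashes and sizes are placeholders, and field names can vary slightly by DVC version):

```yaml
# dvc.lock (trimmed, illustrative)
stages:
  train:
    cmd: python src/train.py
    deps:
      - path: data/train.csv
        md5: 1a2b3c4d5e6f7a8b9c0d1e2f3a4b5c6d
        size: 1048576
    outs:
      - path: models/model.pkl
        md5: 9f8e7d6c5b4a3f2e1d0c9b8a7f6e5d4c
        size: 524288
```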
## GitHub Actions Workflow with CML
This is where it all comes together. The workflow runs dvc repro on every PR, compares metrics against the base branch, and posts a report as a PR comment using CML.
Create .github/workflows/model-ci.yml:
```yaml
# .github/workflows/model-ci.yml
name: Model CI

on:
  pull_request:
    branches: [main]

permissions:
  contents: read
  pull-requests: write

jobs:
  train-evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0  # needed for dvc diff against the base branch
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
          cache: pip
      - name: Install dependencies
        run: |
          pip install -r requirements.txt
          pip install dvc dvc-s3
      - uses: iterative/setup-cml@v2
      - name: Pull DVC data
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: dvc pull
      - name: Reproduce pipeline
        run: dvc repro
      - name: Generate CML report
        env:
          REPO_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: |
          echo "## Model CI Report" > report.md
          echo "" >> report.md
          # Metrics comparison table against the base branch
          echo "### Metrics (this branch vs main)" >> report.md
          dvc metrics diff origin/main --md >> report.md || echo "No previous metrics to compare." >> report.md
          echo "" >> report.md
          # Parameter changes
          echo "### Parameter Changes" >> report.md
          dvc params diff origin/main --md >> report.md || echo "No parameter changes." >> report.md
          echo "" >> report.md
          # Plot data (published as downloadable links)
          echo "### Confusion Matrix" >> report.md
          cml asset publish results/confusion_matrix.csv --md >> report.md || true
          echo "" >> report.md
          echo "### Training Loss" >> report.md
          cml asset publish results/training_loss.csv --md >> report.md || true
          echo "" >> report.md
          # Post the report as a PR comment
          cml comment create report.md
```
A few things to note about this workflow:
- `fetch-depth: 0` is required so DVC can compare metrics between the PR branch and main. Without it, `dvc metrics diff` has nothing to diff against.
- `dvc pull` fetches the actual data files from your remote storage. You need the AWS credentials (or the equivalent for GCS/Azure) stored as GitHub Secrets.
- `dvc repro` runs only the stages whose dependencies changed. If you only tweaked params.yaml, it reruns `train` and `evaluate` but skips everything else.
- `cml comment create` posts the entire report.md as a comment on the PR. Reviewers see accuracy, F1, and parameter diffs without leaving GitHub.
## Comparing Metrics Between Branches
The real power shows up when you open a PR that changes hyperparameters. Say you create a branch and bump n_estimators from 200 to 500:
```bash
git checkout -b experiment/more-trees
```
Edit params.yaml:
```yaml
# params.yaml
train:
  n_estimators: 500
  max_depth: 10
  test_size: 0.2
```
Push the branch, open a PR, and the workflow fires. The dvc metrics diff output in your PR comment looks like:
| Path                 | Metric   | HEAD   | main   | Change |
|----------------------|----------|--------|--------|--------|
| results/metrics.json | accuracy | 0.9312 | 0.9185 | 0.0127 |
| results/metrics.json | f1_score | 0.9298 | 0.9170 | 0.0128 |
Reviewers see at a glance that accuracy improved by 1.27 percentage points. The parameter diff shows exactly what changed. No more guessing which knobs were turned.
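Conceptually, the diff is just a key-by-key comparison of the two metrics.json payloads. A minimal sketch of the idea (not DVC's actual implementation):

```python
import json


def metrics_diff(old: dict, new: dict) -> dict:
    """Report old value, new value, and change for each metric, like
    `dvc metrics diff` does. Illustrative only -- not DVC's real code."""
    return {
        key: {
            "old": old[key],
            "new": new[key],
            "diff": round(new[key] - old[key], 4),
        }
        for key in new
        if key in old and new[key] != old[key]
    }


main_metrics = {"accuracy": 0.9185, "f1_score": 0.9170}
branch_metrics = {"accuracy": 0.9312, "f1_score": 0.9298}

diff = metrics_diff(main_metrics, branch_metrics)
print(json.dumps(diff, indent=2))
```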
You can also run dvc plots diff locally to generate HTML comparison plots:
```bash
dvc plots diff main --open
```
This opens a browser with side-by-side plots for the confusion matrix and training loss between your branch and main.
## Common Errors and Fixes
ERROR: failed to reproduce 'train': data/train.csv not found
You forgot to run dvc pull before dvc repro. The data files aren’t in your working directory – they’re in remote storage. Add dvc pull to your workflow before the reproduce step, and make sure your cloud credentials are set.
dvc metrics diff returns empty output
This happens when fetch-depth is set to 1 (the default for actions/checkout). DVC needs the full Git history to find the metrics file on the base branch. Set fetch-depth: 0 in your checkout step.
cml comment create fails with 403
CML needs write access to pull requests. Add these permissions to your workflow:
```yaml
permissions:
  contents: read
  pull-requests: write
```
Also make sure you’re passing REPO_TOKEN: ${{ secrets.GITHUB_TOKEN }} to the step.
ERROR: failed to push data to remote
Check that your S3 bucket name and credentials are correct. Run dvc remote list to verify the remote URL. A common mistake is using a bucket name that doesn’t exist or using credentials that only have read access. Your IAM user needs s3:PutObject and s3:GetObject permissions on the bucket.
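A minimal IAM policy sketch granting that access (the bucket name is this guide's example; `s3:ListBucket` on the bucket itself is also typically needed for pull/push to work):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject"],
      "Resource": "arn:aws:s3:::my-ml-bucket/dvc-store/*"
    },
    {
      "Effect": "Allow",
      "Action": "s3:ListBucket",
      "Resource": "arn:aws:s3:::my-ml-bucket"
    }
  ]
}
```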
Pipeline runs every stage even when nothing changed
DVC tracks dependencies by their MD5 hash. If you’re running inside a container and file timestamps differ, DVC may think files changed. Pin your dvc.lock in Git and avoid touching tracked files outside of dvc repro.
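The key point is that DVC's change detection is content-based, not timestamp-based. A simplified sketch of the idea (DVC's real implementation adds directory handling and a local cache):

```python
import hashlib
from pathlib import Path


def file_md5(path: str) -> str:
    """Hash file contents in chunks -- the same kind of content hash
    DVC uses to decide whether a stage needs rerunning."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()


# Same bytes -> same hash, regardless of mtime
p = Path("demo.csv")
p.write_text("a,b\n1,2\n")
before = file_md5("demo.csv")
p.touch()  # update the timestamp only
after = file_md5("demo.csv")
assert before == after  # content unchanged, so the stage would not rerun
```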