Set Up DVC with an S3 Remote

DVC tracks large files – models, datasets, artifacts – outside of Git. Git stores lightweight .dvc pointer files, while the actual binary blobs live in a remote like S3. When you check out any commit, dvc checkout restores the exact model that was produced at that point in time.
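This works because DVC's cache is content-addressed: an artifact is stored under a digest of its bytes (MD5 by default), which is what the pointer files record. A simplified sketch of the idea, not DVC's actual on-disk layout:

```python
import hashlib

def cache_key(data: bytes) -> str:
    # Content addressing: the artifact's address is a digest of its
    # bytes, so identical content dedupes automatically across versions.
    return hashlib.md5(data).hexdigest()

v1 = cache_key(b"model weights, version 1")
v2 = cache_key(b"model weights, version 2")

print(v1 == cache_key(b"model weights, version 1"))  # True: same bytes, same entry
print(v1 == v2)                                      # False: new content, new entry
```

Two commits that reference the same hash share one blob in S3; a retrained model gets a new hash and a new blob.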

Start by installing DVC with S3 support and initializing it in a Git repo:

pip install "dvc[s3]>=3.50"

mkdir model-versioning-demo && cd model-versioning-demo
git init
dvc init

# Configure the S3 remote
dvc remote add -d models s3://my-ml-models-bucket/dvc-store

# Commit the DVC setup
git add .dvc .dvcignore
git commit -m "init dvc with s3 remote"

DVC uses the standard AWS credential chain. If aws s3 ls works, DVC will too. For fine-grained config, you can set credentials per remote:

dvc remote modify --local models access_key_id AKIAIOSFODNN7EXAMPLE
dvc remote modify --local models secret_access_key wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY

The --local flag writes to .dvc/config.local, which is gitignored by default; without it, the keys would be committed to the shared .dvc/config. Always use --local for secrets.
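After the commands above, the split looks roughly like this (remote name and bucket from this example; exact contents may vary slightly by DVC version):

```ini
# .dvc/config  -- committed to Git
[core]
    remote = models
['remote "models"']
    url = s3://my-ml-models-bucket/dvc-store

# .dvc/config.local  -- gitignored, machine-specific
['remote "models"']
    access_key_id = AKIAIOSFODNN7EXAMPLE
    secret_access_key = wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
```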

Write a Training Script

You need a real training script that produces a serialized model file. Here’s one that trains a RandomForestClassifier on the Iris dataset and dumps metrics alongside the model:

# train.py
import json
import joblib
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score

SEED = 42
N_ESTIMATORS = 100
MAX_DEPTH = 5

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=SEED
)

model = RandomForestClassifier(
    n_estimators=N_ESTIMATORS,
    max_depth=MAX_DEPTH,
    random_state=SEED,
)
model.fit(X_train, y_train)

preds = model.predict(X_test)
metrics = {
    "accuracy": float(accuracy_score(y_test, preds)),
    "f1_macro": float(f1_score(y_test, preds, average="macro")),
    "n_estimators": N_ESTIMATORS,
    "max_depth": MAX_DEPTH,
}

joblib.dump(model, "model.joblib")

with open("metrics.json", "w") as f:
    json.dump(metrics, f, indent=2)

print(f"accuracy={metrics['accuracy']:.4f}  f1={metrics['f1_macro']:.4f}")

Run it once to verify it produces model.joblib and metrics.json:

python train.py
# accuracy=0.9667  f1=0.9665

Define a DVC Pipeline

A dvc.yaml file declares your pipeline stages – what runs, what it depends on, and what it produces. This is better than manually running dvc add after each training run because DVC handles caching, dependency tracking, and reproducibility for you.

# dvc.yaml
stages:
  train:
    cmd: python train.py
    deps:
      - train.py
    params:
      - train.py:
          - SEED
          - N_ESTIMATORS
          - MAX_DEPTH
    outs:
      - model.joblib
    metrics:
      - metrics.json:
          cache: false

The params section tells DVC to watch specific Python variables in train.py. If you change N_ESTIMATORS from 100 to 200, dvc repro detects the change and reruns the stage. The metrics section with cache: false keeps metrics.json in Git directly so you can diff metrics across commits without pulling from S3.

Run the pipeline and push artifacts to S3:

dvc repro
git add dvc.yaml dvc.lock metrics.json .gitignore
git commit -m "train v1: rf n_estimators=100 depth=5"
git tag v1.0.0

dvc push

dvc.lock records the exact hashes of every input and output. Combined with the git commit, you have a fully reproducible snapshot.
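For reference, the generated dvc.lock looks roughly like this (hashes and sizes below are placeholders; yours will be real digests):

```yaml
schema: '2.0'
stages:
  train:
    cmd: python train.py
    deps:
      - path: train.py
        md5: <hash-of-train.py>
        size: <bytes>
    params:
      train.py:
        MAX_DEPTH: 5
        N_ESTIMATORS: 100
        SEED: 42
    outs:
      - path: model.joblib
        md5: <hash-of-model.joblib>
        size: <bytes>
```

The md5 of model.joblib is the key DVC uses to find the blob in S3, which is why a git checkout plus dvc checkout can restore it exactly.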

Switch Between Model Versions

This is where DVC shines. Every git commit points to a dvc.lock that knows the exact hash of model.joblib. Switching between versions is two commands:

# Go back to a previous model version
git checkout v1.0.0
dvc checkout

# model.joblib is now the exact file from v1.0.0
python -c "import joblib; m = joblib.load('model.joblib'); print(m.n_estimators)"
# 100

To iterate on a new version, go back to your branch, change the hyperparameters, and rerun:

git checkout main
dvc checkout

Edit train.py – change N_ESTIMATORS = 200 and MAX_DEPTH = 10 – then run the pipeline again:

dvc repro
dvc push

git add dvc.lock metrics.json train.py
git commit -m "train v2: rf n_estimators=200 depth=10"
git tag v2.0.0

Compare metrics between tags directly with DVC:

dvc metrics diff v1.0.0 v2.0.0

This prints a table showing how accuracy and F1 changed between the two versions.

Use the DVC Python API

If you want to load a model version programmatically – say, inside a serving endpoint or a CI job – you can pull artifacts without a full checkout:

# load_model.py
import dvc.api
import joblib

# Get the S3 URL for model.joblib at a specific git tag
url = dvc.api.get_url("model.joblib", repo=".", rev="v1.0.0")
print(f"Model stored at: {url}")

# Read the raw bytes for a specific version
with dvc.api.open("model.joblib", repo=".", rev="v1.0.0", mode="rb") as f:
    model = joblib.load(f)

print(f"Loaded model with {model.n_estimators} estimators")
print(f"Classes: {model.classes_}")

# Fetch params used in that version
params = dvc.api.params_show(rev="v1.0.0")
print(f"Training params: {params}")

dvc.api.open streams the file from the remote without first materializing it in your workspace. This is useful when you just need to load the model in a microservice and don’t want to clone a full checkout or run dvc pull.

For CI/CD, you can also use the CLI to fetch a single file:

# Pull just the model file from S3 at a specific tag
dvc get . model.joblib --rev v2.0.0 -o model_v2.joblib
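In a CI job that pattern might look like the following hypothetical GitHub Actions step (the workflow layout and secret names are placeholders for your own setup):

```yaml
# .github/workflows/fetch-model.yml (hypothetical)
jobs:
  fetch-model:
    runs-on: ubuntu-latest
    env:
      AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
      AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
    steps:
      - uses: actions/checkout@v4
      - run: pip install "dvc[s3]>=3.50"
      - run: dvc get . model.joblib --rev v2.0.0 -o model_v2.joblib
```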

Release Management with Git Tags

A clean release workflow pairs git tags with DVC pushes. Here’s a pattern that works well:

# After training and validating a new model
dvc repro
dvc push

# Commit everything
git add dvc.lock metrics.json train.py
git commit -m "train v3: rf n_estimators=300 depth=8"

# Tag with semver and annotate with metrics
git tag -a v3.0.0 -m "accuracy=0.9833 f1=0.9831 n_estimators=300"

# List all model versions
git tag -l "v*" --sort=-version:refname

In a team setting, you push tags to the shared repo and DVC artifacts to S3. Anyone can then pull a specific version:

git checkout v3.0.0
dvc pull
# model.joblib is now v3.0.0's artifact

Garbage collection keeps your S3 costs under control. DVC can delete cached artifacts that aren’t referenced by any branch or tag:

dvc gc --all-branches --all-tags --cloud

This removes unreferenced files from both the local cache and S3. Note that the default scope, --workspace, keeps only what the current checkout needs and would delete older tagged versions, so pass --all-branches --all-tags when tags are your release history. With --cloud, deletion is permanent.

Common Errors and Fixes

ERROR: failed to push ... 403 Forbidden – Your AWS credentials don’t have s3:PutObject permission on the bucket. Check the IAM policy attached to the user or role. You need at least s3:GetObject, s3:PutObject, s3:ListBucket, and s3:DeleteObject.
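A minimal IAM policy sketch covering those actions (bucket and prefix from this example; adjust to your own):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
      "Resource": "arn:aws:s3:::my-ml-models-bucket/dvc-store/*"
    },
    {
      "Effect": "Allow",
      "Action": "s3:ListBucket",
      "Resource": "arn:aws:s3:::my-ml-models-bucket"
    }
  ]
}
```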

ERROR: output 'model.joblib' is already tracked by Git – You ran git add model.joblib before DVC could manage it. Remove it from Git tracking first, then let DVC take over:

git rm --cached model.joblib
git commit -m "stop tracking model.joblib in git"
dvc repro  # or dvc add model.joblib for a file outside a pipeline

dvc repro says "Stage 'train' didn't change" – DVC checks file hashes and params. If nothing changed, it won't rerun. Force it with dvc repro --force, or verify that the param you changed is actually listed in the params section of dvc.yaml.

ERROR: failed to pull ... Cache 'abc123' not found – Someone committed a dvc.lock without running dvc push. The hash exists in the lock file but the actual artifact never made it to S3. The author needs to run dvc push from their machine.

Slow pushes/pulls – DVC parallelizes transfers, but the default job count may be too conservative for your network. Raise it per remote (or per command with dvc push --jobs 8):

dvc remote modify models jobs 8