Set Up DVC with an S3 Remote
DVC tracks large files – models, datasets, artifacts – outside of Git. Git stores lightweight .dvc pointer files, while the actual binary blobs live in a remote like S3. When you check out any commit, dvc checkout restores the exact model that was produced at that point in time.
Start by installing DVC with S3 support and initializing it in a Git repo:
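For example (the bucket name and remote name here are placeholders; use your own):

```bash
pip install "dvc[s3]"

git init                 # skip if this is already a Git repo
dvc init
git commit -m "Initialize DVC"

# -d makes this the default remote
dvc remote add -d storage s3://my-bucket/dvc-store
git add .dvc/config
git commit -m "Configure S3 remote"
```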
DVC uses the standard AWS credential chain. If aws s3 ls works, DVC will too. For fine-grained config, you can set credentials per remote:
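For instance, to commit a shared region setting while keeping keys out of Git (the key values shown are placeholders):

```bash
# shared, committed config
dvc remote modify storage region us-east-1

# secrets: --local writes to .dvc/config.local instead of .dvc/config
dvc remote modify --local storage access_key_id 'AKIA...'
dvc remote modify --local storage secret_access_key '...'
```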
Use the --local flag for secrets. That writes to .dvc/config.local, which is gitignored by default.
Write a Training Script
You need a real training script that produces a serialized model file. Here’s one that trains a RandomForestClassifier on the Iris dataset and dumps metrics alongside the model:
Run it once to verify it produces model.joblib and metrics.json:
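For example:

```bash
python train.py
ls model.joblib metrics.json
cat metrics.json
```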
Define a DVC Pipeline
A dvc.yaml file declares your pipeline stages – what runs, what it depends on, and what it produces. This is better than manually running dvc add after each training run because DVC handles caching, dependency tracking, and reproducibility for you.
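A dvc.yaml matching the script above might look like this; DVC can read top-level variables straight from a .py file when it is listed as a params file:

```yaml
stages:
  train:
    cmd: python train.py
    deps:
      - train.py
    params:
      - train.py:
          - N_ESTIMATORS
          - MAX_DEPTH
    outs:
      - model.joblib
    metrics:
      - metrics.json:
          cache: false
```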
The params section tells DVC to watch specific Python variables in train.py. If you change N_ESTIMATORS from 100 to 200, dvc repro detects the change and reruns the stage. The metrics section with cache: false keeps metrics.json in Git directly so you can diff metrics across commits without pulling from S3.
Run the pipeline and push artifacts to S3:
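For example (tagging the commit is optional here, but it makes checking out this version later much easier):

```bash
dvc repro
git add dvc.yaml dvc.lock metrics.json .gitignore
git commit -m "Train baseline model"
git tag v1.0
dvc push
```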
dvc.lock records the exact hashes of every input and output. Combined with the git commit, you have a fully reproducible snapshot.
Switch Between Model Versions
This is where DVC shines. Every git commit points to a dvc.lock that knows the exact hash of model.joblib. Switching between versions is two commands:
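Assuming the baseline commit was tagged v1.0:

```bash
git checkout v1.0   # or a raw commit hash
dvc checkout        # restores the matching model.joblib from cache/S3
```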
To iterate on a new version, first go back to your branch:
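Assuming the working branch is named main:

```bash
git checkout main
dvc checkout
```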
Edit train.py – change N_ESTIMATORS = 200 and MAX_DEPTH = 10 – then run the pipeline again:
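Something like:

```bash
dvc repro
git add train.py dvc.lock metrics.json
git commit -m "Bump n_estimators and max_depth"
git tag v2.0
dvc push
```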
Compare metrics between tags directly with DVC:
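Using the tags from the steps above:

```bash
dvc metrics diff v1.0 v2.0
```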
This prints a table showing how accuracy and F1 changed between the two versions.
Use the DVC Python API
If you want to load a model version programmatically – say, inside a serving endpoint or a CI job – you can pull artifacts without a full checkout:
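A sketch, with the repo URL as a placeholder; rev accepts any tag, branch, or commit:

```python
import joblib

import dvc.api

# Open model.joblib as it existed at tag v1.0, streamed from the S3 remote
with dvc.api.open(
    "model.joblib",
    repo="https://github.com/your-org/your-repo",  # placeholder repo URL
    rev="v1.0",
    mode="rb",
) as f:
    model = joblib.load(f)

# One Iris-style feature row, just to show the loaded model is usable
print(model.predict([[5.1, 3.5, 1.4, 0.2]]))
```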
dvc.api.open streams the file directly from S3 without materializing it to disk. This is useful when you just need to load the model in a microservice and don’t want to clone the repo or run dvc pull.
For CI/CD, you can also use the CLI to fetch a single file:
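Again with a placeholder repo URL:

```bash
dvc get https://github.com/your-org/your-repo model.joblib --rev v1.0 -o model.joblib
```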
Release Management with Git Tags
A clean release workflow pairs git tags with DVC pushes. Here’s a pattern that works well:
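One possible shape for the workflow, with the version number and tag message as examples:

```bash
dvc repro
git add dvc.yaml dvc.lock metrics.json
git commit -m "Release model v2.1.0"
git tag -a v2.1.0 -m "RandomForest, n_estimators=200, max_depth=10"
dvc push                       # artifacts first ...
git push origin main --tags    # ... then the pointers
```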
In a team setting, you push tags to the shared repo and DVC artifacts to S3. Anyone can then pull a specific version:
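For example:

```bash
git fetch --tags
git checkout v2.1.0
dvc pull
```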
Garbage collection keeps your S3 costs under control. DVC can clean up artifacts not referenced by any current branch or tag:
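For example:

```bash
# keep everything referenced by any branch or tag; delete the rest,
# both from the local cache and (--cloud) from S3
dvc gc --all-branches --all-tags --cloud
```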
This removes cached files from both local storage and S3 that aren't referenced by any branch or tag, so only truly orphaned artifacts are deleted.
Common Errors and Fixes
ERROR: failed to push ... 403 Forbidden – Your AWS credentials don’t have s3:PutObject permission on the bucket. Check the IAM policy attached to the user or role. You need at least s3:GetObject, s3:PutObject, s3:ListBucket, and s3:DeleteObject.
ERROR: output 'model.joblib' is already tracked by Git – You ran git add model.joblib before DVC could manage it. Remove it from Git tracking first:
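Something like:

```bash
git rm -r --cached model.joblib
git commit -m "Stop tracking model.joblib in Git"
dvc repro   # let DVC take over the output
```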
dvc repro says "Stage 'train' didn't change" – DVC checks file hashes and params. If nothing changed, it won't rerun. Force it with dvc repro --force. Or verify that the param you changed is actually listed in the params section of dvc.yaml.
ERROR: failed to pull ... Cache 'abc123' not found – Someone committed a dvc.lock without running dvc push. The hash exists in the lock file but the actual artifact never made it to S3. The author needs to run dvc push from their machine.
Slow pushes/pulls – Transfers are capped by DVC's default number of parallel jobs. Raise the limit:
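For example:

```bash
# persist a higher job count on the remote ...
dvc remote modify storage jobs 16
# ... or raise it for a single invocation
dvc push --jobs 16
```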
Related Guides
- How to Build a Model Configuration Management Pipeline with Hydra
- How to Build a Model Compression Pipeline with Pruning and Quantization
- How to Build a Model Dependency Scanner and Vulnerability Checker
- How to Build a Model Feature Store Pipeline with Redis and FastAPI
- How to Build a Model Performance Alerting Pipeline with Webhooks
- How to Build a Model Health Dashboard with FastAPI and SQLite
- How to Build a Model Batch Inference Pipeline with Ray and Parquet
- How to Build a Model Metadata Store with SQLite and FastAPI
- How to Build a Model Input Validation Pipeline with Pydantic and FastAPI
- How to Build a Model Drift Alert Pipeline with Evidently and FastAPI