Quick Setup and First Tracked Dataset
DVC (Data Version Control) works like Git for data. You keep your code in Git and your large files – datasets, models, artifacts – in a separate storage backend, with .dvc files acting as lightweight pointers that Git tracks. When you check out a commit, DVC knows exactly which version of your data belongs with that code.
Install DVC 3.x and initialize it in an existing Git repo:
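A minimal sketch; the `[s3]` extra is only needed if you plan to use an S3 remote, and the repo path is a placeholder:

```shell
# Install DVC 3.x (add extras like [s3], [gs] for specific remotes)
pip install "dvc[s3]"

# From the root of an existing Git repository
cd my-project
dvc init
```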
dvc init creates a .dvc/ directory with config files and a cache. Commit this right away:
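`dvc init` stages its own files, so committing is usually all that remains (paths listed explicitly here for clarity):

```shell
git add .dvc/config .dvc/.gitignore .dvcignore
git commit -m "Initialize DVC"
```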
Now add a dataset. Suppose you have a data/ directory with training CSVs:
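A sketch of tracking the directory; `dvc add` writes a `data.dvc` pointer file and adds `data/` to `.gitignore` for you:

```shell
dvc add data/

# Commit the pointer file, not the data itself
git add data.dvc .gitignore
git commit -m "Track data/ with DVC"
```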
That’s it. Your data files live in the DVC cache, and Git only tracks the small .dvc file containing the MD5 hash of your data. The actual files stay out of Git history entirely.
Configuring Remote Storage
A DVC remote is where your data lives when it’s not on your local machine. Think of it like a Git remote, but for data. You need one if you want to share datasets across machines or with teammates.
S3
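A sketch; `storage` is an arbitrary remote name and the bucket path is a placeholder:

```shell
dvc remote add -d storage s3://my-bucket/dvc-store
```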
DVC picks up AWS credentials from the standard chain: environment variables (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY), ~/.aws/credentials, or an IAM role. No extra config needed if your AWS CLI already works.
Google Cloud Storage
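The same pattern with a GCS bucket (placeholder path):

```shell
dvc remote add -d storage gs://my-bucket/dvc-store
```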
Authentication uses GOOGLE_APPLICATION_CREDENTIALS or whatever gcloud auth has configured.
Local / Network Storage
For quick setups or air-gapped environments, point to a local directory or NFS mount:
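For example, with a hypothetical NFS mount point:

```shell
dvc remote add -d localstore /mnt/shared/dvc-store
```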
The -d flag sets this as the default remote. Commit the config after adding it:
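```shell
git add .dvc/config
git commit -m "Configure DVC remote"
```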
If you need to store credentials separately from the shared config, use --local:
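For an S3 remote, per-machine credentials might look like this (the key values are placeholders):

```shell
dvc remote modify --local storage access_key_id 'AKIA...'
dvc remote modify --local storage secret_access_key '...'
```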
This writes to .dvc/config.local, which is .gitignored by default. Never commit credentials into .dvc/config.
Pushing and Pulling Data
Once a remote is configured, push your cached data to it:
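```shell
dvc push              # uploads cached data to the default remote
dvc push -r storage   # or target a specific remote by name
```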
On another machine (or after a fresh git clone), pull data back:
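A sketch with a placeholder repository URL:

```shell
git clone <repo-url>
cd <repo>
dvc pull    # downloads from the remote and populates the workspace
```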
dvc pull is equivalent to running dvc fetch (downloads to cache) followed by dvc checkout (links from cache to workspace). If you just want to pre-fetch data without placing it in the working directory, use dvc fetch on its own.
Switching Between Dataset Versions
This is where DVC earns its keep. Say you’ve iterated on your dataset – cleaned duplicates, added new samples, fixed labels. Each version was committed to Git alongside its .dvc file:
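A sketch of switching versions, assuming the dataset is tracked by a `data.dvc` pointer file:

```shell
# See how the dataset evolved over time
git log --oneline -- data.dvc

# Move the pointer back to an earlier commit...
git checkout <commit> -- data.dvc

# ...then sync the workspace data to match it
dvc checkout
```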
The key insight: git checkout updates the .dvc pointer file, and dvc checkout updates the actual data to match that pointer. If the data version isn’t in your local cache, you’ll need dvc pull instead of dvc checkout.
Tag your data milestones so switching is easy:
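For example, with a hypothetical tag name:

```shell
git tag -a data-v1.0 -m "Cleaned and deduplicated dataset"

# Later, jump straight back to that milestone
git checkout data-v1.0
dvc checkout
```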
Building Reproducible Pipelines with dvc.yaml
Beyond simple data tracking, DVC can manage your entire ML pipeline. Define stages in a dvc.yaml file, and DVC handles dependency tracking and caching automatically.
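A sketch of a three-stage pipeline (prepare, featurize, train); the script names and paths are illustrative:

```yaml
stages:
  prepare:
    cmd: python src/prepare.py data/raw.csv data/prepared
    deps:
      - src/prepare.py
      - data/raw.csv
    outs:
      - data/prepared
  featurize:
    cmd: python src/featurize.py data/prepared data/features
    deps:
      - src/featurize.py
      - data/prepared
    outs:
      - data/features
  train:
    cmd: python src/train.py data/features models/model.pkl
    deps:
      - src/train.py
      - data/features
    params:
      - train.epochs
      - train.lr
    outs:
      - models/model.pkl
    metrics:
      - metrics.json:
          cache: false
```

Setting `cache: false` on the metrics file keeps it in Git rather than the DVC cache, so metrics history travels with your commits.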
Parameters come from a params.yaml file:
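An illustrative `params.yaml` matching the `params` keys above (values are placeholders):

```yaml
prepare:
  split: 0.2
train:
  epochs: 10
  lr: 0.001
```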
Run the whole pipeline:
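```shell
dvc repro    # runs every stage whose inputs changed, in DAG order
```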
DVC builds a DAG from the stage dependencies and only re-runs stages whose inputs changed. If you edit src/train.py but data/raw.csv hasn’t changed, DVC skips prepare and featurize and jumps straight to train. This saves significant time on large datasets.
After running, DVC creates a dvc.lock file that records the exact hashes of every dependency and output. Commit both files:
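```shell
git add dvc.yaml dvc.lock
git commit -m "Add training pipeline"
```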
Now anyone can clone the repo and run dvc repro to reproduce your exact results.
Comparing Experiments with Metrics
DVC tracks metrics files across commits, so you can compare model performance across different versions:
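```shell
dvc metrics show    # print the current metrics file values
```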
When you change a hyperparameter in params.yaml and run dvc repro, the new metrics get committed alongside the parameter change. Your entire experimental history – data version, code, parameters, and results – lives in Git.
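A sketch of comparing the workspace against the previous commit:

```shell
dvc params diff HEAD~1    # which hyperparameters changed
dvc metrics diff HEAD~1   # and how the metrics moved
```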
Complete Workflow Example
Here’s the full cycle from raw data to a versioned, reproducible experiment:
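A condensed sketch assuming the file names used earlier in this guide (bucket path and commit messages are placeholders):

```shell
# One-time setup
git init my-project && cd my-project
dvc init
git commit -m "Initialize DVC"

# Track the raw data and configure a remote
dvc add data/raw.csv
git add data/raw.csv.dvc data/.gitignore
dvc remote add -d storage s3://my-bucket/dvc-store
git add .dvc/config
git commit -m "Track raw data, configure remote"
dvc push

# Define stages in dvc.yaml, then run and version the experiment
dvc repro
git add dvc.yaml dvc.lock metrics.json
git commit -m "First experiment"
dvc push

# Iterate: edit params.yaml, re-run, compare, commit
dvc repro
dvc metrics diff HEAD
git commit -am "Experiment: higher learning rate"
dvc push
```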
Common Errors and Fixes
ERROR: failed to push data to the cloud
Usually a permissions or credentials issue. Verify your remote is configured and accessible:
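```shell
dvc remote list                       # is a default remote configured?
dvc push -v                           # verbose output shows the failing request
aws s3 ls s3://my-bucket/dvc-store    # for S3: can you reach the bucket at all?
```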
For S3, make sure your IAM user has s3:PutObject and s3:GetObject permissions on the bucket. For GCS, check that your service account has Storage Object Admin.
If pushes fail intermittently on large datasets, DVC tracks what succeeded – just re-run dvc push and it retries only the failed files.
ERROR: failed to pull data from the cloud
This happens when the data was never pushed, or the remote doesn’t have the version you need. Check:
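```shell
dvc status -c    # compares the local cache against the remote
```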
If it shows files as “missing from remote,” someone ran dvc add and committed the .dvc file but never ran dvc push. Track down the commit author and have them push.
ERROR: not a DVC repository
You’re running DVC commands outside a DVC-initialized directory. DVC looks for a .dvc/ directory by walking up from the current working directory. Either cd into the repo or run dvc init first.
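A quick check (the repo path is a placeholder):

```shell
ls -d .dvc 2>/dev/null || echo "not a DVC repository"

cd path/to/your/repo    # move into the initialized repo, or:
dvc init                # initialize DVC if this repo never had it
```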
WARNING: Cache 'xxxxx' not found
The data hash in your .dvc file points to a cache entry that doesn’t exist locally. This is normal after a fresh clone. Run dvc pull to fetch from the remote. If the remote also doesn’t have it, the data was never pushed.
dvc checkout does nothing
Usually the .dvc pointer file is identical between the two revisions, or you checked out code paths without including the .dvc file. Confirm that the pointer actually changed:
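```shell
git diff HEAD~1 -- data.dvc       # if this is empty, there's nothing to check out

git checkout <commit> -- data.dvc # restore the pointer you wanted
dvc checkout                      # then sync the data to it
```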
Merge conflicts in .dvc files
When two branches modify the same dataset, the .dvc file can conflict just like any other Git file. Pick the version you want (or re-run dvc add on the merged data), then commit:
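A sketch of both resolution paths, assuming the conflicted pointer is `data.dvc`:

```shell
# Option 1: keep one side of the conflict
git checkout --theirs data.dvc    # or --ours
dvc checkout                      # sync the workspace to the chosen pointer

# Option 2: merge the data yourself, then re-hash it
dvc add data/

git add data.dvc
git commit -m "Resolve dataset merge"
```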
Tips Worth Knowing
Use .dvcignore to exclude files from DVC tracking. It works like .gitignore but for DVC commands. Handy for excluding temp files or logs inside tracked directories.
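For example, a `.dvcignore` at the repo root might look like this (patterns are illustrative):

```
# .dvcignore — same pattern syntax as .gitignore
*.log
*.tmp
tmp/
```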
Prefer dvc add on directories over individual files. When you dvc add data/, DVC creates a single .dvc file for the entire directory. This is cleaner than tracking dozens of individual files and makes switching versions atomic.
Set up DVC in CI. After cloning in your CI pipeline, run dvc pull to fetch the data your pipeline needs. Use dvc repro to verify that your pipeline still produces the same results. If metrics change, your CI can flag it.
Combine with Git branches for experiments. Create a branch per experiment, modify params.yaml, run dvc repro, and commit the results. Comparing experiments is just dvc metrics diff main experiment-branch.
Related Guides
- How to Validate ML Datasets with Great Expectations
- How to Create and Share Datasets on Hugging Face Hub
- How to Clean and Deduplicate ML Datasets with Python
- How to Stream Real-Time Data for ML with Apache Kafka
- How to Build a Feature Store for ML with Feast
- How to Build a Dataset Monitoring Pipeline with Great Expectations and Airflow
- How to Build a Dataset Bias Detection Pipeline with Python
- How to Build a Data Versioning Pipeline with Delta Lake for ML
- How to Build ETL Pipelines for ML Data with Apache Airflow
- How to Build a Data Drift Detection Pipeline with Whylogs