Quick Setup and First Tracked Dataset
DVC (Data Version Control) works like Git for data. You keep your code in Git and your large files – datasets, models, artifacts – in a separate storage backend, with .dvc files acting as lightweight pointers that Git tracks. When you check out a commit, DVC knows exactly which version of your data belongs with that code.
Install DVC 3.x and initialize it in an existing Git repo:
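A minimal sketch; the `[s3]` extra is only needed if you plan to use an S3 remote, and the repo path is a placeholder:

```shell
# Install DVC 3.x (add extras like [s3], [gs] for specific remotes)
pip install "dvc[s3]"

# From the root of an existing Git repository
cd my-project
dvc init
```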
dvc init creates a .dvc/ directory with config files and a cache. Commit this right away:
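`dvc init` stages its own files, so committing is usually all that remains (paths listed explicitly here for clarity):

```shell
git add .dvc/config .dvc/.gitignore .dvcignore
git commit -m "Initialize DVC"
```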
Now add a dataset. Suppose you have a data/ directory with training CSVs:
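A sketch of tracking the directory; `dvc add` writes a `data.dvc` pointer file and adds `data/` to `.gitignore` for you:

```shell
dvc add data/

# Commit the pointer file, not the data itself
git add data.dvc .gitignore
git commit -m "Track data/ with DVC"
```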
That’s it. Your data files live in the DVC cache, and Git only tracks the small .dvc file containing the MD5 hash of your data. The actual files stay out of Git history entirely.
Configuring Remote Storage
A DVC remote is where your data lives when it’s not on your local machine. Think of it like a Git remote, but for data. You need one if you want to share datasets across machines or with teammates.
S3
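A sketch; `storage` is an arbitrary remote name and the bucket path is a placeholder:

```shell
dvc remote add -d storage s3://my-bucket/dvc-store
```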
DVC picks up AWS credentials from the standard chain: environment variables (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY), ~/.aws/credentials, or an IAM role. No extra config needed if your AWS CLI already works.
Google Cloud Storage
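The same pattern with a GCS bucket (placeholder path):

```shell
dvc remote add -d storage gs://my-bucket/dvc-store
```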
Authentication uses GOOGLE_APPLICATION_CREDENTIALS or whatever gcloud auth has configured.
Local / Network Storage
For quick setups or air-gapped environments, point to a local directory or NFS mount:
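For example, with a hypothetical NFS mount point:

```shell
dvc remote add -d localstore /mnt/shared/dvc-store
```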
The -d flag sets this as the default remote. Commit the config after adding it:
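```shell
git add .dvc/config
git commit -m "Configure DVC remote"
```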
If you need to store credentials separately from the shared config, use --local:
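For an S3 remote, per-machine credentials might look like this (the key values are placeholders):

```shell
dvc remote modify --local storage access_key_id 'AKIA...'
dvc remote modify --local storage secret_access_key '...'
```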
This writes to .dvc/config.local, which is .gitignored by default. Never commit credentials into .dvc/config.
Pushing and Pulling Data
Once a remote is configured, push your cached data to it:
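```shell
dvc push              # uploads cached data to the default remote
dvc push -r storage   # or target a specific remote by name
```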
On another machine (or after a fresh git clone), pull data back:
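A sketch with a placeholder repository URL:

```shell
git clone <repo-url>
cd <repo>
dvc pull    # downloads from the remote and populates the workspace
```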
dvc pull is equivalent to running dvc fetch (downloads to cache) followed by dvc checkout (links from cache to workspace). If you just want to pre-fetch data without placing it in the working directory, use dvc fetch on its own.
Switching Between Dataset Versions
This is where DVC earns its keep. Say you’ve iterated on your dataset – cleaned duplicates, added new samples, fixed labels. Each version was committed to Git alongside its .dvc file:
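A sketch of switching versions, assuming the dataset is tracked by a `data.dvc` pointer file:

```shell
# See how the dataset evolved over time
git log --oneline -- data.dvc

# Move the pointer back to an earlier commit...
git checkout <commit> -- data.dvc

# ...then sync the workspace data to match it
dvc checkout
```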
The key insight: git checkout updates the .dvc pointer file, and dvc checkout updates the actual data to match that pointer. If the data version isn’t in your local cache, you’ll need dvc pull instead of dvc checkout.
Tag your data milestones so switching is easy:
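For example, with a hypothetical tag name:

```shell
git tag -a data-v1.0 -m "Cleaned and deduplicated dataset"

# Later, jump straight back to that milestone
git checkout data-v1.0
dvc checkout
```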
Building Reproducible Pipelines with dvc.yaml
Beyond simple data tracking, DVC can manage your entire ML pipeline. Define stages in a dvc.yaml file, and DVC handles dependency tracking and caching automatically.
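A sketch of a three-stage pipeline (prepare, featurize, train); the script names and paths are illustrative:

```yaml
stages:
  prepare:
    cmd: python src/prepare.py data/raw.csv data/prepared
    deps:
      - src/prepare.py
      - data/raw.csv
    outs:
      - data/prepared
  featurize:
    cmd: python src/featurize.py data/prepared data/features
    deps:
      - src/featurize.py
      - data/prepared
    outs:
      - data/features
  train:
    cmd: python src/train.py data/features models/model.pkl
    deps:
      - src/train.py
      - data/features
    params:
      - train.epochs
      - train.lr
    outs:
      - models/model.pkl
    metrics:
      - metrics.json:
          cache: false
```

Setting `cache: false` on the metrics file keeps it in Git rather than the DVC cache, so metrics history travels with your commits.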
Parameters come from a params.yaml file:
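An illustrative `params.yaml` matching the `params` keys above (values are placeholders):

```yaml
prepare:
  split: 0.2
train:
  epochs: 10
  lr: 0.001
```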
Run the whole pipeline:
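```shell
dvc repro    # runs every stage whose inputs changed, in DAG order
```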
DVC builds a DAG from the stage dependencies and only re-runs stages whose inputs changed. If you edit src/train.py but data/raw.csv hasn’t changed, DVC skips prepare and featurize and jumps straight to train. This saves significant time on large datasets.
After running, DVC creates a dvc.lock file that records the exact hashes of every dependency and output. Commit both files:
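```shell
git add dvc.yaml dvc.lock
git commit -m "Add training pipeline"
```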
Now anyone can clone the repo and run dvc repro to reproduce your exact results.
Comparing Experiments with Metrics
DVC tracks metrics files across commits, so you can compare model performance across different versions:
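```shell
dvc metrics show    # print the current metrics file values
```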
When you change a hyperparameter in params.yaml and run dvc repro, the new metrics get committed alongside the parameter change. Your entire experimental history – data version, code, parameters, and results – lives in Git.
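A sketch of comparing the workspace against the previous commit:

```shell
dvc params diff HEAD~1    # which hyperparameters changed
dvc metrics diff HEAD~1   # and how the metrics moved
```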
Complete Workflow Example
Here’s the full cycle from raw data to a versioned, reproducible experiment:
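A condensed sketch assuming the file names used earlier in this guide (bucket path and commit messages are placeholders):

```shell
# One-time setup
git init my-project && cd my-project
dvc init
git commit -m "Initialize DVC"

# Track the raw data and configure a remote
dvc add data/raw.csv
git add data/raw.csv.dvc data/.gitignore
dvc remote add -d storage s3://my-bucket/dvc-store
git add .dvc/config
git commit -m "Track raw data, configure remote"
dvc push

# Define stages in dvc.yaml, then run and version the experiment
dvc repro
git add dvc.yaml dvc.lock metrics.json
git commit -m "First experiment"
dvc push

# Iterate: edit params.yaml, re-run, compare, commit
dvc repro
dvc metrics diff HEAD
git commit -am "Experiment: higher learning rate"
dvc push
```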
Common Errors and Fixes
ERROR: failed to push data to the cloud
Usually a permissions or credentials issue. Verify your remote is configured and accessible:
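```shell
dvc remote list                       # is a default remote configured?
dvc push -v                           # verbose output shows the failing request
aws s3 ls s3://my-bucket/dvc-store    # for S3: can you reach the bucket at all?
```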
For S3, make sure your IAM user has s3:PutObject and s3:GetObject permissions on the bucket. For GCS, check that your service account has Storage Object Admin.
If pushes fail intermittently on large datasets, DVC tracks what succeeded – just re-run dvc push and it retries only the failed files.
ERROR: failed to pull data from the cloud
This happens when the data was never pushed, or the remote doesn’t have the version you need. Check:
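```shell
dvc status -c    # compares the local cache against the remote
```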
If it shows files as “missing from remote,” someone ran dvc add and committed the .dvc file but never ran dvc push. Track down the commit author and have them push.
ERROR: not a DVC repository
You’re running DVC commands outside a DVC-initialized directory. DVC looks for a .dvc/ directory by walking up from the current working directory. Either cd into the repo or run dvc init first.
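A quick check (the repo path is a placeholder):

```shell
ls -d .dvc 2>/dev/null || echo "not a DVC repository"

cd path/to/your/repo    # move into the initialized repo, or:
dvc init                # initialize DVC if this repo never had it
```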
WARNING: Cache 'xxxxx' not found
The data hash in your .dvc file points to a cache entry that doesn’t exist locally. This is normal after a fresh clone. Run dvc pull to fetch from the remote. If the remote also doesn’t have it, the data was never pushed.
dvc checkout does nothing
Usually the .dvc pointer file is identical between the two revisions, or you checked out code paths without including the .dvc file. Confirm that the pointer actually changed:
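```shell
git diff HEAD~1 -- data.dvc       # if this is empty, there's nothing to check out

git checkout <commit> -- data.dvc # restore the pointer you wanted
dvc checkout                      # then sync the data to it
```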
Merge conflicts in .dvc files
When two branches modify the same dataset, the .dvc file can conflict just like any other Git file. Pick the version you want (or re-run dvc add on the merged data), then commit:
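A sketch of both resolution paths, assuming the conflicted pointer is `data.dvc`:

```shell
# Option 1: keep one side of the conflict
git checkout --theirs data.dvc    # or --ours
dvc checkout                      # sync the workspace to the chosen pointer

# Option 2: merge the data yourself, then re-hash it
dvc add data/

git add data.dvc
git commit -m "Resolve dataset merge"
```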
Tips Worth Knowing
Use .dvcignore to exclude files from DVC tracking. It works like .gitignore but for DVC commands. Handy for excluding temp files or logs inside tracked directories.
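For example, a `.dvcignore` at the repo root might look like this (patterns are illustrative):

```
# .dvcignore — same pattern syntax as .gitignore
*.log
*.tmp
tmp/
```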
Prefer dvc add on directories over individual files. When you dvc add data/, DVC creates a single .dvc file for the entire directory. This is cleaner than tracking dozens of individual files and makes switching versions atomic.
Set up DVC in CI. After cloning in your CI pipeline, run dvc pull to fetch the data your pipeline needs. Use dvc repro to verify that your pipeline still produces the same results. If metrics change, your CI can flag it.
Combine with Git branches for experiments. Create a branch per experiment, modify params.yaml, run dvc repro, and commit the results. Comparing experiments is just dvc metrics diff main experiment-branch.
Related Guides
- How to Validate ML Datasets with Great Expectations
- How to Create and Share Datasets on Hugging Face Hub
- How to Clean and Deduplicate ML Datasets with Python
- How to Stream Real-Time Data for ML with Apache Kafka
- How to Build a Feature Store for ML with Feast
- How to Build a Dataset Monitoring Pipeline with Great Expectations and Airflow
- How to Build a Dataset Bias Detection Pipeline with Python
- How to Build a Data Versioning Pipeline with Delta Lake for ML
- How to Build ETL Pipelines for ML Data with Apache Airflow
- How to Build a Data Drift Detection Pipeline with Whylogs