Your training data changed and now the model is worse. Was it the new labels? The deduplication pass? That CSV someone uploaded last Tuesday? Without version control on your data, you’re guessing.

lakeFS gives you Git-like semantics – branches, commits, diffs, merges – on top of your data lake. You version entire datasets the same way you version code. No copying terabytes around, no naming files dataset_v3_final_FINAL.parquet.

Setting Up lakeFS Locally with Docker

Spin up a lakeFS instance with a single Docker command:

docker run --pull always \
  --name lakefs \
  -p 8000:8000 \
  -v lakefs-data:/home/lakefs \
  treeverse/lakefs:latest \
  run --quickstart

This starts lakeFS on http://localhost:8000 with a local blockstore. The quickstart mode creates default credentials automatically. Open the UI in your browser and you’ll see:

  • Access Key ID: AKIAIOSFOLQUICKSTART
  • Secret Access Key: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY

These are the default quickstart credentials. For production, you’d configure proper auth and a real object store (S3, GCS, Azure Blob).

Install the high-level Python SDK:

pip install lakefs

Creating a Repository and Uploading Data

The lakefs Python package provides a clean, high-level API. Configure the client and create a repository:

import lakefs
from lakefs.client import Client

# Connect to your local lakeFS instance
clt = Client(
    username="AKIAIOSFOLQUICKSTART",
    password="wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
    host="http://localhost:8000",
)

# Create a new repository with local storage
repo = lakefs.Repository("ml-datasets", client=clt).create(
    storage_namespace="local://ml-datasets",
    default_branch="main",
    exist_ok=True,
)
print(f"Repository created: {repo.id}")

The storage_namespace tells lakeFS where to physically store your data. For local development, local:// works fine. In production you’d use s3://your-bucket/prefix or equivalent.

Now upload a dataset to the main branch:

import pandas as pd
import io

# Create a sample dataset
df = pd.DataFrame({
    "text": [
        "The movie was fantastic and well-directed",
        "Terrible acting, waste of time",
        "Average film, nothing special",
        "A masterpiece of modern cinema",
        "Boring and predictable plot",
    ],
    "label": ["positive", "negative", "neutral", "positive", "negative"],
    "split": ["train", "train", "train", "test", "test"],
})

# Convert to Parquet bytes
buffer = io.BytesIO()
df.to_parquet(buffer, index=False)
parquet_bytes = buffer.getvalue()

# Upload to lakeFS
main_branch = repo.branch("main")
obj = main_branch.object("datasets/sentiment/reviews.parquet")
obj.upload(data=parquet_bytes, content_type="application/octet-stream")

# Commit the upload
ref = main_branch.commit(
    message="Add initial sentiment dataset",
    metadata={"source": "manual-labeling", "num_rows": str(len(df))},
)
print(f"Committed: {ref.id}")

Branching for Data Experiments

This is where lakeFS shines. Want to try a different labeling strategy? Create a branch – it’s instant and doesn’t copy any data:

# Create an experiment branch from main
experiment = repo.branch("experiment/relabel-neutral").create(
    source_reference="main",
)
print(f"Branch created: {experiment.id}")

# Read the current dataset from the experiment branch
with experiment.object("datasets/sentiment/reviews.parquet").reader(mode="rb") as f:
    df_exp = pd.read_parquet(io.BytesIO(f.read()))

# Modify labels -- reclassify "neutral" as "negative"
df_exp.loc[df_exp["label"] == "neutral", "label"] = "negative"

# Upload the modified dataset
buffer = io.BytesIO()
df_exp.to_parquet(buffer, index=False)
experiment.object("datasets/sentiment/reviews.parquet").upload(
    data=buffer.getvalue(),
    content_type="application/octet-stream",
)

# Commit the change on the experiment branch
experiment.commit(
    message="Relabel neutral samples as negative",
    metadata={"experiment": "neutral-to-negative"},
)

Your main branch is untouched. You can run training on both branches and compare results before deciding which version to keep.
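To decide which version to keep, it helps to put the two branches' label distributions side by side. The helper below is a plain-pandas sketch of that comparison (compare_label_counts is our own function, not part of the lakeFS SDK); pass it the DataFrames you read from main and from the experiment branch:

```python
import pandas as pd

def compare_label_counts(df_a: pd.DataFrame, df_b: pd.DataFrame,
                         names=("main", "experiment")) -> pd.DataFrame:
    """Side-by-side label counts for two versions of a dataset."""
    return pd.concat(
        [df_a["label"].value_counts(), df_b["label"].value_counts()],
        axis=1, keys=names,
    ).fillna(0).astype(int)

# Example using the sample data from above
df_main = pd.DataFrame({"label": ["positive", "negative", "neutral",
                                  "positive", "negative"]})
df_exp = df_main.replace({"label": {"neutral": "negative"}})
print(compare_label_counts(df_main, df_exp))
```

Labels that exist on only one branch (like neutral here) show up with a zero count on the other side, which makes relabeling experiments easy to spot.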

Diffing and Merging Dataset Changes

Check what changed between branches before merging:

# Diff the experiment branch against main
diffs = list(main_branch.diff(other_ref=experiment))

for change in diffs:
    print(f"  {change.type}: {change.path}")
    # e.g. "  changed: datasets/sentiment/reviews.parquet"
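When a diff touches many paths, it is easier to review grouped by change type. The SDK's diff entries expose type and path attributes (as used in the loop above); this small sketch works on any objects shaped that way, using SimpleNamespace stand-ins so it runs without a server:

```python
from collections import defaultdict
from types import SimpleNamespace

def summarize_diff(entries):
    """Group diff entries into buckets by change type ('added', 'removed', 'changed')."""
    buckets = defaultdict(list)
    for entry in entries:
        buckets[entry.type].append(entry.path)
    return dict(buckets)

# Stand-in entries shaped like the SDK's diff results
entries = [
    SimpleNamespace(type="changed", path="datasets/sentiment/reviews.parquet"),
    SimpleNamespace(type="added", path="datasets/sentiment/extra.parquet"),
]
print(summarize_diff(entries))
```

In practice you would pass it the list returned by branch.diff() instead of the stand-ins.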

The diff tells you which files were added, removed, or modified. Once you’re satisfied with the experiment results, merge it back:

# Merge experiment branch into main
merge_ref = experiment.merge_into(main_branch)
print(f"Merged with commit: {merge_ref}")

# Verify main now has the updated data
with main_branch.object("datasets/sentiment/reviews.parquet").reader(mode="rb") as f:
    df_merged = pd.read_parquet(io.BytesIO(f.read()))

print(df_merged["label"].value_counts())
# negative    3
# positive    2

If something goes wrong after merging, you can revert to any previous commit using ref expressions. The merge commit on main is just another commit in the history, so you always have a path back.

Reading Versioned Data with Pandas

You can access any historical version of your data using commit references:

# List commits on main
for log in main_branch.log():
    print(f"{log.id[:12]}  {log.message}")
    print(f"  metadata: {log.metadata}")

# Read data from the first data commit (the oldest log entry is the
# empty "Repository created" commit made at repository creation, so skip it)
commits = list(main_branch.log())
first_commit_id = commits[-2].id
historical_ref = repo.ref(first_commit_id)

with historical_ref.object("datasets/sentiment/reviews.parquet").reader(mode="rb") as f:
    df_original = pd.read_parquet(io.BytesIO(f.read()))

print(f"Original dataset had {len(df_original)} rows")
print(f"Labels: {df_original['label'].unique()}")
# Labels: ['positive' 'negative' 'neutral']  -- neutral still exists here

This is powerful for reproducibility. Pin your training script to a specific commit and you can always recreate exact results.
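A lightweight way to pin a run is to record the data's full lakeFS address in your training config or experiment tracker. lakeFS addresses objects as lakefs://<repository>/<ref>/<path>, where the ref can be a mutable branch name or an immutable commit ID. This small helper (our own, not SDK-provided) builds such a URI; the commit ID here is a placeholder:

```python
def pinned_uri(repository: str, commit_id: str, path: str) -> str:
    """Build an immutable lakefs:// URI pointing at a specific commit."""
    return f"lakefs://{repository}/{commit_id}/{path.lstrip('/')}"

# Store this alongside your training config for reproducibility
uri = pinned_uri("ml-datasets", "abc123def456",
                 "datasets/sentiment/reviews.parquet")
print(uri)
# lakefs://ml-datasets/abc123def456/datasets/sentiment/reviews.parquet
```

Because a commit ID never changes, any run that records this URI can re-read byte-identical inputs later, even after main has moved on.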

Pre-Commit Hooks for Data Validation

lakeFS supports server-side hooks that run before commits are finalized. Upload a Lua-based validation action to enforce data quality rules:

# Define a pre-commit hook that validates Parquet files
hook_yaml = """
name: Validate Dataset Schema
on:
  pre-commit:
    branches:
      - main
hooks:
  - id: check_commit_metadata
    type: lua
    properties:
      script: |
        -- Require commit messages to be non-empty
        msg = action.commit.message
        if not msg or #msg == 0 then
            error("Commit message cannot be empty")
        end

        -- Require source metadata on all commits
        source = action.commit.metadata["source"]
        if source == nil then
            error("Commits must include 'source' metadata field")
        end

        print("Validation passed: message and source metadata present")
"""

# Upload the hook to the _lakefs_actions/ prefix
main_branch.object("_lakefs_actions/validate_schema.yaml").upload(
    data=hook_yaml.encode("utf-8"),
    content_type="application/x-yaml",
)

main_branch.commit(
    message="Add pre-commit validation hook",
    metadata={"source": "pipeline-setup"},
)

Now any commit to main without a source metadata field will be rejected. You can write more sophisticated Lua hooks that check file formats, validate schemas, or enforce naming conventions.

Test that the hook works:

# This commit should fail -- no source metadata
try:
    main_branch.object("datasets/test.csv").upload(data=b"a,b\n1,2")
    main_branch.commit(message="Add test file")  # No metadata!
except Exception as e:
    print(f"Hook rejected commit: {e}")

# This commit should succeed
main_branch.object("datasets/test.csv").upload(data=b"a,b\n1,2")
ref = main_branch.commit(
    message="Add test file",
    metadata={"source": "unit-test"},
)
print(f"Commit succeeded: {ref.id[:12]}")

Common Errors and Fixes

lakefs.exceptions.NotFoundException: repository not found

You’re referencing a repository that doesn’t exist. Double-check the repository name, or make creation idempotent with exist_ok=True:

repo = lakefs.Repository("my-repo", client=clt).create(
    storage_namespace="local://my-repo",
    exist_ok=True,
)

ConnectionRefusedError: [Errno 111] Connection refused

lakeFS isn’t running or isn’t accessible at the configured host. Verify Docker is running:

docker ps | grep lakefs
# If empty, restart it:
docker start lakefs

lakefs.exceptions.ConflictException: branch already exists

Pass exist_ok=True when creating branches to avoid this:

branch = repo.branch("my-branch").create(source_reference="main", exist_ok=True)

lakefs.exceptions.ConflictException: conflict on merge

Two branches modified the same file. lakeFS doesn’t auto-merge file contents – it works at the object level. Resolve by choosing one version:

# Overwrite the conflicting file on your branch with the main version
with main_branch.object("datasets/data.parquet").reader(mode="rb") as f:
    data = f.read()

experiment.object("datasets/data.parquet").upload(data=data)
experiment.commit(
    message="Resolve conflict: use main version",
    metadata={"source": "conflict-resolution"},
)
# Now retry the merge
experiment.merge_into(main_branch)

PermissionError when using Docker volumes

If lakeFS writes root-owned files, run the container with your user ID:

docker run --user $(id -u):$(id -g) \
  -p 8000:8000 \
  -v lakefs-data:/home/lakefs \
  treeverse/lakefs:latest run --quickstart