Most data lake tutorials jump straight to AWS S3 and assume you have a cloud account ready. If you want to prototype locally, MinIO gives you a fully S3-compatible object store that runs in a single Docker container. Pair it with PyArrow’s native S3 filesystem and Parquet writer, and you get a fast ingestion pipeline that works identically whether you’re writing to a local MinIO bucket or a production S3 endpoint.
Here’s the quick version – spin up MinIO and write a Parquet file in under 20 lines:
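First, the container (the default minioadmin credentials and port mapping referenced below):

```shell
docker run -d --name minio \
  -p 9000:9000 -p 9001:9001 \
  -e MINIO_ROOT_USER=minioadmin \
  -e MINIO_ROOT_PASSWORD=minioadmin \
  minio/minio server /data --console-address ":9001"
```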
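Then the write. This is a minimal sketch that assumes MinIO is up on localhost:9000; the bucket name datalake and the sample columns are placeholders:

```python
import pyarrow as pa
import pyarrow.parquet as pq
from pyarrow import fs

s3 = fs.S3FileSystem(
    endpoint_override="localhost:9000",
    access_key="minioadmin",
    secret_key="minioadmin",
    scheme="http",
    allow_bucket_creation=True,  # needed so create_dir can make the bucket
)
s3.create_dir("datalake")  # creates the bucket if it doesn't exist

table = pa.table({"id": [1, 2, 3], "value": ["a", "b", "c"]})
pq.write_table(table, "datalake/events/sample.parquet", filesystem=s3)
```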
That’s the foundation. Now let’s build a real ingestion pipeline on top of it.
Setting Up MinIO with Docker
The docker run command above starts MinIO with the default credentials minioadmin/minioadmin. Port 9000 is the S3 API endpoint, and 9001 is the web console where you can browse buckets and objects.
Once the container is running, verify it’s healthy:
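One way to check, using MinIO's liveness endpoint:

```shell
# the container should show up as running
docker ps --filter name=minio

# returns HTTP 200 when the server is healthy
curl -sf http://localhost:9000/minio/health/live && echo "MinIO is up"
```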
You can also open http://localhost:9001 in your browser to access the MinIO Console. Log in with minioadmin / minioadmin to create buckets and inspect uploaded files.
For the Python side, install the dependencies:
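PyArrow is the only hard requirement; the minio client is optional, as noted below:

```shell
pip install pyarrow

# optional: only for bucket-admin operations beyond PyArrow's filesystem
pip install minio
```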
PyArrow’s S3FileSystem handles all the S3 protocol communication. You don’t need boto3 or the minio Python client for reads and writes – PyArrow talks S3 natively. The minio package is only useful if you need bucket management operations beyond what PyArrow’s filesystem exposes.
Writing Partitioned Parquet Files
Flat Parquet files work fine for small datasets, but once you’re ingesting millions of rows, you need partitioning. Partitioning splits data into directory trees based on column values, so queries that filter on the partition column only read the relevant files.
PyArrow’s dataset writer, pyarrow.dataset.write_dataset, handles this automatically:
This creates a directory structure like:
```
datalake/events/
├── event_date=2024-01-01/
│   └── part-0.parquet
├── event_date=2024-01-02/
│   └── part-0.parquet
└── event_date=2024-01-03/
    └── part-0.parquet
```
The flavor="hive" argument produces the column=value directory naming that tools like Spark, Trino, and DuckDB all understand. The existing_data_behavior="overwrite_or_ignore" flag lets you re-run ingestion without errors from existing files.
Reading Data Back with the Dataset API
Once your data is partitioned in MinIO, PyArrow’s dataset API reads it back efficiently. It pushes partition filters down so only the relevant directories get scanned:
The filter and column projection happen at the scan level, not after loading everything into memory. For large datasets this is the difference between reading 10 MB and 10 GB.
Complete Ingestion Function
Here’s a reusable function that takes a CSV or JSON file and writes it into the data lake as partitioned Parquet. It handles both formats, validates the partition column exists, and appends to existing partitions without overwriting:
To test it, create a sample CSV:
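The columns here are illustrative; the only requirement is that the partition column you pass (e.g. event_date) actually exists in the file:

```shell
cat > events.csv <<'EOF'
event_id,user_id,event_date,value
1,101,2024-01-01,9.99
2,102,2024-01-01,4.50
3,103,2024-01-02,12.00
EOF
```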
Then run:
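Assuming the ingestion function above is in scope and MinIO is running with the default credentials (ingest_file and the datalake/events root are names from this guide):

```python
from pyarrow import fs

s3 = fs.S3FileSystem(
    endpoint_override="localhost:9000",
    access_key="minioadmin",
    secret_key="minioadmin",
    scheme="http",
)
ingest_file("events.csv", s3, "datalake/events", ["event_date"])
```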
Common Errors and Fixes
OSError: When resolving region for bucket 'datalake': AWS Error NETWORK_CONNECTION
MinIO isn’t running or the endpoint is wrong. Check that the Docker container is up with docker ps and that you’re connecting to port 9000 (the API port), not 9001 (the console).
FileNotFoundError or OSError: Path does not exist when reading
The bucket doesn’t exist yet. PyArrow’s S3FileSystem won’t create buckets implicitly on reads. Call s3.create_dir("your-bucket") before writing, or create the bucket through the MinIO Console.
ArrowInvalid: Partition column 'event_date' is not in the schema
Your partition column names in partition_cols don’t match the column names in the source file. Column names are case-sensitive – check for typos and casing differences. Print table.column_names to see what you’re working with.
OSError: When reading information for key '...': AWS Error ACCESS_DENIED
Wrong credentials. The default MinIO credentials are minioadmin/minioadmin unless you changed them in the docker run command. Double-check the MINIO_ROOT_USER and MINIO_ROOT_PASSWORD environment variables match your Python code.
Parquet files are tiny (a few KB each) and there are thousands of them
This happens when you write many small batches instead of accumulating data first. Small Parquet files hurt read performance because of per-file overhead. Buffer your data and write in larger batches (aim for 100 MB+ per file), or run a periodic compaction job that reads all small files in a partition and rewrites them as a single file:
The existing_data_behavior="delete_matching" flag removes old files in each partition before writing the compacted replacements. The max_rows_per_file parameter controls how large each output file gets.
Related Guides
- How to Build a Dataset Export Pipeline with Multiple Format Support
- How to Build a Streaming Data Ingestion Pipeline with Apache Arrow
- How to Build a Dataset Merge and Conflict Resolution Pipeline
- How to Build a Data Contamination Detection Pipeline for LLM Training
- How to Build a Data Freshness Monitoring Pipeline with Python
- How to Build a Data Versioning Pipeline with Delta Lake for ML
- How to Build a Data Labeling Pipeline with Label Studio
- How to Build a Dataset Versioning Pipeline with LakeFS
- How to Build a Data Sampling Pipeline for Large-Scale ML Training
- How to Build a Feature Importance and Selection Pipeline with Scikit-Learn