Most data lake tutorials jump straight to AWS S3 and assume you have a cloud account ready. If you want to prototype locally, MinIO gives you a fully S3-compatible object store that runs in a single Docker container. Pair it with PyArrow’s native S3 filesystem and Parquet writer, and you get a fast ingestion pipeline that works identically whether you’re writing to a local MinIO bucket or a production S3 endpoint.
Here’s the quick version – spin up MinIO and write a Parquet file in under 20 lines:
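First, the container (the default minioadmin credentials and port mapping referenced below):

```shell
docker run -d --name minio \
  -p 9000:9000 -p 9001:9001 \
  -e MINIO_ROOT_USER=minioadmin \
  -e MINIO_ROOT_PASSWORD=minioadmin \
  minio/minio server /data --console-address ":9001"
```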
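Then the write. This is a minimal sketch that assumes MinIO is up on localhost:9000; the bucket name datalake and the sample columns are placeholders:

```python
import pyarrow as pa
import pyarrow.parquet as pq
from pyarrow import fs

s3 = fs.S3FileSystem(
    endpoint_override="localhost:9000",
    access_key="minioadmin",
    secret_key="minioadmin",
    scheme="http",
    allow_bucket_creation=True,  # needed so create_dir can make the bucket
)
s3.create_dir("datalake")  # creates the bucket if it doesn't exist

table = pa.table({"id": [1, 2, 3], "value": ["a", "b", "c"]})
pq.write_table(table, "datalake/events/sample.parquet", filesystem=s3)
```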
That’s the foundation. Now let’s build a real ingestion pipeline on top of it.
Setting Up MinIO with Docker
The docker run command above starts MinIO with the default credentials minioadmin/minioadmin. Port 9000 is the S3 API endpoint, and 9001 is the web console where you can browse buckets and objects.
Once the container is running, verify it’s healthy:
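One way to check, using MinIO's liveness endpoint:

```shell
# the container should show up as running
docker ps --filter name=minio

# returns HTTP 200 when the server is healthy
curl -sf http://localhost:9000/minio/health/live && echo "MinIO is up"
```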
You can also open http://localhost:9001 in your browser to access the MinIO Console. Log in with minioadmin / minioadmin to create buckets and inspect uploaded files.
For the Python side, install the dependencies:
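PyArrow is the only hard requirement; the minio client is optional, as noted below:

```shell
pip install pyarrow

# optional: only for bucket-admin operations beyond PyArrow's filesystem
pip install minio
```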
PyArrow’s S3FileSystem handles all the S3 protocol communication. You don’t need boto3 or the minio Python client for reads and writes – PyArrow talks S3 natively. The minio package is only useful if you need bucket management operations beyond what PyArrow’s filesystem exposes.
Writing Partitioned Parquet Files
Flat Parquet files work fine for small datasets, but once you’re ingesting millions of rows, you need partitioning. Partitioning splits data into directory trees based on column values, so queries that filter on the partition column only read the relevant files.
PyArrow’s dataset writer, pyarrow.dataset.write_dataset, handles this automatically:
This creates a directory structure like:
```
datalake/events/
├── event_date=2024-01-01/
│   └── part-0.parquet
├── event_date=2024-01-02/
│   └── part-0.parquet
└── event_date=2024-01-03/
    └── part-0.parquet
```
The flavor="hive" argument produces the column=value directory naming that tools like Spark, Trino, and DuckDB all understand. The existing_data_behavior="overwrite_or_ignore" flag lets you re-run ingestion without errors from existing files.
Reading Data Back with the Dataset API
Once your data is partitioned in MinIO, PyArrow’s dataset API reads it back efficiently. It pushes partition filters down so only the relevant directories get scanned:
The filter and column projection happen at the scan level, not after loading everything into memory. For large datasets this is the difference between reading 10 MB and 10 GB.
Complete Ingestion Function
Here’s a reusable function that takes a CSV or JSON file and writes it into the data lake as partitioned Parquet. It handles both formats, validates the partition column exists, and appends to existing partitions without overwriting:
To test it, create a sample CSV:
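The columns here are illustrative; the only requirement is that the partition column you pass (e.g. event_date) actually exists in the file:

```shell
cat > events.csv <<'EOF'
event_id,user_id,event_date,value
1,101,2024-01-01,9.99
2,102,2024-01-01,4.50
3,103,2024-01-02,12.00
EOF
```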
Then run:
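Assuming the ingestion function above is in scope and MinIO is running with the default credentials (ingest_file and the datalake/events root are names from this guide):

```python
from pyarrow import fs

s3 = fs.S3FileSystem(
    endpoint_override="localhost:9000",
    access_key="minioadmin",
    secret_key="minioadmin",
    scheme="http",
)
ingest_file("events.csv", s3, "datalake/events", ["event_date"])
```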
Common Errors and Fixes
OSError: When resolving region for bucket 'datalake': AWS Error NETWORK_CONNECTION
MinIO isn’t running or the endpoint is wrong. Check that the Docker container is up with docker ps and that you’re connecting to port 9000 (the API port), not 9001 (the console).
FileNotFoundError or OSError: Path does not exist when reading
The bucket doesn’t exist yet. PyArrow’s S3FileSystem won’t create buckets implicitly on reads. Call s3.create_dir("your-bucket") before writing, or create the bucket through the MinIO Console.
ArrowInvalid: Partition column 'event_date' is not in the schema
Your partition column names in partition_cols don’t match the column names in the source file. Column names are case-sensitive – check for typos and casing differences. Print table.column_names to see what you’re working with.
OSError: When reading information for key '...': AWS Error ACCESS_DENIED
Wrong credentials. The default MinIO credentials are minioadmin/minioadmin unless you changed them in the docker run command. Double-check the MINIO_ROOT_USER and MINIO_ROOT_PASSWORD environment variables match your Python code.
Parquet files are tiny (a few KB each) and there are thousands of them
This happens when you write many small batches instead of accumulating data first. Small Parquet files hurt read performance because of per-file overhead. Buffer your data and write in larger batches (aim for 100 MB+ per file), or run a periodic compaction job that reads all small files in a partition and rewrites them as a single file:
The existing_data_behavior="delete_matching" flag removes old files in each partition before writing the compacted replacements. The max_rows_per_file parameter controls how large each output file gets.
Related Guides
- How to Build a Dataset Export Pipeline with Multiple Format Support
- How to Build a Streaming Data Ingestion Pipeline with Apache Arrow
- How to Build a Dataset Merge and Conflict Resolution Pipeline
- How to Build a Data Contamination Detection Pipeline for LLM Training
- How to Build a Data Freshness Monitoring Pipeline with Python
- How to Build a Data Versioning Pipeline with Delta Lake for ML
- How to Build a Data Labeling Pipeline with Label Studio
- How to Build a Dataset Versioning Pipeline with LakeFS
- How to Build a Data Sampling Pipeline for Large-Scale ML Training
- How to Build a Feature Importance and Selection Pipeline with Scikit-Learn