The Full Workflow in 60 Seconds
You have local data files. You want them on Hugging Face Hub so your team (or the world) can load them with load_dataset("your-username/my-dataset"). Here is the fastest path from raw files to a published dataset:
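A minimal sketch, assuming your data is a local CSV file (the file name and repo id are placeholders):

```python
from datasets import load_dataset

# Read the local file into a DatasetDict, then publish it to the Hub.
dataset = load_dataset("csv", data_files="my_data.csv")
dataset.push_to_hub("your-username/my-dataset")
```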
That is it. The push_to_hub call handles repo creation, Parquet conversion, and uploading. Anyone can now load your dataset with:
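Loading it back looks like this (repo id is a placeholder):

```python
from datasets import load_dataset

dataset = load_dataset("your-username/my-dataset")
```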
You need to authenticate first. Run this once:
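From a terminal:

```shell
huggingface-cli login
```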
Paste your access token from huggingface.co/settings/tokens when prompted. The token gets cached locally so you only do this once per machine.
Creating Datasets from Different File Formats
The datasets library natively supports CSV, JSON, JSON Lines, Parquet, Arrow, and text files. You rarely need pandas as a middleman.
From CSV
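For example, mapping local CSV files to named splits (the file names are placeholders):

```python
from datasets import load_dataset

dataset = load_dataset(
    "csv",
    data_files={"train": "train.csv", "test": "test.csv"},
)
```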
From JSON Lines
JSON Lines (one JSON object per line) is the most common format for NLP datasets:
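A sketch, assuming a local data.jsonl file:

```python
from datasets import load_dataset

# The "json" builder handles both JSON arrays and JSON Lines.
dataset = load_dataset("json", data_files="data.jsonl")
```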
From Parquet
Parquet is the best choice for large datasets. It is columnar, compressed, and supports efficient partial reads:
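A sketch, assuming sharded Parquet files under a local data/ directory:

```python
from datasets import load_dataset

# Glob patterns work, so sharded files load as one dataset.
dataset = load_dataset("parquet", data_files="data/*.parquet")
```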
From a Python Dictionary
When you are generating data programmatically – say, from an API or a synthetic data pipeline – build the dataset directly:
Defining Splits and Features
Splits (train, test, validation) are not just conventions – downstream training libraries like Transformers expect them. Define them explicitly:
Specifying Features explicitly matters. Without it, the library infers types, and it often gets them wrong – treating integer labels as plain int64 instead of ClassLabel, or leaving string IDs as generic Value("string") when they should be categorical. ClassLabel is particularly important because it enables stratified splitting and gives downstream users the label-to-name mapping for free.
Adding a Dataset Card
A dataset without a README is a dataset nobody trusts. Hugging Face uses a README.md with YAML front matter as the dataset card. You can create it programmatically:
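A sketch using the card helpers in huggingface_hub (the repo id and card body are placeholders):

```python
from huggingface_hub import DatasetCard, DatasetCardData

# Structured metadata that becomes the YAML front matter.
card_data = DatasetCardData(
    license="mit",
    language="en",
    task_categories=["text-classification"],
)
body = "# My Dataset\n\nWhat is in it, how it was collected, known limitations.\n"
card = DatasetCard(f"---\n{card_data.to_yaml()}\n---\n\n{body}")
card.push_to_hub("your-username/my-dataset")
```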
Or just create a README.md in your dataset repo manually. The YAML header is what populates the Hub’s metadata sidebar – license, language, task type. Fill it out. People filter datasets by these fields.
Pushing Updates and Versioning
Hub datasets are Git repos under the hood. Every push_to_hub call creates a new commit. For controlled versioning, use branches or revisions:
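One possible flow, assuming `dataset` was built earlier and a reasonably recent datasets release (the `revision` argument to push_to_hub replaced the older `branch` argument):

```python
from huggingface_hub import HfApi
from datasets import load_dataset

# Create a branch, push to it, and pin later loads to that revision.
api = HfApi()
api.create_branch("your-username/my-dataset", repo_type="dataset", branch="v1.0")
dataset.push_to_hub("your-username/my-dataset", revision="v1.0")

pinned = load_dataset("your-username/my-dataset", revision="v1.0")
```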
For large-scale updates where you want to append data without re-uploading everything, upload individual Parquet files with the huggingface_hub client:
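A sketch (the shard file name and repo id are placeholders):

```python
from huggingface_hub import HfApi

api = HfApi()
api.upload_file(
    path_or_fileobj="train-00003.parquet",    # new local shard
    path_in_repo="data/train-00003.parquet",  # matches the existing naming scheme
    repo_id="your-username/my-dataset",
    repo_type="dataset",
)
```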
This avoids re-uploading earlier shards. The Hub automatically picks up all Parquet files matching the data/train-*.parquet pattern for the train split.
Loading with Streaming for Large Datasets
If your dataset is tens of gigabytes, you do not want to download the whole thing to disk just to read the first 100 rows. Streaming mode iterates over records without downloading:
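A sketch (repo id is a placeholder):

```python
from datasets import load_dataset

stream = load_dataset("your-username/my-dataset", split="train", streaming=True)

# Only the records you actually iterate over are fetched.
for i, example in enumerate(stream):
    print(example)
    if i == 99:
        break
```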
Streaming works with filter, map, and shuffle – the processing happens lazily as you iterate. This is essential for working with datasets that do not fit in memory.
Filtering and Mapping
The map and filter methods are how you preprocess data before training. They cache results to disk so repeat runs are fast, and they can run in parallel when you pass num_proc (by default they use a single process):
The batched=True flag is critical for performance. Without it, the function is called once per row. With it, you get batches of rows as dictionaries of lists, which tokenizers and NumPy operations handle much faster.
Using Datasets in a Training Loop
Here is how you wire a Hub dataset directly into a PyTorch training loop:
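A sketch, assuming the dataset was already tokenized to fixed-length input_ids so default collation works (repo id and column names are placeholders):

```python
import torch
from torch.utils.data import DataLoader
from datasets import load_dataset

dataset = load_dataset("your-username/my-dataset", split="train")
dataset.set_format("torch", columns=["input_ids", "label"])

loader = DataLoader(dataset, batch_size=32, shuffle=True)
for batch in loader:
    input_ids = batch["input_ids"]  # torch.Tensor
    labels = batch["label"]
    # model forward, loss, backward, optimizer.step() ...
```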
set_format is a zero-copy operation – it does not duplicate data, it just changes how __getitem__ returns values. This means you can switch between "torch", "numpy", and "pandas" formats without memory overhead.
Common Errors and Fixes
huggingface_hub.errors.HfHubHTTPError: 401 Client Error
Your token is missing or expired. Re-authenticate:
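```shell
huggingface-cli whoami   # check which account (if any) is logged in
huggingface-cli login    # paste a token with write access
```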
Make sure the token has write permissions. Read-only tokens can load datasets but cannot push.
FileNotFoundError When Loading from Hub
The dataset repo exists but has no data files, or the files are in an unexpected directory structure. The Hub expects Parquet files under data/ or at the repo root. Check the repo contents:
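A sketch (repo id is a placeholder):

```python
from huggingface_hub import HfApi

files = HfApi().list_repo_files("your-username/my-dataset", repo_type="dataset")
print(files)
```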
pyarrow.lib.ArrowInvalid: Could not convert X with type Y
Your data has mixed types in a column – some rows are strings, others are integers. Arrow enforces strict schemas. Fix it before creating the dataset:
datasets.exceptions.DatasetGenerationError on CSV Load
Usually caused by mismatched column counts or encoding issues. Specify the delimiter and encoding explicitly:
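A sketch; the CSV builder forwards extra keyword arguments to pandas.read_csv, so sep and encoding pass through (file name and separator are placeholders):

```python
from datasets import load_dataset

dataset = load_dataset(
    "csv",
    data_files="data.csv",
    sep=";",            # set to whatever the file actually uses
    encoding="utf-8",
)
```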
If some rows have extra commas in text fields, you need to fix the CSV upstream or switch to JSON Lines, which does not have delimiter ambiguity.
push_to_hub Hangs on Large Datasets
For datasets over a few GB, the default upload can time out or run out of memory because it tries to convert everything to Parquet in one shot. Shard first:
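A sketch, assuming `dataset` was built earlier (the repo id is a placeholder):

```python
# max_shard_size caps each Parquet file, so the upload proceeds shard by shard.
dataset.push_to_hub("your-username/my-dataset", max_shard_size="500MB")
```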
This splits the upload into multiple Parquet files of roughly 500 MB each. The Hub reassembles them transparently on load.
Slow map Operations
If your map call is slow, check two things. First, use batched=True – per-row processing is 10-50x slower. Second, increase num_proc for CPU-bound transforms:
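A sketch, assuming `dataset` was built earlier; `preprocess` is a placeholder for your transform:

```python
import os

dataset = dataset.map(
    preprocess,                            # your batched transform (placeholder)
    batched=True,                          # batch-level calls, not per-row
    num_proc=min(4, os.cpu_count() or 1),  # stay at or below core count
)
```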
Do not set num_proc higher than your CPU core count. On machines with limited RAM, more processes can cause OOM kills because each process loads its own shard into memory.
Related Guides
- How to Generate Synthetic Training Data with Hugging Face’s Synthetic Data Generator Without Triggering Model Collapse
- How to Version ML Datasets with DVC
- How to Build a Data Versioning Pipeline with Delta Lake for ML
- How to Clean and Deduplicate ML Datasets with Python
- How to Validate ML Datasets with Great Expectations
- How to Create Synthetic Training Data with LLMs
- How to Process Large Datasets with Polars for ML
- How to Build a Data Schema Evolution Pipeline for ML Datasets
- How to Build a Dataset Bias Detection Pipeline with Python
- How to Build a Data Labeling Pipeline with Label Studio