The Full Workflow in 60 Seconds
You have local data files. You want them on Hugging Face Hub so your team (or the world) can load them with load_dataset("your-username/my-dataset"). Here is the fastest path from raw files to a published dataset:
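A minimal sketch, assuming your data is a local CSV file (the file name and repo id are placeholders):

```python
from datasets import load_dataset

# Read the local file into a DatasetDict, then publish it to the Hub.
dataset = load_dataset("csv", data_files="my_data.csv")
dataset.push_to_hub("your-username/my-dataset")
```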
That is it. The push_to_hub call handles repo creation, Parquet conversion, and uploading. Anyone can now load your dataset with:
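Loading it back looks like this (repo id is a placeholder):

```python
from datasets import load_dataset

dataset = load_dataset("your-username/my-dataset")
```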
You need to authenticate first. Run this once:
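From a terminal:

```shell
huggingface-cli login
```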
Paste your access token from huggingface.co/settings/tokens when prompted. The token gets cached locally so you only do this once per machine.
Creating Datasets from Different File Formats
The datasets library natively supports CSV, JSON, JSON Lines, Parquet, Arrow, and text files. You rarely need pandas as a middleman.
From CSV
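For example, mapping local CSV files to named splits (the file names are placeholders):

```python
from datasets import load_dataset

dataset = load_dataset(
    "csv",
    data_files={"train": "train.csv", "test": "test.csv"},
)
```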
From JSON Lines
JSON Lines (one JSON object per line) is the most common format for NLP datasets:
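A sketch, assuming a local data.jsonl file:

```python
from datasets import load_dataset

# The "json" builder handles both JSON arrays and JSON Lines.
dataset = load_dataset("json", data_files="data.jsonl")
```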
From Parquet
Parquet is the best choice for large datasets. It is columnar, compressed, and supports efficient partial reads:
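A sketch, assuming sharded Parquet files under a local data/ directory:

```python
from datasets import load_dataset

# Glob patterns work, so sharded files load as one dataset.
dataset = load_dataset("parquet", data_files="data/*.parquet")
```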
From a Python Dictionary
When you are generating data programmatically – say, from an API or a synthetic data pipeline – build the dataset directly:
Defining Splits and Features
Splits (train, test, validation) are not just conventions – downstream training libraries like Transformers expect them. Define them explicitly:
Specifying Features explicitly matters. Without it, the library infers types, and it often gets them wrong – treating integer labels as plain int64 instead of ClassLabel, or leaving string IDs as generic Value("string") when they should be categorical. ClassLabel is particularly important because it enables stratified splitting and gives downstream users the label-to-name mapping for free.
Adding a Dataset Card
A dataset without a README is a dataset nobody trusts. Hugging Face uses a README.md with YAML front matter as the dataset card. You can create it programmatically:
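A sketch using the card helpers in huggingface_hub (the repo id and card body are placeholders):

```python
from huggingface_hub import DatasetCard, DatasetCardData

# Structured metadata that becomes the YAML front matter.
card_data = DatasetCardData(
    license="mit",
    language="en",
    task_categories=["text-classification"],
)
body = "# My Dataset\n\nWhat is in it, how it was collected, known limitations.\n"
card = DatasetCard(f"---\n{card_data.to_yaml()}\n---\n\n{body}")
card.push_to_hub("your-username/my-dataset")
```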
Or just create a README.md in your dataset repo manually. The YAML header is what populates the Hub’s metadata sidebar – license, language, task type. Fill it out. People filter datasets by these fields.
Pushing Updates and Versioning
Hub datasets are Git repos under the hood. Every push_to_hub call creates a new commit. For controlled versioning, use branches or revisions:
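One possible flow, assuming `dataset` was built earlier and a reasonably recent datasets release (the `revision` argument to push_to_hub replaced the older `branch` argument):

```python
from huggingface_hub import HfApi
from datasets import load_dataset

# Create a branch, push to it, and pin later loads to that revision.
api = HfApi()
api.create_branch("your-username/my-dataset", repo_type="dataset", branch="v1.0")
dataset.push_to_hub("your-username/my-dataset", revision="v1.0")

pinned = load_dataset("your-username/my-dataset", revision="v1.0")
```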
For large-scale updates where you want to append data without re-uploading everything, upload individual Parquet files with the huggingface_hub client:
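A sketch (the shard file name and repo id are placeholders):

```python
from huggingface_hub import HfApi

api = HfApi()
api.upload_file(
    path_or_fileobj="train-00003.parquet",    # new local shard
    path_in_repo="data/train-00003.parquet",  # matches the existing naming scheme
    repo_id="your-username/my-dataset",
    repo_type="dataset",
)
```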
This avoids re-uploading earlier shards. The Hub automatically picks up all Parquet files matching the data/train-*.parquet pattern for the train split.
Loading with Streaming for Large Datasets
If your dataset is tens of gigabytes, you do not want to download the whole thing to disk just to read the first 100 rows. Streaming mode iterates over records without downloading:
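A sketch (repo id is a placeholder):

```python
from datasets import load_dataset

stream = load_dataset("your-username/my-dataset", split="train", streaming=True)

# Only the records you actually iterate over are fetched.
for i, example in enumerate(stream):
    print(example)
    if i == 99:
        break
```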
Streaming works with filter, map, and shuffle – the processing happens lazily as you iterate. This is essential for working with datasets that do not fit in memory.
Filtering and Mapping
The map and filter methods are how you preprocess data before training. They cache results to disk so repeat runs are fast, and they can run in parallel when you pass num_proc (by default they use a single process):
The batched=True flag is critical for performance. Without it, the function is called once per row. With it, you get batches of rows as dictionaries of lists, which tokenizers and NumPy operations handle much faster.
Using Datasets in a Training Loop
Here is how you wire a Hub dataset directly into a PyTorch training loop:
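A sketch, assuming the dataset was already tokenized to fixed-length input_ids so default collation works (repo id and column names are placeholders):

```python
import torch
from torch.utils.data import DataLoader
from datasets import load_dataset

dataset = load_dataset("your-username/my-dataset", split="train")
dataset.set_format("torch", columns=["input_ids", "label"])

loader = DataLoader(dataset, batch_size=32, shuffle=True)
for batch in loader:
    input_ids = batch["input_ids"]  # torch.Tensor
    labels = batch["label"]
    # model forward, loss, backward, optimizer.step() ...
```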
set_format is a zero-copy operation – it does not duplicate data, it just changes how __getitem__ returns values. This means you can switch between "torch", "numpy", and "pandas" formats without memory overhead.
Common Errors and Fixes
huggingface_hub.errors.HfHubHTTPError: 401 Client Error
Your token is missing or expired. Re-authenticate:
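```shell
huggingface-cli whoami   # check which account (if any) is logged in
huggingface-cli login    # paste a token with write access
```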
Make sure the token has write permissions. Read-only tokens can load datasets but cannot push.
FileNotFoundError When Loading from Hub
The dataset repo exists but has no data files, or the files are in an unexpected directory structure. The Hub expects Parquet files under data/ or at the repo root. Check the repo contents:
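A sketch (repo id is a placeholder):

```python
from huggingface_hub import HfApi

files = HfApi().list_repo_files("your-username/my-dataset", repo_type="dataset")
print(files)
```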
pyarrow.lib.ArrowInvalid: Could not convert X with type Y
Your data has mixed types in a column – some rows are strings, others are integers. Arrow enforces strict schemas. Fix it before creating the dataset:
datasets.exceptions.DatasetGenerationError on CSV Load
Usually caused by mismatched column counts or encoding issues. Specify the delimiter and encoding explicitly:
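A sketch; the CSV builder forwards extra keyword arguments to pandas.read_csv, so sep and encoding pass through (file name and separator are placeholders):

```python
from datasets import load_dataset

dataset = load_dataset(
    "csv",
    data_files="data.csv",
    sep=";",            # set to whatever the file actually uses
    encoding="utf-8",
)
```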
If some rows have extra commas in text fields, you need to fix the CSV upstream or switch to JSON Lines, which does not have delimiter ambiguity.
push_to_hub Hangs on Large Datasets
For datasets over a few GB, the default upload can time out or run out of memory because it tries to convert everything to Parquet in one shot. Shard first:
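A sketch, assuming `dataset` was built earlier (the repo id is a placeholder):

```python
# max_shard_size caps each Parquet file, so the upload proceeds shard by shard.
dataset.push_to_hub("your-username/my-dataset", max_shard_size="500MB")
```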
This splits the upload into multiple Parquet files of roughly 500 MB each. The Hub reassembles them transparently on load.
Slow map Operations
If your map call is slow, check two things. First, use batched=True – per-row processing is 10-50x slower. Second, increase num_proc for CPU-bound transforms:
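A sketch, assuming `dataset` was built earlier; `preprocess` is a placeholder for your transform:

```python
import os

dataset = dataset.map(
    preprocess,                            # your batched transform (placeholder)
    batched=True,                          # batch-level calls, not per-row
    num_proc=min(4, os.cpu_count() or 1),  # stay at or below core count
)
```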
Do not set num_proc higher than your CPU core count. On machines with limited RAM, more processes can cause OOM kills because each process loads its own shard into memory.
Related Guides
- How to Generate Synthetic Training Data with Hugging Face’s Synthetic Data Generator Without Triggering Model Collapse
- How to Version ML Datasets with DVC
- How to Build a Data Versioning Pipeline with Delta Lake for ML
- How to Clean and Deduplicate ML Datasets with Python
- How to Validate ML Datasets with Great Expectations
- How to Create Synthetic Training Data with LLMs
- How to Process Large Datasets with Polars for ML
- How to Build a Data Schema Evolution Pipeline for ML Datasets
- How to Build a Dataset Bias Detection Pipeline with Python
- How to Build a Data Labeling Pipeline with Label Studio