Data and Datasets

How to Generate Synthetic Training Data with Hugging Face's Synthetic Data Generator Without Triggering Model Collapse

Build synthetic training datasets using distilabel pipelines, then validate diversity and deduplicate to keep your model from collapsing on its own outputs.

How to Anonymize Training Data for ML Privacy

Protect sensitive data while training ML models with proven anonymization techniques and ready-to-use Python code for real datasets.

How to Build a Data Annotation Pipeline with Argilla

Build a complete annotation workflow with Argilla to label data, collect feedback, and improve your models

How to Build a Data Augmentation Pipeline for Tabular Data

Augment tabular datasets for ML training using oversampling, synthetic data generation, and noise injection techniques.

How to Build a Data Contamination Detection Pipeline for LLM Training

Find and flag contaminated samples in your training data before they leak benchmark answers into your model

How to Build a Data Drift Detection Pipeline with whylogs

Detect data drift early using whylogs profiles, statistical tests, and automated constraint validation in Python

How to Build a Data Freshness Monitoring Pipeline with Python

Detect stale training data and broken pipelines early with automated freshness checks and Slack alerts in Python

How to Build a Data Labeling Pipeline with Label Studio

Build production labeling pipelines that combine human annotators with ML-assisted pre-labeling using Label Studio

How to Build a Data Lake Ingestion Pipeline with MinIO and PyArrow

Ingest raw data into a MinIO data lake as partitioned Parquet files using PyArrow and Python

How to Build a Data Lineage Tracker for ML Pipelines

Build a lightweight data lineage system that records every transformation from raw data to training set

How to Build a Data Outlier Detection Pipeline with PyOD

Detect and remove outliers from your training data using multiple PyOD detectors, model combination, and automated scoring

How to Build a Data Profiling and Auto-Cleaning Pipeline with Python

Profile messy datasets and auto-fix missing values, outliers, type errors, and duplicates with a reusable Python pipeline

How to Build a Data Quality Pipeline with Cleanlab

Find label errors, outliers, and duplicates in your training data using Cleanlab’s Python library with any classifier

How to Build a Data Reconciliation Pipeline for ML Training Sets

Compare ML training set versions automatically and catch drift, schema changes, and distribution shifts before they break your models.

How to Build a Data Sampling Pipeline for Large-Scale ML Training

Sample large datasets intelligently for ML training using stratified splits, importance weighting, and Polars

How to Build a Data Schema Evolution Pipeline for ML Datasets

Manage column additions, renames, and type changes in ML datasets with automated schema migrations

How to Build a Data Slicing and Stratification Pipeline for ML

Find weak spots in your model by slicing data into segments and evaluating per-slice performance.

How to Build a Data Validation Pipeline with Pydantic and Pandera

Catch bad data before it ruins your model by combining Pydantic and Pandera into a single validation pipeline.

How to Build a Data Versioning Pipeline with Delta Lake for ML

Version your ML datasets with Delta Lake and get time travel, schema enforcement, and audit history

How to Build a Dataset Bias Detection Pipeline with Python

Build automated checks that catch dataset biases before they poison your model training

How to Build a Dataset Card Generator for ML Documentation

Generate standardized dataset documentation with statistics, distribution plots, and bias checks automatically

How to Build a Dataset Changelog and Diff Pipeline with Python

Detect added, removed, and modified rows between dataset versions with a hash-based diff pipeline that produces clear changelogs.
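As a rough illustration of the hash-based diff idea named above, the sketch below fingerprints each row and compares two versions by primary key. It is only a minimal stand-in, not the linked article's implementation: the `row_hash` and `diff_datasets` helpers, the `id` key column, and the sample rows are all hypothetical.

```python
import hashlib
import json


def row_hash(row: dict) -> str:
    """Return a stable SHA-256 fingerprint of a row (keys sorted for determinism)."""
    payload = json.dumps(row, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def diff_datasets(old_rows, new_rows, key="id"):
    """Compare two dataset versions by key and per-row hash.

    `key` is an assumed primary-key column; rows are plain dicts.
    """
    old = {r[key]: row_hash(r) for r in old_rows}
    new = {r[key]: row_hash(r) for r in new_rows}
    return {
        "added": sorted(k for k in new if k not in old),
        "removed": sorted(k for k in old if k not in new),
        "modified": sorted(k for k in new if k in old and new[k] != old[k]),
    }


# Hypothetical sample data: two versions of a tiny labeled dataset.
v1 = [
    {"id": 1, "text": "cat", "label": "animal"},
    {"id": 2, "text": "oak", "label": "plant"},
    {"id": 3, "text": "fern", "label": "plant"},
]
v2 = [
    {"id": 1, "text": "cat", "label": "animal"},
    {"id": 2, "text": "oak", "label": "tree"},
    {"id": 4, "text": "dog", "label": "animal"},
]

changelog = diff_datasets(v1, v2)
print(changelog)
```

Hashing whole rows keeps the comparison cheap on wide tables, at the cost of not reporting which column changed; a per-column diff would need a second pass over the modified keys.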

How to Build a Dataset Export Pipeline with Multiple Format Support

Export ML datasets to any format your pipeline needs with a unified Python conversion toolkit

How to Build a Dataset Merge and Conflict Resolution Pipeline

Merge multiple annotation sources into a single clean dataset with automated conflict resolution strategies

How to Build a Dataset Monitoring Pipeline with Great Expectations and Airflow

Build a pipeline that catches bad data before training starts using GX expectation suites and Airflow’s TaskFlow API

How to Build a Dataset Versioning Pipeline with lakeFS

Set up lakeFS locally and use its Python SDK to version, branch, diff, and merge datasets in your ML pipeline

How to Build a Feature Engineering Pipeline with Featuretools

Automate feature generation from relational data with Featuretools DFS, custom primitives, and feature selection

How to Build a Feature Importance and Selection Pipeline with Scikit-Learn

Rank and select the most predictive features for your ML models using scikit-learn’s feature importance tools

How to Build a Streaming Data Ingestion Pipeline with Apache Arrow

Ingest and process streaming data for ML with Apache Arrow’s columnar format and zero-copy IPC

How to Build a Synthetic Tabular Data Pipeline with CTGAN

Create privacy-safe synthetic datasets that preserve statistical properties using CTGAN and the SDV library

How to Build Programmatic Labeling Pipelines with Snorkel

Skip manual labeling. Use Snorkel’s weak supervision to programmatically label thousands of examples in minutes with Python code.

How to Create and Share Datasets on Hugging Face Hub

Go from raw CSV or JSON files to a published, versioned dataset that anyone can load with one line of Python

How to Handle Imbalanced Datasets for ML Training

Train accurate models on imbalanced data where rare classes matter most using resampling and loss weighting techniques

How to Process Large Datasets with Polars for ML

Replace slow pandas pipelines with Polars for fast feature engineering, aggregations, and ML-ready data transforms.

How to Stream Real-Time Data for ML with Apache Kafka

Connect live data streams to your ML models using Kafka producers, consumers, and stream processing in Python

How to Augment Training Data with Albumentations and NLP Augmenter

Build augmentation pipelines for CV and NLP datasets with practical Python examples

How to Build a Feature Store for ML with Feast

Build a production-ready feature store using Feast with entity definitions, feature views, and materialization

How to Build a Vector Database Pipeline with Pinecone

Set up a production-ready Pinecone pipeline with serverless indexes, batch upserts, metadata filtering, and cost optimization.

How to Build ETL Pipelines for ML Data with Apache Airflow

Set up Airflow with Docker and create production-ready DAGs that extract, clean, validate, and load ML training data on a schedule

How to Clean and Deduplicate ML Datasets with Python

Clean messy training data and find near-duplicate records with pandas, datasketch, and text-dedup in practical Python workflows

How to Create Synthetic Training Data with LLMs

Use Claude or GPT-4 to create labeled training data when real data is scarce or expensive

How to Implement Active Learning for Efficient Model Training

Train better models with fewer labels using uncertainty sampling, query strategies, and pool-based loops

How to Label Training Data with LLM-Assisted Annotation

Build an LLM-powered annotation pipeline that cuts labeling time and cost dramatically

How to Validate ML Datasets with Great Expectations

Set up automated data quality checks for ML datasets with GX expectation suites and checkpoints

How to Version ML Datasets with DVC

Set up DVC to version datasets, switch between data snapshots, and build reproducible ML pipelines