The Fast Path: Exact and Near-Duplicate Removal
Duplicate training examples bias your model toward repeated patterns and inflate metrics during evaluation. The fix depends on what kind of duplicates you have. Exact duplicates are trivial with pandas. Near-duplicates – rephrased text, slightly altered records – require fuzzy hashing.
Here is a complete pipeline that handles both cases on a text dataset:
Install datasketch first:
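The library is on PyPI:

```shell
pip install datasketch
```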
Setting threshold=0.8 means two documents with a Jaccard similarity of 0.8 or higher count as near-duplicates. For training data, 0.7-0.8 works well. Push it to 0.9 if you only want to catch near-identical copies.
Handling Missing Values Without Destroying Signal
Missing values in ML datasets are not just an annoyance – they crash your training loop or silently degrade predictions. The right fix depends on the column type and how much is missing.
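A sketch of one reasonable policy, assuming numeric columns get the median and everything else gets the most frequent value:

```python
import pandas as pd

def impute_missing(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    for col in df.columns:
        if not df[col].isna().any():
            continue
        if pd.api.types.is_numeric_dtype(df[col]):
            # median resists outliers better than mean
            df[col] = df[col].fillna(df[col].median())
        else:
            # most frequent value; raises if the column is entirely null
            df[col] = df[col].fillna(df[col].mode()[0])
    return df
```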
Do not blindly use mean imputation on numerical columns. If your feature has outliers (and ML features usually do), the mean gets pulled toward those extremes. Median is almost always the safer default.
A common error you will hit with fillna on categorical columns:
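For example, on an all-null column (the exact exception type varies by pandas version, so both are caught here):

```python
import pandas as pd

all_null = pd.Series([None, None], dtype="object")

try:
    all_null.fillna(all_null.mode()[0])
except (KeyError, IndexError) as err:
    # mode() on an all-null column returns an empty Series,
    # so indexing [0] blows up before fillna even runs
    print(f"{type(err).__name__}: {err}")
```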
This happens when a column is entirely null, so mode() returns an empty Series. Guard against it:
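One way to guard (the "unknown" fallback token is an arbitrary choice; pick something your downstream encoder handles):

```python
import pandas as pd

def safe_mode_fill(s: pd.Series, fallback: str = "unknown") -> pd.Series:
    mode = s.mode()
    # mode() is empty when the column is entirely null
    fill = mode.iloc[0] if not mode.empty else fallback
    return s.fillna(fill)
```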
Detecting and Removing Outliers
Outliers in feature columns shift learned decision boundaries. The IQR method catches the obvious ones without assuming your data is normally distributed:
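A minimal sketch of IQR filtering across a set of numeric columns:

```python
import pandas as pd

def remove_outliers_iqr(df: pd.DataFrame, cols,
                        factor: float = 1.5) -> pd.DataFrame:
    mask = pd.Series(True, index=df.index)
    for col in cols:
        q1, q3 = df[col].quantile(0.25), df[col].quantile(0.75)
        iqr = q3 - q1
        # keep rows inside [Q1 - factor*IQR, Q3 + factor*IQR]
        mask &= df[col].between(q1 - factor * iqr, q3 + factor * iqr)
    return df[mask]
```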
Be careful with the factor parameter. The default 1.5 flags roughly 0.7% of normally distributed data per column – and with many feature columns those trims compound. Use 3.0 if you only want to catch extreme outliers that are almost certainly data errors.
Scaling Up: text-dedup for Large Corpora
For datasets over a few hundred thousand rows, the per-row loop above gets slow. The text-dedup library wraps MinHash, SimHash, and suffix array methods into a CLI and Python API purpose-built for deduplicating large text corpora like those used for LLM training:
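An invocation looks roughly like this – the dataset path is a placeholder, and the flag names follow the project's README at the time of writing, so check `python -m text_dedup.minhash --help` for your installed version:

```shell
# MinHash near-dedup over a Hugging Face dataset
# (replace "my_dataset" with your dataset name or a local path)
python -m text_dedup.minhash \
  --path "my_dataset" \
  --split "train" \
  --column "text" \
  --threshold 0.8 \
  --cache_dir "./cache" \
  --output "./deduped"
```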
text-dedup works directly with Hugging Face datasets and handles sharding, parallel processing, and memory management. On a single machine, it can deduplicate tens of millions of documents. For exact substring dedup (catching copy-pasted paragraphs rather than whole-document duplicates), use the suffix array method:
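A sketch of the suffix array invocation. This mode wraps Google's deduplicate-text-datasets tooling, which you clone separately; the flag names and the `--k` minimum-match length are assumptions based on the project's README, so verify against `--help` for your version:

```shell
# Exact substring dedup: catches repeated spans of >= k bytes
python -m text_dedup.suffix_array \
  --path "my_dataset" \
  --split "train" \
  --column "text" \
  --k 100 \
  --google_repo_path "./deduplicate-text-datasets" \
  --output "./deduped"
```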
This is the same approach Google Research used to deduplicate C4 and other pretraining corpora.
Fixing Data Types and Inconsistent Formats
Type mismatches silently corrupt your features. A “price” column read as strings because one row has a dollar sign, or dates stored as mixed formats, will either crash your model or produce garbage features:
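A sketch of both fixes on toy data (`format="mixed"` needs pandas 2.0 or later; on older versions, drop it and pass an explicit format):

```python
import pandas as pd

df = pd.DataFrame({
    "price": ["$1,200", "350", "oops"],
    "date": ["2024-01-15", "01/16/2024", None],
})

# Strip currency formatting, then coerce; bad values become NaN
df["price"] = pd.to_numeric(
    df["price"].str.replace(r"[$,]", "", regex=True), errors="coerce"
)

# format="mixed" parses each value independently instead of
# assuming one format for the whole column
df["date"] = pd.to_datetime(df["date"], format="mixed", errors="coerce")
```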
The errors="coerce" flag is essential. Without it, pd.to_numeric raises a ValueError on the first non-numeric string and your whole pipeline stops. With it, bad values become NaN and you can handle them with the imputation code from earlier.
Common Pitfalls
Deduplicating after train/test split. Always deduplicate the full dataset first, then split. If the same record ends up in both train and test, your evaluation metrics are lying.
Using drop_duplicates on float columns. Floating-point comparison is unreliable. Two values that should be identical might differ by 1e-15 due to precision. Round first:
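For example, rounding to 6 decimal places before deduplicating (pick a precision that matches your features' meaningful scale):

```python
import pandas as pd

# 0.1 + 0.2 != 0.3 in floating point, so naive dedup keeps both rows
df = pd.DataFrame({"feature": [0.1 + 0.2, 0.3]})
assert len(df.drop_duplicates()) == 2

# rounding first collapses them into one row
deduped = df.round({"feature": 6}).drop_duplicates()
```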
Ignoring label-text mismatches. Duplicate text with different labels is a data quality issue, not just a dedup problem. Find and flag them:
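A sketch, assuming `text` and `label` column names:

```python
import pandas as pd

def flag_label_conflicts(df: pd.DataFrame, text_col: str = "text",
                         label_col: str = "label") -> pd.DataFrame:
    # texts that appear with more than one distinct label
    n_labels = df.groupby(text_col)[label_col].nunique()
    conflicting = n_labels[n_labels > 1].index
    return df[df[text_col].isin(conflicting)]
```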
These conflicting records need manual review or majority-vote resolution – dropping them outright can remove valid edge cases.
Related Guides
- How to Build a Dataset Bias Detection Pipeline with Python
- How to Validate ML Datasets with Great Expectations
- How to Build a Data Profiling and Auto-Cleaning Pipeline with Python
- How to Build a Data Reconciliation Pipeline for ML Training Sets
- How to Version ML Datasets with DVC
- How to Build a Dataset Changelog and Diff Pipeline with Python
- How to Build a Data Outlier Detection Pipeline with PyOD
- How to Create and Share Datasets on Hugging Face Hub
- How to Build a Data Freshness Monitoring Pipeline with Python
- How to Build a Dataset Merge and Conflict Resolution Pipeline