Bad data kills ML models silently. You train for hours, deploy to production, and then wonder why accuracy tanked. Nine times out of ten, the culprit is upstream data that drifted, got corrupted, or never matched your assumptions in the first place.
The fix: validate everything before it touches your training pipeline. Pydantic handles record-level validation (individual rows), and Pandera handles DataFrame-level validation (schema, distributions, cross-column constraints). Together, they catch problems at two different granularities.
Install the Dependencies
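Assuming a pip-based environment, the install step looks something like:

```shell
pip install "pydantic>=2" pandera pandas
```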
Pydantic v2 is the current stable release. Pandera works with pandas, polars, and other backends – we’ll stick with pandas here.
Define a Pydantic Model for Record Validation
Say you’re working with a customer churn dataset. Each row represents a customer with tenure, monthly charges, a contract type, and a churn label. Pydantic lets you define exactly what a valid record looks like.
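A sketch of such a model, based on the constraints described below; the class and field names (`CustomerRecord`, `tenure_months`, `monthly_charges`, `total_charges`) are illustrative:

```python
from typing import Literal

from pydantic import BaseModel, Field, ValidationError, ValidationInfo, field_validator


class CustomerRecord(BaseModel):
    customer_id: str
    tenure_months: int = Field(ge=0, le=120)   # 0-120 months: no negatives, no absurd outliers
    monthly_charges: float = Field(gt=0)
    total_charges: float = Field(ge=0)
    contract: Literal["month-to-month", "one-year", "two-year"]
    churn: Literal["yes", "no"]

    @field_validator("total_charges")
    @classmethod
    def total_must_exceed_monthly(cls, v: float, info: ValidationInfo) -> float:
        # Cross-field rule: total charges should never be less than one month's bill.
        # info.data holds already-validated fields (monthly_charges is declared first).
        monthly = info.data.get("monthly_charges")
        if monthly is not None and v < monthly:
            raise ValueError("total_charges is less than monthly_charges")
        return v


# A record that violates the cross-field rule raises a ValidationError:
try:
    CustomerRecord(
        customer_id="c0", tenure_months=6, monthly_charges=80.0,
        total_charges=30.0, contract="one-year", churn="no",
    )
except ValidationError as exc:
    print(exc.errors()[0]["msg"])
```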
A few things to note. Field(ge=0, le=120) enforces that tenure is between 0 and 120 months – no negative values, no absurd outliers. The Literal type pins contract and churn to their exact valid values. The total_must_exceed_monthly validator catches a common data bug where total charges are somehow less than a single month’s bill.
Validate Individual Records
Feed your raw data through Pydantic row by row:
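One way to sketch that loop; a compact stand-in model keeps the example short, and the column names are illustrative:

```python
from typing import Literal

import pandas as pd
from pydantic import BaseModel, Field, ValidationError


class CustomerRecord(BaseModel):
    # Compact stand-in for the full model defined earlier.
    tenure_months: int = Field(ge=0, le=120)
    monthly_charges: float = Field(gt=0)
    churn: Literal["yes", "no"]


raw = pd.DataFrame(
    {
        "tenure_months": [12, -3, 48],
        "monthly_charges": [70.5, 80.0, 99.9],
        "churn": ["no", "yes", "maybe"],
    }
)

valid_records, error_log = [], []
for idx, row in enumerate(raw.to_dict(orient="records")):
    try:
        valid_records.append(CustomerRecord.model_validate(row))
    except ValidationError as exc:
        # exc.errors() pinpoints the failing field and the violated constraint.
        error_log.append({"row": idx, "errors": exc.errors()})

clean_df = pd.DataFrame([r.model_dump() for r in valid_records])
```

Here rows 1 and 2 land in the error log (negative tenure, unknown churn label) while row 0 passes through to `clean_df`.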
This gives you a clean list of valid records and a detailed error log. Each Pydantic ValidationError tells you exactly which field failed and why, down to the constraint that was violated.
For large datasets, iterating row by row is slow. You can speed this up by validating the whole batch in one call with Pydantic's TypeAdapter (e.g. TypeAdapter(list[YourModel]).validate_python(records)), which keeps the work in Pydantic's compiled core, or by splitting the rows across processes – though note that a single bad record makes the batch call raise, so you lose the per-row valid/invalid split. But for datasets under a few hundred thousand rows, the overhead is minimal compared to what you save in debugging time.
Define a Pandera Schema for DataFrame Validation
Pydantic validates one record at a time. Pandera validates the entire DataFrame at once – column types, null counts, value ranges, and statistical properties.
The coerce=True flag tells Pandera to attempt type coercion before validation. If tenure_months arrives as a float column (common with CSV reads), Pandera will try to cast it to int first. The DataFrame-level Check at the bottom enforces cross-column relationships across all rows simultaneously.
Use Pandera DataFrameModel for a Class-Based Schema
If you prefer a declarative style (and you should – it’s cleaner), Pandera also supports class-based schemas similar to Pydantic:
The strict = True setting rejects any columns not defined in the schema. This catches silent data drift where new columns sneak in or old ones get renamed. The tenure_distribution_check is a custom statistical check – if median tenure drops below 5 months, something is probably wrong with how the data was extracted.
Build the Combined Pipeline
Here’s the full pipeline that runs both Pydantic and Pandera validation in sequence:
The order matters. Pydantic runs first because it filters out individually broken records – nulls in required fields, out-of-range values, wrong types. Pandera runs second on the cleaned DataFrame to check aggregate properties like distributions and cross-column constraints. If you reverse this, Pandera will choke on the bad rows that Pydantic would have caught.
Add Custom Distribution Checks
For ML pipelines, you often need to verify that feature distributions haven’t shifted. Pandera makes this straightforward:
These checks act as guardrails. If your churn rate suddenly hits 90%, something upstream broke – maybe a filter changed, maybe the label logic shifted. Catching this during validation is infinitely cheaper than catching it after a failed training run.
Common Errors and Fixes
ValidationError: value is not a valid integer – This happens when CSV columns load as floats (e.g., tenure_months becomes 12.0). Either coerce in pandas before validation with df["tenure_months"] = df["tenure_months"].astype(int), or set coerce=True in your Pandera schema.
SchemaError: column not in dataframe – Your CSV column names don’t match the schema. Check for trailing spaces or case differences. A quick df.columns = df.columns.str.strip().str.lower() before validation fixes most of these.
SchemaError: unexpected columns – You set strict=True but the DataFrame has extra columns. Either drop them with df = df[expected_columns] before validation, or set strict=False if extra columns are acceptable.
Pydantic field_validator not firing – Make sure you use @classmethod and @field_validator("field_name") decorators in the right order. The @field_validator decorator must come first (outermost).
Pandera @pa.check returning wrong type – Your check function must return a bool for DataFrame-level checks or a Series[bool] for element-wise column checks. Returning a scalar from a column check will silently pass.
Slow row-by-row Pydantic validation – For datasets over 500k rows, batch the validation. Split the DataFrame into chunks with np.array_split(df, 10) and validate each chunk. Or skip Pydantic for large datasets and rely on Pandera alone, since it operates on vectorized pandas operations.
Related Guides
- How to Build a Dataset Bias Detection Pipeline with Python
- How to Build a Data Schema Evolution Pipeline for ML Datasets
- How to Build a Dataset Changelog and Diff Pipeline with Python
- How to Build a Dataset Monitoring Pipeline with Great Expectations and Airflow
- How to Build a Feature Engineering Pipeline with Featuretools
- How to Build a Data Slicing and Stratification Pipeline for ML
- How to Build a Data Reconciliation Pipeline for ML Training Sets
- How to Build a Data Profiling and Auto-Cleaning Pipeline with Python
- How to Build a Data Annotation Pipeline with Argilla
- How to Build a Data Contamination Detection Pipeline for LLM Training