Bad data kills ML models silently. You train for hours, deploy to production, and then wonder why accuracy tanked. Nine times out of ten, the culprit is upstream data that drifted, got corrupted, or never matched your assumptions in the first place.
The fix: validate everything before it touches your training pipeline. Pydantic handles record-level validation (individual rows), and Pandera handles DataFrame-level validation (schema, distributions, cross-column constraints). Together, they catch problems at two different granularities.
Install the Dependencies
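Assuming a pip-based environment, the install step looks something like:

```shell
pip install "pydantic>=2" pandera pandas
```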
Pydantic v2 is the current stable release. Pandera works with pandas, polars, and other backends – we’ll stick with pandas here.
Define a Pydantic Model for Record Validation
Say you’re working with a customer churn dataset. Each row represents a customer with tenure, monthly charges, a contract type, and a churn label. Pydantic lets you define exactly what a valid record looks like.
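A sketch of such a model, based on the constraints described below; the class and field names (`CustomerRecord`, `tenure_months`, `monthly_charges`, `total_charges`) are illustrative:

```python
from typing import Literal

from pydantic import BaseModel, Field, ValidationError, ValidationInfo, field_validator


class CustomerRecord(BaseModel):
    customer_id: str
    tenure_months: int = Field(ge=0, le=120)   # 0-120 months: no negatives, no absurd outliers
    monthly_charges: float = Field(gt=0)
    total_charges: float = Field(ge=0)
    contract: Literal["month-to-month", "one-year", "two-year"]
    churn: Literal["yes", "no"]

    @field_validator("total_charges")
    @classmethod
    def total_must_exceed_monthly(cls, v: float, info: ValidationInfo) -> float:
        # Cross-field rule: total charges should never be less than one month's bill.
        # info.data holds already-validated fields (monthly_charges is declared first).
        monthly = info.data.get("monthly_charges")
        if monthly is not None and v < monthly:
            raise ValueError("total_charges is less than monthly_charges")
        return v


# A record that violates the cross-field rule raises a ValidationError:
try:
    CustomerRecord(
        customer_id="c0", tenure_months=6, monthly_charges=80.0,
        total_charges=30.0, contract="one-year", churn="no",
    )
except ValidationError as exc:
    print(exc.errors()[0]["msg"])
```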
A few things to note. Field(ge=0, le=120) enforces that tenure is between 0 and 120 months – no negative values, no absurd outliers. The Literal type pins contract and churn to their exact valid values. The total_must_exceed_monthly validator catches a common data bug where total charges are somehow less than a single month’s bill.
Validate Individual Records
Feed your raw data through Pydantic row by row:
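One way to sketch that loop; a compact stand-in model keeps the example short, and the column names are illustrative:

```python
from typing import Literal

import pandas as pd
from pydantic import BaseModel, Field, ValidationError


class CustomerRecord(BaseModel):
    # Compact stand-in for the full model defined earlier.
    tenure_months: int = Field(ge=0, le=120)
    monthly_charges: float = Field(gt=0)
    churn: Literal["yes", "no"]


raw = pd.DataFrame(
    {
        "tenure_months": [12, -3, 48],
        "monthly_charges": [70.5, 80.0, 99.9],
        "churn": ["no", "yes", "maybe"],
    }
)

valid_records, error_log = [], []
for idx, row in enumerate(raw.to_dict(orient="records")):
    try:
        valid_records.append(CustomerRecord.model_validate(row))
    except ValidationError as exc:
        # exc.errors() pinpoints the failing field and the violated constraint.
        error_log.append({"row": idx, "errors": exc.errors()})

clean_df = pd.DataFrame([r.model_dump() for r in valid_records])
```

Here rows 1 and 2 land in the error log (negative tenure, unknown churn label) while row 0 passes through to `clean_df`.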
This gives you a clean list of valid records and a detailed error log. Each Pydantic ValidationError tells you exactly which field failed and why, down to the constraint that was violated.
For large datasets, iterating row by row is slow. You can speed this up by validating the whole batch in one call with Pydantic's TypeAdapter (e.g. TypeAdapter(list[YourModel]).validate_python(records)), which keeps the work in Pydantic's compiled core, or by splitting the rows across processes – though note that a single bad record makes the batch call raise, so you lose the per-row valid/invalid split. But for datasets under a few hundred thousand rows, the overhead is minimal compared to what you save in debugging time.
Define a Pandera Schema for DataFrame Validation
Pydantic validates one record at a time. Pandera validates the entire DataFrame at once – column types, null counts, value ranges, and statistical properties.
The coerce=True flag tells Pandera to attempt type coercion before validation. If tenure_months arrives as a float column (common with CSV reads), Pandera will try to cast it to int first. The DataFrame-level Check at the bottom enforces cross-column relationships across all rows simultaneously.
Use Pandera DataFrameModel for a Class-Based Schema
If you prefer a declarative style (and you should – it’s cleaner), Pandera also supports class-based schemas similar to Pydantic:
The strict = True setting rejects any columns not defined in the schema. This catches silent data drift where new columns sneak in or old ones get renamed. The tenure_distribution_check is a custom statistical check – if median tenure drops below 5 months, something is probably wrong with how the data was extracted.
Build the Combined Pipeline
Here’s the full pipeline that runs both Pydantic and Pandera validation in sequence:
The order matters. Pydantic runs first because it filters out individually broken records – nulls in required fields, out-of-range values, wrong types. Pandera runs second on the cleaned DataFrame to check aggregate properties like distributions and cross-column constraints. If you reverse this, Pandera will choke on the bad rows that Pydantic would have caught.
Add Custom Distribution Checks
For ML pipelines, you often need to verify that feature distributions haven’t shifted. Pandera makes this straightforward:
These checks act as guardrails. If your churn rate suddenly hits 90%, something upstream broke – maybe a filter changed, maybe the label logic shifted. Catching this during validation is infinitely cheaper than catching it after a failed training run.
Common Errors and Fixes
ValidationError: value is not a valid integer – This happens when CSV columns load as floats (e.g., tenure_months becomes 12.0). Either coerce in pandas before validation with df["tenure_months"] = df["tenure_months"].astype(int), or set coerce=True in your Pandera schema.
SchemaError: column not in dataframe – Your CSV column names don’t match the schema. Check for trailing spaces or case differences. A quick df.columns = df.columns.str.strip().str.lower() before validation fixes most of these.
SchemaError: unexpected columns – You set strict=True but the DataFrame has extra columns. Either drop them with df = df[expected_columns] before validation, or set strict=False if extra columns are acceptable.
Pydantic field_validator not firing – Make sure you use @classmethod and @field_validator("field_name") decorators in the right order. The @field_validator decorator must come first (outermost).
Pandera @pa.check returning wrong type – Your check function must return a bool for DataFrame-level checks or a Series[bool] for element-wise column checks. Returning a scalar from a column check will silently pass.
Slow row-by-row Pydantic validation – For datasets over 500k rows, batch the validation. Split the DataFrame into chunks with np.array_split(df, 10) and validate each chunk. Or skip Pydantic for large datasets and rely on Pandera alone, since it operates on vectorized pandas operations.
Related Guides
- How to Build a Dataset Bias Detection Pipeline with Python
- How to Build a Data Schema Evolution Pipeline for ML Datasets
- How to Build a Dataset Changelog and Diff Pipeline with Python
- How to Build a Dataset Monitoring Pipeline with Great Expectations and Airflow
- How to Build a Feature Engineering Pipeline with Featuretools
- How to Build a Data Slicing and Stratification Pipeline for ML
- How to Build a Data Reconciliation Pipeline for ML Training Sets
- How to Build a Data Profiling and Auto-Cleaning Pipeline with Python
- How to Build a Data Annotation Pipeline with Argilla
- How to Build a Data Contamination Detection Pipeline for LLM Training