## Quick Setup and First Validation
Great Expectations (GX) 1.x is the go-to library for declarative data validation in Python. You define what your data should look like, GX checks whether it actually does, and you get a detailed report of every failure. For ML work, this catches the silent killers: null features, label imbalance, drifted distributions, and out-of-range values that tank model performance without throwing errors.
Install GX and its pandas backend:
```shell
pip install 'great_expectations[pandas]'
```
Here’s a complete validation pipeline – from a raw pandas DataFrame to a pass/fail report:
```python
import great_expectations as gx
import pandas as pd
import numpy as np

# Simulate an ML dataset
np.random.seed(42)
df = pd.DataFrame({
    "age": np.random.randint(18, 90, size=1000),
    "income": np.random.normal(55000, 15000, size=1000),
    "credit_score": np.random.randint(300, 850, size=1000),
    "label": np.random.choice([0, 1], size=1000, p=[0.7, 0.3]),
})

# Inject some bad data
df.loc[42, "income"] = None
df.loc[99, "credit_score"] = -5

# 1. Create an ephemeral data context (no filesystem config needed)
context = gx.get_context(mode="ephemeral")

# 2. Add a pandas data source and a DataFrame asset
data_source = context.data_sources.add_pandas(name="ml_data")
data_asset = data_source.add_dataframe_asset(name="training_set")

# 3. Build a batch definition
batch_definition = data_asset.add_batch_definition_whole_dataframe(
    name="full_training_batch"
)

# 4. Create an expectation suite
suite = context.suites.add(
    gx.ExpectationSuite(name="ml_training_suite")
)

# 5. Add expectations
suite.add_expectation(
    gx.expectations.ExpectColumnValuesToNotBeNull(column="age")
)
suite.add_expectation(
    gx.expectations.ExpectColumnValuesToNotBeNull(column="income")
)
suite.add_expectation(
    gx.expectations.ExpectColumnValuesToBeBetween(
        column="credit_score", min_value=300, max_value=850
    )
)
suite.add_expectation(
    gx.expectations.ExpectColumnValuesToBeBetween(
        column="income", min_value=0, max_value=500000
    )
)
suite.add_expectation(
    gx.expectations.ExpectColumnDistinctValuesToBeInSet(
        column="label", value_set=[0, 1]
    )
)

# 6. Create a validation definition and checkpoint
validation_definition = context.validation_definitions.add(
    gx.ValidationDefinition(
        name="validate_training_data",
        data=batch_definition,
        suite=suite,
    )
)
checkpoint = context.checkpoints.add(
    gx.Checkpoint(
        name="ml_checkpoint",
        validation_definitions=[validation_definition],
    )
)

# 7. Run the checkpoint with the actual dataframe
results = checkpoint.run(batch_parameters={"dataframe": df})

# 8. Inspect results (in GX 1.x, run_results maps identifiers
# directly to validation result objects)
print(f"Overall success: {results.success}")
for validation_result in results.run_results.values():
    for r in validation_result.results:
        status = "PASS" if r.success else "FAIL"
        print(f"  [{status}] {r.expectation_config.type}: {r.expectation_config.kwargs}")
```
Run this and you’ll see two failures: the null income value and the out-of-range credit score. That’s the point – GX catches exactly these problems before your model trains on garbage.
## ML-Specific Expectations
Standard null and range checks are table stakes. For ML datasets, you need to validate things that directly affect model quality.
### Feature Distribution Checks
Detect data drift by checking that feature statistics stay within expected bounds. If your income column suddenly has a mean of $200k instead of $55k, your model’s predictions will be meaningless.
```python
# Check that the mean income is roughly where you expect
suite.add_expectation(
    gx.expectations.ExpectColumnMeanToBeBetween(
        column="income", min_value=40000, max_value=70000
    )
)

# Standard deviation shouldn't explode
suite.add_expectation(
    gx.expectations.ExpectColumnStdevToBeBetween(
        column="income", min_value=5000, max_value=30000
    )
)

# Median as a robust alternative
suite.add_expectation(
    gx.expectations.ExpectColumnMedianToBeBetween(
        column="age", min_value=30, max_value=60
    )
)
```
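These GX bounds are static. If you also persist baseline statistics at training time, the same check can be made relative. Here is a minimal pandas-only sketch; the baseline numbers and the `check_drift` helper are illustrative, not part of GX:

```python
import pandas as pd

def check_drift(df: pd.DataFrame, baselines: dict, tolerance: float = 0.25) -> list[str]:
    """Flag columns whose mean drifted more than `tolerance` (relative) from baseline."""
    drifted = []
    for col, baseline_mean in baselines.items():
        current = df[col].mean()
        if abs(current - baseline_mean) / abs(baseline_mean) > tolerance:
            drifted.append(f"{col}: baseline {baseline_mean:,.0f}, now {current:,.0f}")
    return drifted

# Illustrative baseline captured at training time
baselines = {"income": 55_000, "age": 54}
drifted = check_drift(
    pd.DataFrame({"income": [200_000.0, 210_000.0], "age": [50.0, 58.0]}),
    baselines,
)
print(drifted)  # income has roughly quadrupled, so it gets flagged
```

Relative tolerances tend to travel better than absolute bounds when features are rescaled or re-denominated upstream.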
### Label Balance Validation
Class imbalance isn’t always a bug, but unexpected imbalance is. If your binary labels shift from 70/30 to 99/1, something went wrong upstream.
```python
# Note: this checks cardinality, not balance — it bounds the fraction of
# *unique* values in the column. For 1,000 rows with 2 classes, that's
# 2/1000 = 0.002, so this catches a label column exploding into many values.
suite.add_expectation(
    gx.expectations.ExpectColumnProportionOfUniqueValuesToBeBetween(
        column="label", min_value=0.001, max_value=0.01
    )
)

# This only asserts that every label falls in [0, 1]; it does not
# constrain the class proportions themselves
suite.add_expectation(
    gx.expectations.ExpectColumnValuesToBeBetween(
        column="label", min_value=0, max_value=1,
        mostly=1.0  # 100% of values must satisfy this
    )
)
```
For finer control on class proportions, combine GX with a pre-check:
```python
def check_label_balance(df: pd.DataFrame, label_col: str, min_ratio: float = 0.1):
    """Fail if any class is below min_ratio of total samples."""
    counts = df[label_col].value_counts(normalize=True)
    for cls, proportion in counts.items():
        if proportion < min_ratio:
            raise ValueError(
                f"Class {cls} is only {proportion:.1%} of data "
                f"(minimum: {min_ratio:.1%})"
            )
    return counts

# Run before GX validation
balance = check_label_balance(df, "label", min_ratio=0.15)
print(f"Label distribution:\n{balance}")
```
### Column Type and Schema Checks
Schema drift is another quiet failure mode. A feature column that switches from float to string mid-pipeline won’t throw an error in pandas – it’ll just silently wreck your model.
```python
suite.add_expectation(
    gx.expectations.ExpectColumnValuesToBeOfType(
        column="income", type_="float64"
    )
)
suite.add_expectation(
    gx.expectations.ExpectColumnValuesToBeOfType(
        column="label", type_="int64"
    )
)

# Make sure no unexpected columns snuck in (or got dropped)
suite.add_expectation(
    gx.expectations.ExpectTableColumnsToMatchSet(
        column_set=["age", "income", "credit_score", "label"],
        exact_match=True,
    )
)
```
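To see the silent failure in action, here is a pandas-only illustration (no GX needed): appending a single string value to a float column quietly downgrades the whole column to object dtype.

```python
import pandas as pd

clean = pd.DataFrame({"income": [52_000.0, 61_500.0]})
# Simulate a bad upstream parse that produced a string instead of a float
patch = pd.DataFrame({"income": ["58,000"]})

merged = pd.concat([clean, patch], ignore_index=True)
print(clean["income"].dtype)   # float64
print(merged["income"].dtype)  # object (no error raised)
```

Numeric operations on the merged column now fail or misbehave, which is exactly what the dtype expectation above would have caught at the boundary.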
## Building a Reusable Validation Pipeline
In practice, you’ll want a validation function that runs on every data load – whether that’s a new training batch, an evaluation split, or incoming inference data.
```python
import great_expectations as gx
import pandas as pd
from typing import Optional

def validate_ml_dataset(
    df: pd.DataFrame,
    suite_name: str = "ml_suite",
    expected_columns: Optional[list[str]] = None,
    label_column: str = "label",
    valid_labels: Optional[list] = None,
) -> dict:
    """Validate an ML dataset against a standard suite and return a
    summary dict describing any failures."""
    context = gx.get_context(mode="ephemeral")
    data_source = context.data_sources.add_pandas(name="source")
    asset = data_source.add_dataframe_asset(name="dataset")
    batch_def = asset.add_batch_definition_whole_dataframe(name="batch")

    suite = context.suites.add(
        gx.ExpectationSuite(name=suite_name)
    )

    # Schema checks
    if expected_columns:
        suite.add_expectation(
            gx.expectations.ExpectTableColumnsToMatchSet(
                column_set=expected_columns, exact_match=True
            )
        )

    # No nulls in any column
    for col in df.columns:
        suite.add_expectation(
            gx.expectations.ExpectColumnValuesToNotBeNull(
                column=col, mostly=0.99  # Allow up to 1% nulls
            )
        )

    # Label validation
    if valid_labels and label_column in df.columns:
        suite.add_expectation(
            gx.expectations.ExpectColumnDistinctValuesToBeInSet(
                column=label_column, value_set=valid_labels
            )
        )

    # Row count sanity check
    suite.add_expectation(
        gx.expectations.ExpectTableRowCountToBeBetween(
            min_value=100, max_value=10_000_000
        )
    )

    validation_def = context.validation_definitions.add(
        gx.ValidationDefinition(
            name="validation", data=batch_def, suite=suite
        )
    )
    checkpoint = context.checkpoints.add(
        gx.Checkpoint(name="checkpoint", validation_definitions=[validation_def])
    )
    results = checkpoint.run(batch_parameters={"dataframe": df})

    # Build summary (GX 1.x: run_results values are validation result objects)
    failures = []
    for validation_result in results.run_results.values():
        for r in validation_result.results:
            if not r.success:
                failures.append({
                    "expectation": r.expectation_config.type,
                    "kwargs": r.expectation_config.kwargs,
                    "observed": r.result,
                })
    summary = {
        "success": results.success,
        "total_expectations": sum(
            1 for vr in results.run_results.values()
            for _ in vr.results
        ),
        "failures": failures,
    }
    if not results.success:
        print(f"VALIDATION FAILED: {len(failures)} expectation(s) broken")
        for f in failures:
            print(f"  - {f['expectation']}: {f['kwargs']}")
    return summary

# Usage
result = validate_ml_dataset(
    df=df,
    expected_columns=["age", "income", "credit_score", "label"],
    label_column="label",
    valid_labels=[0, 1],
)
```
This is the pattern I’d recommend for any ML project. Wrap your expectations in a function, call it at every data boundary (after loading, after preprocessing, before training), and fail loud.
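As a sketch of that fail-loud pattern at each boundary, here is a stripped-down stand-in; `validate_or_die` and its thresholds are hypothetical placeholders for the full GX suite:

```python
import math
import pandas as pd

def validate_or_die(df: pd.DataFrame, stage: str) -> pd.DataFrame:
    """Hypothetical minimal gate: raise immediately instead of limping on."""
    problems = []
    if df.empty:
        problems.append("dataframe is empty")
    null_rate = df.isna().mean()
    problems += [f"{c}: {r:.1%} nulls" for c, r in null_rate[null_rate > 0.01].items()]
    if problems:
        raise ValueError(f"[{stage}] validation failed: {problems}")
    return df

raw = pd.DataFrame({"income": [48_000.0, 72_000.0, 55_500.0]})
data = validate_or_die(raw, "post-ingestion")            # boundary 1: after loading
data["log_income"] = data["income"].map(math.log)
data = validate_or_die(data, "post-feature-engineering")  # boundary 2: after transforms
# boundary 3 would run right before model.fit(...)
```

The point of raising rather than logging is that a broken batch never reaches training: the pipeline stops at the first boundary where the data stops matching expectations.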
## Common Errors and Fixes
### `ModuleNotFoundError: No module named 'great_expectations'`
You installed it but your virtual environment isn’t activated, or you have multiple Python versions.
```shell
# Verify the install
python -m pip show great_expectations

# If missing, install explicitly in the right env
python -m pip install 'great_expectations[pandas]'
```
### `TypeError: get_context() got an unexpected keyword argument 'mode'`
You’re running GX 0.x, not 1.x. The `mode="ephemeral"` parameter was introduced in GX 1.0. Upgrade:
```shell
pip install --upgrade great_expectations
```
If you’re stuck on 0.x for some reason, use `gx.get_context()` without the `mode` argument – but be aware the entire API surface changed between 0.x and 1.x.
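When multiple environments are in play, it helps to confirm which version the current interpreter actually resolves. A stdlib check that only reads package metadata:

```python
from importlib import metadata

try:
    version = metadata.version("great_expectations")
    major = int(version.split(".")[0])
    print(f"great_expectations {version} (1.x API: {major >= 1})")
except metadata.PackageNotFoundError:
    print("great_expectations is not installed in this environment")
```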
### `DataSourceNotFoundError` or `add_pandas` Not Found
In GX 0.x, you used `context.sources.add_pandas()`. In GX 1.x, it’s `context.data_sources.add_pandas()` – the attribute was renamed.
### `ExpectationSuiteNotFoundError` When Running Checkpoint
Your suite must be added to the context via `context.suites.add()` before referencing it in a `ValidationDefinition`. If you create a suite object without adding it, the checkpoint can’t find it.
```python
# Wrong: suite exists only as a local variable
suite = gx.ExpectationSuite(name="my_suite")
suite.add_expectation(...)

# Right: suite is registered with the context
suite = context.suites.add(gx.ExpectationSuite(name="my_suite"))
suite.add_expectation(...)
```
### `mostly` Parameter Confusion
The `mostly` parameter is a fraction between 0 and 1, not a percentage. Setting `mostly=99` is out of range (GX expects a value in [0, 1]) and will not mean “99% of rows.” Use `mostly=0.99` for “99% of rows must pass.”
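You can sanity-check the semantics yourself: a `mostly` threshold amounts to comparing the passing fraction of rows against a float in [0, 1].

```python
import pandas as pd

# 100 rows with exactly one null: 99% of values are non-null
s = pd.Series([None] + [1.0] * 99)
non_null_fraction = s.notna().mean()

print(non_null_fraction)           # 0.99
print(non_null_fraction >= 0.99)   # True: passes with mostly=0.99
print(non_null_fraction >= 1.0)    # False: mostly=1.0 tolerates no nulls
```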
### Slow Validation on Large DataFrames
If you’re validating millions of rows and it’s slow, sample first:
```python
sample = df.sample(n=50000, random_state=42)
results = checkpoint.run(batch_parameters={"dataframe": sample})
```
This is fine for distribution checks and null rate estimates. Don’t sample for schema or unique value checks – those need the full dataset.
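One way to split the work, as a pandas-only sketch (`two_tier_check` and its thresholds are illustrative, not a GX API): run column and dtype checks on the full frame, where they only touch metadata, and estimate distribution statistics from the sample.

```python
import pandas as pd

def two_tier_check(df: pd.DataFrame, expected_dtypes: dict[str, str],
                   n: int = 50_000) -> pd.DataFrame:
    """Schema on the full frame (cheap), distribution stats on a sample (fast)."""
    # Tier 1: column and dtype checks never need sampling
    for col, dtype in expected_dtypes.items():
        if col not in df.columns:
            raise KeyError(f"missing column: {col}")
        if str(df[col].dtype) != dtype:
            raise TypeError(f"{col}: expected {dtype}, got {df[col].dtype}")
    # Tier 2: summary statistics estimated from a sample
    return df.sample(n=min(n, len(df)), random_state=42).describe()

stats = two_tier_check(
    pd.DataFrame({"income": [48_000.0, 52_000.0, 61_000.0]}),
    expected_dtypes={"income": "float64"},
)
print(stats.loc["mean", "income"])
```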
## When to Validate
Three places in your ML pipeline where validation pays for itself:
1. **After data ingestion** – before any preprocessing. Catch upstream schema changes, missing columns, and corrupted records immediately.
2. **After feature engineering** – verify that your transforms didn’t introduce NaNs, infinities, or type changes. A log transform on a column with zeros produces -inf silently.
3. **Before model training** – the final gate. Check row counts, label distributions, and feature ranges against your training baseline. If something looks different from what the model was designed for, stop and investigate.
Skip validation during inference only if latency is critical and you’ve validated the upstream pipeline thoroughly. For batch inference, always validate.