Manual feature engineering is tedious and error-prone. You stare at tables, brainstorm aggregations, write groupby after groupby, and hope you didn’t miss an important signal. Featuretools automates this. You describe your data’s structure – which tables exist and how they relate – and it generates hundreds of meaningful features in seconds using deep feature synthesis (DFS).

Install Featuretools and the dependencies you’ll need:

pip install featuretools pandas scikit-learn

Create an EntitySet from DataFrames

An EntitySet is Featuretools’ representation of your relational data. Think of it as an in-memory database: you add DataFrames as tables, declare primary keys, and define foreign key relationships.

Here’s a realistic e-commerce dataset with customers, orders, and order products:

import pandas as pd
import featuretools as ft

# Customers table
customers_df = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5],
    "signup_date": pd.to_datetime([
        "2024-01-10", "2024-03-15", "2024-06-01", "2024-07-20", "2024-09-05"
    ]),
    "country": ["US", "UK", "US", "DE", "UK"],
})

# Orders table
orders_df = pd.DataFrame({
    "order_id": range(1, 11),
    "customer_id": [1, 1, 2, 3, 3, 3, 4, 4, 5, 5],
    "order_date": pd.to_datetime([
        "2024-05-01", "2024-06-15", "2024-07-01", "2024-08-10",
        "2024-09-05", "2024-10-12", "2024-10-15", "2024-11-01",
        "2024-11-20", "2024-12-01",
    ]),
    "total_amount": [120.50, 45.00, 200.00, 89.99, 150.00,
                     310.00, 55.00, 78.50, 430.00, 62.00],
})

# Order products table
products_df = pd.DataFrame({
    "product_id": range(1, 16),
    "order_id": [1, 1, 2, 3, 3, 4, 5, 5, 6, 6, 7, 8, 9, 9, 10],
    "product_name": [
        "Laptop", "Mouse", "Keyboard", "Monitor", "Cable",
        "Headphones", "Webcam", "USB Hub", "Chair", "Desk Lamp",
        "Tablet", "Charger", "SSD", "RAM", "Speaker",
    ],
    "unit_price": [999.99, 25.00, 45.00, 179.99, 12.00,
                   89.99, 65.00, 30.00, 249.99, 45.00,
                   55.00, 35.00, 120.00, 89.00, 62.00],
    "quantity": [1, 2, 1, 1, 3, 1, 1, 2, 1, 1, 1, 2, 1, 1, 1],
})

# Build the EntitySet
es = ft.EntitySet(id="ecommerce")

es = es.add_dataframe(
    dataframe_name="customers",
    dataframe=customers_df,
    index="customer_id",
    time_index="signup_date",
)

es = es.add_dataframe(
    dataframe_name="orders",
    dataframe=orders_df,
    index="order_id",
    time_index="order_date",
)

es = es.add_dataframe(
    dataframe_name="products",
    dataframe=products_df,
    index="product_id",
)

# Define relationships: customers -> orders -> products
es = es.add_relationship("customers", "customer_id", "orders", "customer_id")
es = es.add_relationship("orders", "order_id", "products", "order_id")

print(es)

The time_index parameter tells Featuretools which column records when each row occurred. This is critical – when you supply cutoff times, DFS uses the time index to avoid data leakage by computing features only from events that happened before each cutoff.

Run Deep Feature Synthesis

DFS walks your entity relationships and automatically builds features by stacking aggregation and transform primitives. Aggregation primitives (like mean, count, max) roll up child data to parent level. Transform primitives (like month, year, weekday) create new columns within a single table.

# Generate features for the customers table
feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_dataframe_name="customers",
    agg_primitives=["mean", "sum", "count", "max", "min", "std", "num_unique"],
    trans_primitives=["month", "year", "weekday"],
    max_depth=2,
    verbose=True,
)

print(f"Generated {len(feature_defs)} features")
print(feature_matrix.head())

Setting max_depth=2 means Featuretools stacks primitives up to two levels deep. At depth 1 you get things like MEAN(orders.total_amount). At depth 2 you get features like MEAN(orders.SUM(products.unit_price)) – the average per-order sum of product prices for each customer. Increasing depth generates exponentially more features, but depth 2 covers most useful patterns.

You can also restrict which primitives apply to specific columns:

feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_dataframe_name="customers",
    agg_primitives=["mean", "sum", "count"],
    trans_primitives=["month", "weekday"],
    max_depth=2,
    primitive_options={
        "mean": {"include_dataframes": ["orders"]},
        "sum": {"include_columns": {"orders": ["total_amount"]}},
    },
)

This limits mean to only aggregate from the orders table and sum to only apply to total_amount.

Custom Primitives and Feature Selection

Sometimes the built-in primitives don’t capture your domain knowledge. You can define custom aggregation or transform primitives by subclassing AggregationPrimitive or TransformPrimitive.

from featuretools.primitives import AggregationPrimitive
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Double

class MeanAbsoluteDeviation(AggregationPrimitive):
    """Computes the mean absolute deviation of a numeric column."""
    name = "mean_absolute_deviation"
    input_types = [ColumnSchema(semantic_tags={"numeric"})]
    return_type = ColumnSchema(logical_type=Double)

    def get_function(self):
        def mad(values):
            return (values - values.mean()).abs().mean()
        return mad

# Use the custom primitive in DFS
feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_dataframe_name="customers",
    agg_primitives=["mean", "sum", "count", MeanAbsoluteDeviation],
    trans_primitives=["month"],
    max_depth=2,
)

print(f"Features with custom primitive: {len(feature_defs)}")

After generating features, drop redundant ones. Highly correlated features add noise and slow down training without improving accuracy:

from featuretools.selection import remove_highly_correlated_features

# Drop features with correlation > 0.95
reduced_matrix = remove_highly_correlated_features(feature_matrix, pct_corr_threshold=0.95)
print(f"Before: {feature_matrix.shape[1]} features")
print(f"After:  {reduced_matrix.shape[1]} features")

Featuretools also provides remove_highly_null_features and remove_single_value_features for additional cleanup:

from featuretools.selection import (
    remove_highly_null_features,
    remove_single_value_features,
)

cleaned = remove_highly_null_features(reduced_matrix, pct_null_threshold=0.5)
cleaned = remove_single_value_features(cleaned)
print(f"Final feature count: {cleaned.shape[1]}")

Save Features and Integrate with Sklearn

Once you have a good feature set, save the definitions so you can reapply the same transformations to new data without rerunning DFS from scratch.

# Save feature definitions
ft.save_features(feature_defs, "feature_definitions.json")

# Load them back later
loaded_defs = ft.load_features("feature_definitions.json")

# Calculate the feature matrix from saved definitions
new_feature_matrix = ft.calculate_feature_matrix(
    features=loaded_defs,
    entityset=es,
)

Plug the feature matrix directly into an sklearn pipeline:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
import numpy as np

# Synthetic target for demonstration: one label per customer
y = np.array([1, 0, 1, 1, 0])

# Fill NaN values that DFS may produce
X = cleaned.fillna(0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=42
)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", RandomForestClassifier(n_estimators=100, random_state=42)),
])

pipeline.fit(X_train, y_train)
score = pipeline.score(X_test, y_test)
print(f"Test accuracy: {score:.2f}")

For production workflows, wrap the EntitySet construction and DFS call in a function so you can rerun the full pipeline on fresh data:

def build_features(customers, orders, products):
    """End-to-end feature engineering pipeline."""
    es = ft.EntitySet(id="ecommerce")
    es = es.add_dataframe(
        dataframe_name="customers", dataframe=customers,
        index="customer_id", time_index="signup_date",
    )
    es = es.add_dataframe(
        dataframe_name="orders", dataframe=orders,
        index="order_id", time_index="order_date",
    )
    es = es.add_dataframe(
        dataframe_name="products", dataframe=products,
        index="product_id",
    )
    es = es.add_relationship("customers", "customer_id", "orders", "customer_id")
    es = es.add_relationship("orders", "order_id", "products", "order_id")

    feature_matrix, feature_defs = ft.dfs(
        entityset=es,
        target_dataframe_name="customers",
        agg_primitives=["mean", "sum", "count", "max", "min", "std"],
        trans_primitives=["month", "year"],
        max_depth=2,
    )
    reduced = remove_highly_correlated_features(feature_matrix, pct_corr_threshold=0.95)
    return reduced.fillna(0), feature_defs

Common Errors and Fixes

KeyError: 'column_name' not found in dataframe – This happens when your relationship references a column that doesn’t exist in one of the DataFrames. Double-check column names match exactly, including case.

TypeError: time_index must be a Datetime or numeric column – Featuretools expects time_index columns to be proper datetime types. Convert them first with pd.to_datetime() before adding the DataFrame to the EntitySet.
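For example, a date column loaded from CSV usually arrives as plain strings and needs converting first (illustrative snippet):

```python
import pandas as pd

# Dates read from CSV come in as strings, not datetimes
orders_df = pd.DataFrame({
    "order_id": [1, 2],
    "order_date": ["2024-05-01", "2024-06-15"],
})

# Convert before add_dataframe so the time_index is a real datetime
orders_df["order_date"] = pd.to_datetime(orders_df["order_date"])
print(orders_df["order_date"].dtype)  # datetime64[ns]
```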

Feature matrix has too many NaN values – This is normal when aggregations don’t apply to certain rows (a customer with no orders gets NaN for MEAN(orders.total_amount)). Use fillna(0) or a more domain-appropriate imputation strategy before feeding features to your model.
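A sketch of what “domain-appropriate” can mean in practice, using a hypothetical two-column feature matrix: a missing count really is zero, while a missing average is better filled with a typical value:

```python
import numpy as np
import pandas as pd

# Hypothetical feature matrix: customer 2 has no orders
fm = pd.DataFrame({
    "COUNT(orders)": [3.0, np.nan, 1.0],
    "MEAN(orders.total_amount)": [120.0, np.nan, 60.0],
}, index=[1, 2, 3])

# No orders means the count is genuinely zero
fm["COUNT(orders)"] = fm["COUNT(orders)"].fillna(0)

# An unknown average is better imputed with a typical value
col = "MEAN(orders.total_amount)"
fm[col] = fm[col].fillna(fm[col].median())

print(fm.loc[2])  # COUNT(orders)=0.0, MEAN(orders.total_amount)=90.0
```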

DFS runs out of memory on large datasets – Lower max_depth to 1, reduce the number of primitives, or pass chunk_size (e.g. ft.dfs(..., chunk_size=100)) to compute the feature matrix in smaller batches. You can also pass cutoff_time, optionally with training_window, to limit the time window DFS considers.

duplicate index error when adding a DataFrame – Every DataFrame needs a unique index column. If your table doesn’t have one, let Featuretools create it by passing a new column name along with make_index=True, e.g. es.add_dataframe(..., index="row_id", make_index=True), or deduplicate your existing ID column.