Install Feast and Scaffold a Project
Feast gives you a single place to define features, pull historical data for training, and serve fresh features at inference time. Instead of scattering feature logic across notebooks, training scripts, and serving code, you declare features once and Feast handles the rest.
Install Feast and bootstrap a new project:
This creates a directory structure like:
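With a recent Feast release, the scaffold typically looks like this (file names vary slightly by version; driver_stats.parquet is the bundled demo data):

```
my_project/
└── feature_repo/
    ├── feature_store.yaml      # store configuration
    ├── example_repo.py         # example entity and feature view definitions
    └── data/
        └── driver_stats.parquet
```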
The feature_store.yaml controls where your data lives. For local development, start with SQLite:
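A minimal local configuration, close to what feast init generates (my_project is a placeholder):

```yaml
project: my_project
registry: data/registry.db      # file-based metadata catalog
provider: local
online_store:
  type: sqlite
  path: data/online_store.db
```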
For production, you’d swap provider: local for provider: gcp or provider: aws and point the online store at Redis, DynamoDB, or Datastore. But SQLite works perfectly for getting your definitions right before deploying.
Define Entities and Feature Views
The core of Feast is the feature definition file – a plain Python module where you declare entities, data sources, and feature views. Feast scans these files when you run feast apply.
Here’s a realistic example for a ride-sharing ML model that predicts driver performance:
A few things to note here:
- Entity defines what your features are keyed on. The join_keys list maps to columns in your data source.
- ttl (time-to-live) tells Feast how stale a feature value can be before it's considered expired. Set this based on how frequently your data updates.
- schema uses Feast's type system (Float32, Int64, String, etc.), not pandas dtypes.
- online=True means this feature view participates in materialization to the online store.
Register and Deploy
Once your definitions are ready, apply them:
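Run this from inside the feature repository directory (the one containing feature_store.yaml):

```shell
feast apply
```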
Feast prints a summary of what it registered and deployed: the entities, the feature views, and the online store tables it provisioned (exact wording varies by version).
Feast registers your entities and feature views in the registry (a metadata catalog) and provisions the online store tables. If you change a definition – say, add a new field – running feast apply again updates the registry and migrates the infrastructure.
If something is wrong with your definitions, Feast tells you early. A missing timestamp_field in your source fails validation with an error pointing at the offending data source, and referencing an entity that doesn't exist raises a registry error naming the unknown entity.
These errors surface at apply time, not at serving time, which saves you from debugging production failures.
Build Training Datasets with Historical Features
The offline store is where you pull point-in-time correct training data. “Point-in-time correct” means Feast joins features to your entity dataframe using the event timestamp, so you never leak future data into your training set.
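A sketch of pulling training data for the driver_hourly_stats view defined earlier; the driver IDs, timestamps, and feature names are illustrative:

```python
from datetime import datetime

import pandas as pd
from feast import FeatureStore

store = FeatureStore(repo_path=".")  # directory containing feature_store.yaml

# Entity dataframe: one row per (entity, timestamp) you want features for
entity_df = pd.DataFrame(
    {
        "driver_id": [1001, 1002, 1003],
        "event_timestamp": [
            datetime(2024, 5, 10, 10, 0),
            datetime(2024, 5, 10, 11, 0),
            datetime(2024, 5, 10, 12, 0),
        ],
    }
)

# Point-in-time join: each row gets the latest feature values
# as of its own event_timestamp, never later ones
training_df = store.get_historical_features(
    entity_df=entity_df,
    features=[
        "driver_hourly_stats:conv_rate",
        "driver_hourly_stats:acc_rate",
        "driver_hourly_stats:avg_daily_trips",
    ],
).to_df()
```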
The features parameter uses the format feature_view_name:feature_name. If you reference a feature view that doesn't exist, Feast raises a clear error naming the unknown view. Double-check the feature view name matches what you defined – it's driver_hourly_stats, not driver_stats.
Materialize Features for Online Serving
Training uses the offline store. Inference uses the online store. Materialization bridges the two by snapshotting the latest feature values from your offline source into the online store.
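For example, to materialize everything up to the current time (UTC):

```shell
# Load the latest feature values into the online store,
# from the last materialization point up to the given end timestamp
feast materialize-incremental "$(date -u +%Y-%m-%dT%H:%M:%S)"
```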
materialize-incremental is the command you want in production. It tracks the last materialization timestamp and only processes new data on each run. The alternative, feast materialize <start> <end>, reprocesses the full time range every time and is mainly useful for backfills.
Once materialized, serve features at low latency:
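A sketch of an online lookup against the same feature view; the driver ID is illustrative:

```python
from feast import FeatureStore

store = FeatureStore(repo_path=".")

# Low-latency lookup, keyed on the entity's join key
feature_vector = store.get_online_features(
    features=[
        "driver_hourly_stats:conv_rate",
        "driver_hourly_stats:avg_daily_trips",
    ],
    entity_rows=[{"driver_id": 1001}],  # key name must match join_keys exactly
).to_dict()
```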
If you query an entity that hasn’t been materialized, Feast returns None values for the features instead of raising an error. This is by design – your serving code should handle missing features gracefully.
Common Pitfalls and Fixes
Stale online features after data updates. You updated your parquet file but get_online_features returns old values. You need to run materialize-incremental again. The online store doesn’t auto-sync – materialization is an explicit step. In production, schedule it with Airflow, a cron job, or your orchestrator of choice.
Entity dataframe missing required columns. When calling get_historical_features, your entity dataframe must contain all join key columns (driver_id) plus an event_timestamp column. Missing either one gives you a KeyError during the join.
TTL causing null features. If your ttl is set to timedelta(days=1) but your data is 3 days old, the features expire and you get None back. During development, set a generous TTL or use timedelta(days=0) to disable expiration entirely.
Wrong entity key in online requests. Passing {"id": 1001} instead of {"driver_id": 1001} won’t raise a validation error in all Feast versions. You’ll just get null features back silently. Always match the exact join_keys name from your entity definition.
Moving to Production
For production deployments, swap out the local components:
- Offline store: Use BigQuery, Snowflake, Redshift, or Spark instead of file-based sources
- Online store: Use Redis, DynamoDB, or Datastore for sub-millisecond lookups
- Registry: Use a SQL-backed registry (PostgreSQL, MySQL) for multi-team access
Your feature_store.yaml for a GCP production setup would look like:
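A sketch, with the registry bucket, Redis host, and BigQuery dataset as placeholders:

```yaml
project: my_project
provider: gcp
registry: gs://my-feast-bucket/registry.db   # placeholder GCS bucket
online_store:
  type: redis
  connection_string: my-redis-host:6379      # placeholder Redis host
offline_store:
  type: bigquery
  dataset: feast_offline                     # placeholder BigQuery dataset
```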
The key advantage is that none of your Python feature definitions change. The same FeatureView and Entity objects work regardless of whether you’re running locally with SQLite or in production with Redis and BigQuery. That’s the whole point of the abstraction.
Related Guides
- How to Build a Feature Engineering Pipeline with Featuretools
- How to Stream Real-Time Data for ML with Apache Kafka
- How to Version ML Datasets with DVC
- How to Validate ML Datasets with Great Expectations
- How to Build a Dataset Monitoring Pipeline with Great Expectations and Airflow
- How to Build ETL Pipelines for ML Data with Apache Airflow
- How to Build a Data Drift Detection Pipeline with Whylogs
- How to Build a Feature Importance and Selection Pipeline with Scikit-Learn
- How to Build a Data Contamination Detection Pipeline for LLM Training
- How to Build a Dataset Changelog and Diff Pipeline with Python