Install Feast and Scaffold a Project
Feast gives you a single place to define features, pull historical data for training, and serve fresh features at inference time. Instead of scattering feature logic across notebooks, training scripts, and serving code, you declare features once and Feast handles the rest.
Install Feast and bootstrap a new project:
This creates a directory structure like:
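With a recent Feast release, the scaffold typically looks like this (file names vary slightly by version; driver_stats.parquet is the bundled demo data):

```
my_project/
└── feature_repo/
    ├── feature_store.yaml      # store configuration
    ├── example_repo.py         # example entity and feature view definitions
    └── data/
        └── driver_stats.parquet
```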
The feature_store.yaml controls where your data lives. For local development, start with SQLite:
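A minimal local configuration, close to what feast init generates (my_project is a placeholder):

```yaml
project: my_project
registry: data/registry.db      # file-based metadata catalog
provider: local
online_store:
  type: sqlite
  path: data/online_store.db
```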
For production, you’d swap provider: local for provider: gcp or provider: aws and point the online store at Redis, DynamoDB, or Datastore. But SQLite works perfectly for getting your definitions right before deploying.
Define Entities and Feature Views
The core of Feast is the feature definition file – a plain Python module where you declare entities, data sources, and feature views. Feast scans these files when you run feast apply.
Here’s a realistic example for a ride-sharing ML model that predicts driver performance:
A few things to note here:
- Entity defines what your features are keyed on. The join_keys list maps to columns in your data source.
- ttl (time-to-live) tells Feast how stale a feature value can be before it's considered expired. Set this based on how frequently your data updates.
- schema uses Feast's type system (Float32, Int64, String, etc.), not pandas dtypes.
- online=True means this feature view participates in materialization to the online store.
Register and Deploy
Once your definitions are ready, apply them:
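Run this from inside the feature repository directory (the one containing feature_store.yaml):

```shell
feast apply
```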
Feast prints a summary of what it registered and deployed: the entities, the feature views, and the online store tables it provisioned (exact wording varies by version).
Feast registers your entities and feature views in the registry (a metadata catalog) and provisions the online store tables. If you change a definition – say, add a new field – running feast apply again updates the registry and migrates the infrastructure.
If something is wrong with your definitions, Feast tells you early. A missing timestamp_field in your source fails validation with an error pointing at the offending data source, and referencing an entity that doesn't exist raises a registry error naming the unknown entity.
These errors surface at apply time, not at serving time, which saves you from debugging production failures.
Build Training Datasets with Historical Features
The offline store is where you pull point-in-time correct training data. “Point-in-time correct” means Feast joins features to your entity dataframe using the event timestamp, so you never leak future data into your training set.
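A sketch of pulling training data for the driver_hourly_stats view defined earlier; the driver IDs, timestamps, and feature names are illustrative:

```python
from datetime import datetime

import pandas as pd
from feast import FeatureStore

store = FeatureStore(repo_path=".")  # directory containing feature_store.yaml

# Entity dataframe: one row per (entity, timestamp) you want features for
entity_df = pd.DataFrame(
    {
        "driver_id": [1001, 1002, 1003],
        "event_timestamp": [
            datetime(2024, 5, 10, 10, 0),
            datetime(2024, 5, 10, 11, 0),
            datetime(2024, 5, 10, 12, 0),
        ],
    }
)

# Point-in-time join: each row gets the latest feature values
# as of its own event_timestamp, never later ones
training_df = store.get_historical_features(
    entity_df=entity_df,
    features=[
        "driver_hourly_stats:conv_rate",
        "driver_hourly_stats:acc_rate",
        "driver_hourly_stats:avg_daily_trips",
    ],
).to_df()
```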
The features parameter uses the format feature_view_name:feature_name. If you reference a feature view that doesn't exist, Feast raises a clear error naming the unknown view. Double-check the feature view name matches what you defined – it's driver_hourly_stats, not driver_stats.
Materialize Features for Online Serving
Training uses the offline store. Inference uses the online store. Materialization bridges the two by snapshotting the latest feature values from your offline source into the online store.
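For example, to materialize everything up to the current time (UTC):

```shell
# Load the latest feature values into the online store,
# from the last materialization point up to the given end timestamp
feast materialize-incremental "$(date -u +%Y-%m-%dT%H:%M:%S)"
```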
materialize-incremental is the command you want in production. It tracks the last materialization timestamp and only processes new data on each run. The alternative, feast materialize <start> <end>, reprocesses the full time range every time and is mainly useful for backfills.
Once materialized, serve features at low latency:
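A sketch of an online lookup against the same feature view; the driver ID is illustrative:

```python
from feast import FeatureStore

store = FeatureStore(repo_path=".")

# Low-latency lookup, keyed on the entity's join key
feature_vector = store.get_online_features(
    features=[
        "driver_hourly_stats:conv_rate",
        "driver_hourly_stats:avg_daily_trips",
    ],
    entity_rows=[{"driver_id": 1001}],  # key name must match join_keys exactly
).to_dict()
```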
If you query an entity that hasn’t been materialized, Feast returns None values for the features instead of raising an error. This is by design – your serving code should handle missing features gracefully.
Common Pitfalls and Fixes
Stale online features after data updates. You updated your parquet file but get_online_features returns old values. You need to run materialize-incremental again. The online store doesn’t auto-sync – materialization is an explicit step. In production, schedule it with Airflow, a cron job, or your orchestrator of choice.
Entity dataframe missing required columns. When calling get_historical_features, your entity dataframe must contain all join key columns (driver_id) plus an event_timestamp column. Missing either one gives you a KeyError during the join.
TTL causing null features. If your ttl is set to timedelta(days=1) but your data is 3 days old, the features expire and you get None back. During development, set a generous TTL or use timedelta(days=0) to disable expiration entirely.
Wrong entity key in online requests. Passing {"id": 1001} instead of {"driver_id": 1001} won’t raise a validation error in all Feast versions. You’ll just get null features back silently. Always match the exact join_keys name from your entity definition.
Moving to Production
For production deployments, swap out the local components:
- Offline store: Use BigQuery, Snowflake, Redshift, or Spark instead of file-based sources
- Online store: Use Redis, DynamoDB, or Datastore for sub-millisecond lookups
- Registry: Use a SQL-backed registry (PostgreSQL, MySQL) for multi-team access
Your feature_store.yaml for a GCP production setup would look like:
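A sketch, with the registry bucket, Redis host, and BigQuery dataset as placeholders:

```yaml
project: my_project
provider: gcp
registry: gs://my-feast-bucket/registry.db   # placeholder GCS bucket
online_store:
  type: redis
  connection_string: my-redis-host:6379      # placeholder Redis host
offline_store:
  type: bigquery
  dataset: feast_offline                     # placeholder BigQuery dataset
```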
The key advantage is that none of your Python feature definitions change. The same FeatureView and Entity objects work regardless of whether you’re running locally with SQLite or in production with Redis and BigQuery. That’s the whole point of the abstraction.
Related Guides
- How to Build a Feature Engineering Pipeline with Featuretools
- How to Stream Real-Time Data for ML with Apache Kafka
- How to Version ML Datasets with DVC
- How to Validate ML Datasets with Great Expectations
- How to Build a Dataset Monitoring Pipeline with Great Expectations and Airflow
- How to Build ETL Pipelines for ML Data with Apache Airflow
- How to Build a Data Drift Detection Pipeline with Whylogs
- How to Build a Feature Importance and Selection Pipeline with Scikit-Learn
- How to Build a Data Contamination Detection Pipeline for LLM Training
- How to Build a Dataset Changelog and Diff Pipeline with Python