The Setup: Instrument, Scrape, Visualize
You need three things: a FastAPI model server that exposes Prometheus metrics, a Prometheus instance that scrapes them, and Grafana dashboards that make those numbers useful. The whole stack runs in Docker Compose and takes about 20 minutes to wire up.
Here is the instrumented FastAPI server. This is the core of the whole system – every other component just reads from what this exposes.
The key decisions here: use a Histogram for latency (not a Summary) because histograms let Prometheus compute arbitrary percentiles server-side. The prediction value histogram tracks output distribution – when that shape changes, your model or your input data has drifted. The Gauge for feature drift lets you push a computed drift score that Grafana can threshold on.
Prometheus Configuration
Prometheus needs to know where to scrape. Create prometheus/prometheus.yml:
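A minimal config along these lines — the job name is an arbitrary label, and the target hostname assumes the model service is called model-server in Compose (see the troubleshooting note below):

```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - /etc/prometheus/alerts.yml

scrape_configs:
  - job_name: model-server
    static_configs:
      - targets: ["model-server:8000"]
```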
The 15-second scrape interval is a good default. Going lower than 10 seconds creates a lot of storage churn for marginal benefit. Going higher than 30 seconds means you miss short latency spikes.
Alerting Rules
This is where monitoring turns into something actionable. Create prometheus/alerts.yml:
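A sketch of the rules file, assuming the metric names from the instrumented server above. The latency threshold and severity labels are placeholders; the distribution-shift rule compares the current 1h median against the same window 24 hours earlier:

```yaml
groups:
  - name: model-serving
    rules:
      - alert: HighP95Latency
        expr: |
          histogram_quantile(0.95,
            sum by (le) (rate(model_prediction_latency_seconds_bucket[5m]))
          ) > 0.5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "p95 prediction latency above 500ms"

      - alert: PredictionDistributionShift
        expr: |
          abs(
            histogram_quantile(0.5,
              sum by (le) (rate(model_prediction_value_bucket[1h])))
            -
            histogram_quantile(0.5,
              sum by (le) (rate(model_prediction_value_bucket[1h] offset 24h)))
          ) > 0.15
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Median prediction shifted vs. 24h ago"
```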
The PredictionDistributionShift alert is the most interesting one. It compares the median prediction right now against the median from 24 hours ago. A shift of 0.15 on a 0-1 scale is significant enough to investigate but not so sensitive that normal traffic variation triggers it. Tune this threshold based on your model’s output range.
Docker Compose Stack
Wire everything together with Compose:
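A bare-bones docker-compose.yml along these lines — the build context for the model server and the image tags are assumptions; pin versions in a real deployment:

```yaml
services:
  model-server:
    build: .
    ports:
      - "8000:8000"

  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - ./prometheus/alerts.yml:/etc/prometheus/alerts.yml
    ports:
      - "9090:9090"

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
```

Compose puts all three services on one default network, so Prometheus can reach the model server by its service name.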
Start everything with docker compose up -d. Hit http://localhost:9090/targets to confirm Prometheus is scraping the model server. Then open Grafana at http://localhost:3000, add Prometheus as a data source (URL: http://prometheus:9090), and start building panels.
Grafana Dashboard Queries
Here are the PromQL queries you want on your dashboard. These go directly into Grafana panel query fields.
Request rate (requests per second, by status):
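Assuming a counter named model_requests_total with a status label, something like:

```promql
sum by (status) (rate(model_requests_total[5m]))
```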
P50 / P95 / P99 latency:
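One query per percentile, assuming a latency histogram named model_prediction_latency_seconds:

```promql
histogram_quantile(0.50, sum by (le) (rate(model_prediction_latency_seconds_bucket[5m])))
histogram_quantile(0.95, sum by (le) (rate(model_prediction_latency_seconds_bucket[5m])))
histogram_quantile(0.99, sum by (le) (rate(model_prediction_latency_seconds_bucket[5m])))
```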
Prediction output distribution over time – use a heatmap panel with this query:
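Something like the following, keeping the le label so Grafana can build bucket rows ($__interval is Grafana's built-in interval variable):

```promql
sum by (le) (increase(model_prediction_value_bucket[$__interval]))
```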
Error rate percentage:
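Assuming 5xx responses are recorded under the status label:

```promql
100 * sum(rate(model_requests_total{status=~"5.."}[5m]))
    / sum(rate(model_requests_total[5m]))
```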
Feature drift scores – use a time series panel:
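Since the drift score is a gauge with a feature label, the raw series is enough — one line per feature:

```promql
model_feature_drift_score
```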
For the heatmap panel showing prediction distribution, set the format to “Heatmap” in the Grafana query options. This gives you a visual fingerprint of your model’s output – any color shift means the distribution changed and you should investigate.
Common Errors and Fixes
Prometheus shows target as DOWN. The most common cause is a network issue between containers. Make sure both services are on the same Docker network (Compose does this by default) and that the target hostname matches the service name in docker-compose.yml. Check with docker compose exec prometheus wget -qO- http://model-server:8000/metrics.
Metrics endpoint returns empty or partial data. If you import prometheus_client but never call the metric constructors at module level, the /metrics endpoint will only show default Python process metrics. Declare your Counter, Histogram, and Gauge objects at the top of the module, not inside a function.
Histogram buckets show +Inf only. Your observed values are all above the highest bucket boundary. Adjust the buckets parameter to cover your actual value range. For latency, start with [0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5] and widen if your model is slow.
Grafana heatmap panel shows “No data”. You probably have the query format set to “Table” instead of “Heatmap”. In the query editor, change the Format dropdown to “Heatmap”. Also confirm that the time range selector covers a period when the server was actually receiving traffic.
The rate() function returns nothing for a new counter. Prometheus needs at least two scrape points to compute a rate. After starting the stack, wait at least two scrape intervals (30 seconds with the default config) before expecting rate queries to return data. Send a few test requests with curl -X POST http://localhost:8000/predict -H "Content-Type: application/json" -d '{"features": [0.5, 0.3, 0.7]}' to seed the metrics.
Alert fires immediately on deploy. The offset comparison in the distribution shift alert will behave unpredictably if there is no data from 24 hours ago. Add a for: 30m clause (already included above) and consider gating the alert with an unless clause that checks for sufficient data history.
Related Guides
- How to Build a Model Performance Alerting Pipeline with Webhooks
- How to Build a Model Endpoint Load Balancer with NGINX and FastAPI
- How to Build Feature Flags for ML Model Rollouts
- How to Build a Model Warm-Up and Health Check Pipeline with FastAPI
- How to Build a Model Load Testing Pipeline with Locust and FastAPI
- How to Build a Model A/B Testing Framework with FastAPI
- How to Build a Model Warm Pool with Preloaded Containers on ECS
- How to Build a Model Explainability Dashboard with SHAP and Streamlit
- How to Build a Model Health Dashboard with FastAPI and SQLite
- How to Build a Model Serving Pipeline with LitServe and Lightning