The Setup: Instrument, Scrape, Visualize

You need three things: a FastAPI model server that exposes Prometheus metrics, a Prometheus instance that scrapes them, and Grafana dashboards that make those numbers useful. The whole stack runs in Docker Compose and takes about 20 minutes to wire up.

Here is the instrumented FastAPI server. This is the core of the whole system – every other component just reads from what this exposes.

# app/server.py
import time
import numpy as np
from fastapi import FastAPI
from pydantic import BaseModel
from prometheus_client import (
    Counter, Histogram, Gauge, Info, generate_latest, CONTENT_TYPE_LATEST
)
from starlette.responses import Response

app = FastAPI()

# --- Metrics ---
REQUEST_COUNT = Counter(
    "model_request_total",
    "Total prediction requests",
    ["model_name", "status"]
)
REQUEST_LATENCY = Histogram(
    "model_request_latency_seconds",
    "Prediction latency in seconds",
    ["model_name"],
    buckets=[0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5]
)
PREDICTION_VALUE = Histogram(
    "model_prediction_value",
    "Distribution of prediction outputs",
    ["model_name"],
    buckets=[0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
)
FEATURE_DRIFT = Gauge(
    "model_feature_drift_score",
    "PSI-based feature drift indicator",
    ["model_name", "feature_name"]
)
MODEL_INFO = Info("model_version", "Currently loaded model metadata")
MODEL_INFO.info({"version": "v2.3.1", "framework": "xgboost", "trained_at": "2026-02-10"})


class PredictionRequest(BaseModel):
    features: list[float]


@app.post("/predict")
async def predict(req: PredictionRequest):
    model_name = "fraud_detector"
    start = time.perf_counter()

    try:
        # Your actual model inference goes here.
        # This simulates a prediction for demonstration.
        score = float(np.mean(req.features) * 0.7 + np.random.normal(0, 0.05))
        score = max(0.0, min(1.0, score))

        latency = time.perf_counter() - start
        REQUEST_COUNT.labels(model_name=model_name, status="success").inc()
        REQUEST_LATENCY.labels(model_name=model_name).observe(latency)
        PREDICTION_VALUE.labels(model_name=model_name).observe(score)

        # Simulate drift scoring on the first two features
        if len(req.features) >= 2:
            FEATURE_DRIFT.labels(model_name=model_name, feature_name="amount").set(
                abs(req.features[0] - 0.5)  # distance from training mean
            )
            FEATURE_DRIFT.labels(model_name=model_name, feature_name="frequency").set(
                abs(req.features[1] - 0.3)
            )

        return {"score": round(score, 4), "latency_ms": round(latency * 1000, 2)}

    except Exception:
        REQUEST_COUNT.labels(model_name=model_name, status="error").inc()
        raise


@app.get("/metrics")
async def metrics():
    return Response(content=generate_latest(), media_type=CONTENT_TYPE_LATEST)

The key decisions here: use a Histogram for latency rather than a Summary, because histograms let Prometheus compute arbitrary percentiles server-side and aggregate them across instances; Summary quantiles are computed client-side and cannot be meaningfully aggregated. The prediction value histogram tracks the output distribution: when that shape changes, your model or your input data has drifted. The Gauge for feature drift lets you push a computed drift score that Grafana can threshold on. Note that the abs-distance calculation in the handler is a stand-in; in production you would compute a real PSI over a window of recent values.
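If you want the drift gauge to carry a real PSI rather than a toy distance, here is a minimal sketch. The function name, bin count, and epsilon smoothing are my own choices, not part of the server code above:

```python
# psi.py - Population Stability Index sketch (illustrative, not from the stack above)
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10, eps: float = 1e-6) -> float:
    """PSI between a training reference sample and a window of live values.

    Bin edges come from the reference sample; live values outside that range
    fall off the histogram, which is acceptable for a rough drift score.
    """
    edges = np.histogram_bin_edges(expected, bins=bins)
    # Per-bin proportions, smoothed with eps so the log ratio never divides by zero
    e_pct = np.histogram(expected, bins=edges)[0] / max(len(expected), 1) + eps
    a_pct = np.histogram(actual, bins=edges)[0] / max(len(actual), 1) + eps
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))
```

A common rule of thumb treats PSI below 0.1 as stable, 0.1 to 0.25 as moderate drift, and above 0.25 as significant, which lines up roughly with the 0.3 alert threshold used later in alerts.yml. You would call FEATURE_DRIFT.labels(...).set(psi(train_sample, recent_values)) from a periodic background task, not per request.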

Prometheus Configuration

Prometheus needs to know where to scrape. Create prometheus/prometheus.yml:

# prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "alerts.yml"

scrape_configs:
  - job_name: "model-server"
    metrics_path: "/metrics"
    static_configs:
      - targets: ["model-server:8000"]
        labels:
          environment: "production"

The 15-second scrape interval is a good default. Going lower than 10 seconds creates a lot of storage churn for marginal benefit. Going higher than 30 seconds means you miss short latency spikes.

Alerting Rules

This is where monitoring turns into something actionable. Create prometheus/alerts.yml:

# prometheus/alerts.yml
groups:
  - name: model_alerts
    rules:
      - alert: HighPredictionLatency
        expr: histogram_quantile(0.95, sum by (le, model_name) (rate(model_request_latency_seconds_bucket[5m]))) > 0.5
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "P95 prediction latency above 500ms for {{ $labels.model_name }}"

      - alert: HighErrorRate
        expr: >
          sum by (model_name) (rate(model_request_total{status="error"}[5m]))
          / sum by (model_name) (rate(model_request_total[5m])) > 0.05
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Error rate above 5% for {{ $labels.model_name }}"

      - alert: PredictionDistributionShift
        expr: >
          abs(
            histogram_quantile(0.5, rate(model_prediction_value_bucket[1h]))
            - histogram_quantile(0.5, rate(model_prediction_value_bucket[1h] offset 24h))
          ) > 0.15
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Median prediction shifted >0.15 compared to 24h ago for {{ $labels.model_name }}"

      - alert: FeatureDriftDetected
        expr: model_feature_drift_score > 0.3
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Feature {{ $labels.feature_name }} drift score above threshold"

The PredictionDistributionShift alert is the most interesting one. It compares the median prediction right now against the median from 24 hours ago. A shift of 0.15 on a 0-1 scale is significant enough to investigate but not so sensitive that normal traffic variation triggers it. Tune this threshold based on your model’s output range.
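Alert expressions are easy to get subtly wrong, so it is worth unit-testing them with promtool before deploying. A sketch of a rule test for the drift alert (the file name and series values are illustrative):

```yaml
# prometheus/alerts_test.yml - run with: promtool test rules alerts_test.yml
rule_files:
  - alerts.yml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # Drift score pinned at 0.4 for 16 minutes, above the 0.3 threshold
      - series: 'model_feature_drift_score{model_name="fraud_detector",feature_name="amount"}'
        values: '0.4+0x15'
    alert_rule_test:
      # for: 10m means the alert should be firing by minute 11
      - eval_time: 11m
        alertname: FeatureDriftDetected
        exp_alerts:
          - exp_labels:
              severity: warning
              model_name: fraud_detector
              feature_name: amount
```

promtool ships inside the prom/prometheus image if you do not have it locally, and a plain promtool check rules alerts.yml catches syntax errors on its own.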

Docker Compose Stack

Wire everything together with Compose:

# docker-compose.yml
services:
  model-server:
    build: .
    ports:
      - "8000:8000"
    command: uvicorn app.server:app --host 0.0.0.0 --port 8000

  prometheus:
    image: prom/prometheus:v2.51.0
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - ./prometheus/alerts.yml:/etc/prometheus/alerts.yml
    ports:
      - "9090:9090"

  grafana:
    image: grafana/grafana:10.4.0
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=changeme
    volumes:
      - grafana-data:/var/lib/grafana

volumes:
  grafana-data:

Start everything with docker compose up -d. Hit http://localhost:9090/targets to confirm Prometheus is scraping the model server. Then open Grafana at http://localhost:3000, add Prometheus as a data source (URL: http://prometheus:9090), and start building panels.
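The model-server service uses build: ., which expects a Dockerfile at the repository root. The Compose file above does not show one, so here is a minimal sketch (the base image tag and file layout are assumptions):

```dockerfile
# Dockerfile - minimal image for the FastAPI model server
FROM python:3.11-slim

WORKDIR /srv
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY app/ app/
EXPOSE 8000
# Compose overrides the command, but a default keeps the image runnable on its own
CMD ["uvicorn", "app.server:app", "--host", "0.0.0.0", "--port", "8000"]
```

with a requirements.txt listing roughly fastapi, uvicorn, numpy, and prometheus-client (pydantic comes in as a FastAPI dependency).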

Grafana Dashboard Queries

Here are the PromQL queries you want on your dashboard. These go directly into Grafana panel query fields.

Request rate (requests per second, by status):

sum(rate(model_request_total[5m])) by (status)

P50 / P95 / P99 latency (add one query per percentile, swapping the first argument to 0.50 or 0.99):

histogram_quantile(0.95, sum(rate(model_request_latency_seconds_bucket[5m])) by (le))

Prediction output distribution over time – use a heatmap panel with this query:

sum(rate(model_prediction_value_bucket[5m])) by (le)

Error rate percentage:

100 * sum(rate(model_request_total{status="error"}[5m])) / sum(rate(model_request_total[5m]))

Feature drift scores – use a time series panel:

model_feature_drift_score

For the heatmap panel showing prediction distribution, set the format to “Heatmap” in the Grafana query options. This gives you a visual fingerprint of your model’s output – any color shift means the distribution changed and you should investigate.

Common Errors and Fixes

Prometheus shows target as DOWN. The most common cause is a network issue between containers. Make sure both services are on the same Docker network (Compose does this by default) and that the target hostname matches the service name in docker-compose.yml. Check with docker compose exec prometheus wget -qO- http://model-server:8000/metrics.

Metrics endpoint returns empty or partial data. If you import prometheus_client but never call the metric constructors at module level, the /metrics endpoint will only show default Python process metrics. Declare your Counter, Histogram, and Gauge objects at the top of the module, not inside a function.

Histogram buckets show +Inf only. Your observed values are all above the highest bucket boundary. Adjust the buckets parameter to cover your actual value range. For latency, start with [0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5] and widen if your model is slow.

Grafana heatmap panel shows “No data”. You probably have the query format set to “Table” instead of “Heatmap”. In the query editor, change the Format dropdown to “Heatmap”. Also confirm that the time range selector covers a period when the server was actually receiving traffic.

rate() returns nothing for a new counter. Prometheus needs at least two scrape points to compute a rate. After starting the stack, wait at least two scrape intervals (30 seconds with the default config) before expecting rate queries to return data. Send a few test requests with curl -X POST http://localhost:8000/predict -H "Content-Type: application/json" -d '{"features": [0.5, 0.3, 0.7]}' to seed the metrics.

Alert fires immediately on deploy. The offset comparison in the distribution shift alert will behave unpredictably if there is no data from 24 hours ago. The for: 30m clause (already included above) absorbs brief blips, and you can gate the alert with an unless clause that checks that 24-hour-old data actually exists.
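One concrete form of that gate uses unless with absent(). This is a sketch I have not validated against long-horizon data; on() is needed because absent() returns an unlabeled vector:

```yaml
# Append to the PredictionDistributionShift expr: suppress the alert
# when there is no bucket data from 24h ago to compare against
expr: >
  (<existing shift expression>)
  unless on() absent(model_prediction_value_bucket offset 24h)
```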