You’re running three models behind a FastAPI server. Traffic is growing. The GPU bill arrives and you have no idea which model ate 80% of the budget. Sound familiar?

The fix is a cost dashboard that ties every request back to a dollar estimate. You instrument your server with Prometheus counters, pipe them into Grafana, and suddenly you can see cost per model, cost per endpoint, and daily trends in real time. Here’s how to build it.

Instrumenting Your Model Server with Prometheus

Start with a FastAPI server that exposes custom Prometheus metrics. You need four things: a request counter, a token counter, a latency histogram, and a GPU utilization gauge.

Install the dependencies:

pip install fastapi uvicorn prometheus_client torch transformers

Now build the server. This uses FastAPI’s lifespan context manager for model loading – not the deprecated @app.on_event decorator.

import time
from contextlib import asynccontextmanager
from typing import Any

import torch
from fastapi import FastAPI, HTTPException, Request
from prometheus_client import Counter, Gauge, Histogram, generate_latest
from starlette.responses import Response
from transformers import AutoModelForCausalLM, AutoTokenizer

# --- Prometheus metrics ---
REQUEST_COUNT = Counter(
    "model_request_total",
    "Total inference requests",
    ["model_name", "endpoint"],
)
TOKEN_COUNT = Counter(
    "model_tokens_total",
    "Total tokens processed",
    ["model_name", "direction"],  # direction: input or output
)
LATENCY = Histogram(
    "model_request_duration_seconds",
    "Request latency in seconds",
    ["model_name"],
    buckets=[0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0],
)
GPU_UTILIZATION = Gauge(
    "gpu_utilization_percent",
    "Current GPU memory utilization",
)
COST_PER_REQUEST = Counter(
    "model_estimated_cost_dollars",
    "Estimated cost in USD per request",
    ["model_name"],
)

models: dict[str, Any] = {}
tokenizers: dict[str, Any] = {}

MODEL_REGISTRY = {
    "distilgpt2": "distilgpt2",
    "gpt2": "gpt2",
}

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Load models on startup
    for name, model_id in MODEL_REGISTRY.items():
        tokenizers[name] = AutoTokenizer.from_pretrained(model_id)
        models[name] = AutoModelForCausalLM.from_pretrained(model_id)
        if torch.cuda.is_available():
            models[name] = models[name].to("cuda")
        print(f"Loaded {name}")
    yield
    # Cleanup on shutdown
    models.clear()
    tokenizers.clear()

app = FastAPI(lifespan=lifespan)

# Cost constants -- adjust to your hardware
GPU_COST_PER_HOUR = 2.50  # e.g., A10G on AWS
GPU_COST_PER_SECOND = GPU_COST_PER_HOUR / 3600


@app.get("/metrics")
async def metrics():
    if torch.cuda.is_available():
        total_mem = torch.cuda.get_device_properties(0).total_memory
        mem_used = torch.cuda.memory_allocated() / total_mem * 100
        GPU_UTILIZATION.set(round(mem_used, 2))
    return Response(content=generate_latest(), media_type="text/plain")


@app.post("/v1/generate/{model_name}")
async def generate(model_name: str, request: Request):
    if model_name not in models:
        raise HTTPException(status_code=404, detail=f"Model {model_name} not found")

    body = await request.json()
    prompt = body.get("prompt", "Hello")
    max_tokens = body.get("max_tokens", 50)

    tokenizer = tokenizers[model_name]
    model = models[model_name]

    inputs = tokenizer(prompt, return_tensors="pt")
    if torch.cuda.is_available():
        inputs = {k: v.to("cuda") for k, v in inputs.items()}

    input_token_count = inputs["input_ids"].shape[1]

    start = time.perf_counter()
    with torch.no_grad():
        outputs = model.generate(**inputs, max_new_tokens=max_tokens)
    elapsed = time.perf_counter() - start

    output_token_count = outputs.shape[1] - input_token_count
    generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

    # Record Prometheus metrics
    REQUEST_COUNT.labels(model_name=model_name, endpoint="/v1/generate").inc()
    TOKEN_COUNT.labels(model_name=model_name, direction="input").inc(input_token_count)
    TOKEN_COUNT.labels(model_name=model_name, direction="output").inc(output_token_count)
    LATENCY.labels(model_name=model_name).observe(elapsed)

    # Estimate cost and record it
    estimated_cost = elapsed * GPU_COST_PER_SECOND
    COST_PER_REQUEST.labels(model_name=model_name).inc(estimated_cost)

    return {
        "model": model_name,
        "text": generated_text,
        "input_tokens": input_token_count,
        "output_tokens": output_token_count,
        "latency_seconds": round(elapsed, 4),
        "estimated_cost_usd": round(estimated_cost, 6),
    }

Save the file as main.py and run it with uvicorn main:app --host 0.0.0.0 --port 8000. Hit /metrics to see the raw Prometheus output.
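To sanity-check the instrumentation, send a test request and confirm the counters move. Here's a minimal client sketch using only the standard library; it assumes the server above is running on localhost:8000, and the helper names are just for illustration:

```python
import json
import urllib.request

BASE = "http://localhost:8000"  # assumes the server above is running here

def build_request(model_name: str, prompt: str, max_tokens: int = 20):
    """Build the URL and JSON body for a generate call."""
    url = f"{BASE}/v1/generate/{model_name}"
    body = json.dumps({"prompt": prompt, "max_tokens": max_tokens}).encode()
    return url, body

def generate(model_name: str, prompt: str, max_tokens: int = 20) -> dict:
    """POST a prompt to the server and return the parsed JSON response."""
    url, body = build_request(model_name, prompt, max_tokens)
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Usage (with the server running):
#   generate("distilgpt2", "The GPU bill this month")
# then re-scrape /metrics and watch model_request_total tick up.
```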

Calculating Cost Per Request

The key metric is model_estimated_cost_dollars. It multiplies GPU wall-clock time by your hourly rate. This works for self-hosted models where you own or rent the GPU.

The formula is simple:

cost = request_latency_seconds * (gpu_cost_per_hour / 3600)

For a request that takes 0.8 seconds on a $2.50/hour A10G, that’s 0.8 * 0.000694 = $0.000556. Small per request, but at 100K requests per day that’s $55.56 daily.
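The same arithmetic in a few lines of Python, using the $2.50/hour rate from above:

```python
# Cost-per-request arithmetic from the formula above.
GPU_COST_PER_HOUR = 2.50  # A10G example rate
GPU_COST_PER_SECOND = GPU_COST_PER_HOUR / 3600

def request_cost(latency_seconds: float) -> float:
    """Estimated USD cost of one request occupying the GPU for its full latency."""
    return latency_seconds * GPU_COST_PER_SECOND

def daily_cost(latency_seconds: float, requests_per_day: int) -> float:
    """Scale the per-request estimate to a daily total."""
    return request_cost(latency_seconds) * requests_per_day

print(round(request_cost(0.8), 6))          # 0.000556
print(round(daily_cost(0.8, 100_000), 2))   # 55.56
```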

If you want cost-per-token instead (useful for comparing against API pricing), add this PromQL query in Grafana:

rate(model_estimated_cost_dollars_total[5m])
  /
rate(model_tokens_total{direction="output"}[5m])

This gives you real-time cost per output token, broken down by model. You can directly compare it against OpenAI or Anthropic API pricing to decide whether self-hosting saves money.
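To make that comparison concrete, here's a back-of-the-envelope sketch mirroring the PromQL query. The throughput figure and API price below are placeholders, not real pricing; plug in your own numbers:

```python
def self_hosted_cost_per_1k_tokens(cost_rate_usd_per_s: float,
                                   output_tokens_per_s: float) -> float:
    """Mirror of the PromQL query: cost rate / output token rate, scaled to 1K tokens."""
    return cost_rate_usd_per_s / output_tokens_per_s * 1000

# Example: a $2.50/hr GPU producing 40 output tokens/sec (assumed throughput)
gpu_rate = 2.50 / 3600
ours = self_hosted_cost_per_1k_tokens(gpu_rate, 40)
api_price_per_1k = 0.002  # placeholder API rate, USD per 1K output tokens

print(f"self-hosted: ${ours:.5f}/1K tokens vs API: ${api_price_per_1k}/1K tokens")
print("self-hosting cheaper" if ours < api_price_per_1k else "API cheaper")
```

At low utilization, self-hosting often loses this comparison; the dashboard makes the break-even point visible per model.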

Setting Up Grafana Dashboards

First, make sure Prometheus scrapes your server. Add this to prometheus.yml:

scrape_configs:
  - job_name: "model-server"
    scrape_interval: 15s
    static_configs:
      - targets: ["host.docker.internal:8000"]

Now create a Grafana dashboard. You can import this JSON via Grafana’s dashboard import (Dashboards > Import), or save it as a provisioned dashboard file.

{
  "dashboard": {
    "title": "Model Serving Cost Dashboard",
    "uid": "model-cost-v1",
    "timezone": "browser",
    "refresh": "30s",
    "panels": [
      {
        "title": "Cumulative Cost by Model ($)",
        "type": "timeseries",
        "gridPos": { "h": 8, "w": 12, "x": 0, "y": 0 },
        "targets": [
          {
            "expr": "model_estimated_cost_dollars_total",
            "legendFormat": "{{ model_name }}"
          }
        ],
        "fieldConfig": {
          "defaults": { "unit": "currencyUSD" }
        }
      },
      {
        "title": "Cost Rate ($/min) by Model",
        "type": "timeseries",
        "gridPos": { "h": 8, "w": 12, "x": 12, "y": 0 },
        "targets": [
          {
            "expr": "rate(model_estimated_cost_dollars_total[5m]) * 60",
            "legendFormat": "{{ model_name }}"
          }
        ],
        "fieldConfig": {
          "defaults": { "unit": "currencyUSD" }
        }
      },
      {
        "title": "Daily Estimated Cost by Model ($)",
        "type": "stat",
        "gridPos": { "h": 6, "w": 12, "x": 0, "y": 8 },
        "targets": [
          {
            "expr": "increase(model_estimated_cost_dollars_total[24h])",
            "legendFormat": "{{ model_name }}"
          }
        ],
        "fieldConfig": {
          "defaults": { "unit": "currencyUSD", "thresholds": {
            "steps": [
              { "color": "green", "value": null },
              { "color": "yellow", "value": 50 },
              { "color": "red", "value": 100 }
            ]
          }}
        }
      },
      {
        "title": "Cost per 1K Output Tokens",
        "type": "timeseries",
        "gridPos": { "h": 6, "w": 12, "x": 12, "y": 8 },
        "targets": [
          {
            "expr": "(rate(model_estimated_cost_dollars_total[5m]) / rate(model_tokens_total{direction=\"output\"}[5m])) * 1000",
            "legendFormat": "{{ model_name }}"
          }
        ],
        "fieldConfig": {
          "defaults": { "unit": "currencyUSD" }
        }
      },
      {
        "title": "Request Rate by Endpoint",
        "type": "timeseries",
        "gridPos": { "h": 8, "w": 12, "x": 0, "y": 14 },
        "targets": [
          {
            "expr": "rate(model_request_total[5m])",
            "legendFormat": "{{ model_name }} - {{ endpoint }}"
          }
        ]
      },
      {
        "title": "GPU Memory Utilization (%)",
        "type": "gauge",
        "gridPos": { "h": 8, "w": 12, "x": 12, "y": 14 },
        "targets": [
          {
            "expr": "gpu_utilization_percent"
          }
        ],
        "fieldConfig": {
          "defaults": { "unit": "percent", "max": 100, "thresholds": {
            "steps": [
              { "color": "green", "value": null },
              { "color": "yellow", "value": 70 },
              { "color": "red", "value": 90 }
            ]
          }}
        }
      }
    ]
  }
}

Save this as dashboards/model-cost.json and point Grafana’s provisioning config at it, or paste it into the dashboard JSON editor directly.

Alerting on Cost Spikes

Grafana alerting can fire when your daily cost crosses a threshold. Go to Alerting > Alert Rules > New Alert Rule, or define it in a provisioning file:

# alerting/cost-alerts.yaml
apiVersion: 1
groups:
  - orgId: 1
    name: model-cost-alerts
    folder: MLOps
    interval: 5m
    rules:
      - uid: daily-cost-spike
        title: "Daily model cost exceeds $100"
        condition: C
        data:
          - refId: A
            relativeTimeRange:
              from: 86400  # 24 hours in seconds
              to: 0
            datasourceUid: prometheus
            model:
              expr: increase(model_estimated_cost_dollars_total[24h])
              intervalMs: 60000
              maxDataPoints: 43200
          - refId: B
            relativeTimeRange:
              from: 0
              to: 0
            datasourceUid: __expr__
            model:
              type: reduce
              expression: A
              reducer: sum
          - refId: C
            relativeTimeRange:
              from: 0
              to: 0
            datasourceUid: __expr__
            model:
              type: threshold
              expression: B
              conditions:
                - evaluator:
                    type: gt
                    params: [100]
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Daily inference cost exceeded $100"
          description: "Total model serving cost in the last 24h is {{ $values.B }}. Check which model is driving the spike."

Drop this into Grafana’s provisioning directory (usually /etc/grafana/provisioning/alerting/) and restart Grafana. The alert evaluates every 5 minutes and fires if the 24-hour cost stays above $100 for at least 10 minutes. Wire it to a Slack or PagerDuty contact point so you hear about it immediately.

Common Errors and Fixes

Prometheus scrape timeout on /metrics

If your model server is slow to respond (loading large models), Prometheus may time out. Increase the scrape timeout:

scrape_configs:
  - job_name: "model-server"
    scrape_interval: 15s
    scrape_timeout: 15s  # default is 10s; must not exceed scrape_interval
    static_configs:
      - targets: ["host.docker.internal:8000"]

Also check that the /metrics endpoint doesn’t trigger model inference. The endpoint shown above only reads torch.cuda memory stats, which is fast.

Metric cardinality explosion

If you add labels like user_id or request_id to your Prometheus counters, you’ll create a new time series for every unique value. Prometheus will eat your RAM and eventually crash. Stick to low-cardinality labels: model_name, endpoint, direction. If you need per-user tracking, push that to a separate system like BigQuery or ClickHouse.
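One pattern that keeps cardinality bounded is clamping free-form values to a small fixed set before using them as a label value. A sketch (the tier scheme here is hypothetical; in the real server you'd pass the clamped value to `.labels()`):

```python
KNOWN_TIERS = {"free", "pro", "enterprise"}  # hypothetical, small fixed set

def safe_tier(tier: str) -> str:
    """Clamp arbitrary input to a bounded set so Prometheus label values stay finite."""
    return tier if tier in KNOWN_TIERS else "other"

# In a request handler you would then do something like:
#   TIER_REQUESTS.labels(model_name=model_name, tier=safe_tier(user_tier)).inc()
# Unbounded values (user IDs, request IDs) all collapse into "other":
print(safe_tier("pro"))         # pro
print(safe_tier("user-12345"))  # other
```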

Grafana “No data” for Prometheus queries

Usually one of three things:

  1. The datasource URL is wrong. Go to Connections > Data Sources > Prometheus and verify the URL. If both run in Docker, use http://prometheus:9090, not localhost.
  2. The metric name has a _total suffix that Prometheus auto-appended. Try querying model_estimated_cost_dollars_total instead of model_estimated_cost_dollars.
  3. The time range is too narrow. Counters only show data after they’ve been incremented at least once. Send a few test requests first.

GPU utilization reads 0 on CPU machines

The torch.cuda.is_available() guard handles this – on CPU-only boxes the gauge just won’t update. If you want CPU cost tracking, swap in a metric based on time.process_time() instead and adjust the hourly rate to your CPU instance cost.
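A minimal sketch of what CPU cost tracking could look like. The helper and the hourly rate are assumptions, not part of the server above:

```python
import time

CPU_COST_PER_HOUR = 0.40  # placeholder; set to your CPU instance's actual rate
CPU_COST_PER_SECOND = CPU_COST_PER_HOUR / 3600

def timed_cpu_cost(fn, *args, **kwargs):
    """Run fn and return (result, estimated USD cost of the CPU time consumed)."""
    start = time.process_time()  # CPU time, not wall-clock
    result = fn(*args, **kwargs)
    elapsed = time.process_time() - start
    return result, elapsed * CPU_COST_PER_SECOND

# Example with a dummy workload; in the server you'd wrap the inference call
# and feed the cost into COST_PER_REQUEST exactly as in the GPU path.
result, cost = timed_cpu_cost(sum, range(1_000_000))
print(result, round(cost, 8))
```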