Most teams discover their inference bill is out of control weeks after the damage is done. The monthly invoice arrives, someone does the math, and suddenly that GPT-4o endpoint your intern wired up is burning $400/day on a feature nobody uses. The fix is per-request cost tracking baked into your serving layer from day one.
Here is a minimal but complete setup. A FastAPI model endpoint instrumented with OpenTelemetry for distributed tracing and prometheus_client for cost metrics. Every request records input tokens, output tokens, the computed cost, and which user or tenant triggered it.
```
# requirements.txt
fastapi==0.115.0
uvicorn==0.32.0
opentelemetry-api==1.28.0
opentelemetry-sdk==1.28.0
opentelemetry-exporter-otlp==1.28.0
prometheus_client==0.21.0
tiktoken==0.8.0
```
Install with `pip install -r requirements.txt`. The `tiktoken` library handles token counting for OpenAI-compatible models. If you are serving open-source models, swap in your tokenizer.
## Define Model Pricing
Hard-coding prices in a config dictionary beats calling pricing APIs at request time. Update this when providers change rates – which happens quarterly at most.
```python
# pricing.py
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelPricing:
    input_cost_per_1k: float   # USD per 1,000 input tokens
    output_cost_per_1k: float  # USD per 1,000 output tokens


# Prices as of early 2026. Check provider docs for current rates.
MODEL_PRICING: dict[str, ModelPricing] = {
    "gpt-4o": ModelPricing(
        input_cost_per_1k=0.0025,
        output_cost_per_1k=0.01,
    ),
    "gpt-4o-mini": ModelPricing(
        input_cost_per_1k=0.00015,
        output_cost_per_1k=0.0006,
    ),
    "claude-3.5-sonnet": ModelPricing(
        input_cost_per_1k=0.003,
        output_cost_per_1k=0.015,
    ),
    "claude-3.5-haiku": ModelPricing(
        input_cost_per_1k=0.0008,
        output_cost_per_1k=0.004,
    ),
    "llama-3.1-70b": ModelPricing(
        input_cost_per_1k=0.00035,
        output_cost_per_1k=0.0004,
    ),
    "mixtral-8x22b": ModelPricing(
        input_cost_per_1k=0.0009,
        output_cost_per_1k=0.0009,
    ),
}


def compute_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return USD cost for a single request. Raises KeyError for unknown models."""
    pricing = MODEL_PRICING[model]
    input_cost = (input_tokens / 1000) * pricing.input_cost_per_1k
    output_cost = (output_tokens / 1000) * pricing.output_cost_per_1k
    return round(input_cost + output_cost, 8)
```
The `frozen=True` on the dataclass prevents accidental mutation. The `compute_cost` function intentionally raises `KeyError` for unknown models – you want that to fail loudly rather than silently report zero cost.
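As a quick sanity check on the arithmetic, here are the gpt-4o rates from the table applied by hand (the token counts are made up for illustration):

```python
# gpt-4o rates from the pricing table above
input_rate_per_1k = 0.0025
output_rate_per_1k = 0.01

# Hypothetical request: 1,200 input tokens, 350 output tokens
input_cost = (1200 / 1000) * input_rate_per_1k    # 0.003
output_cost = (350 / 1000) * output_rate_per_1k   # 0.0035
total = round(input_cost + output_cost, 8)
print(total)  # 0.0065
```

Two-thirds of a cent per request sounds harmless until you multiply by a few hundred thousand requests a day, which is exactly why the per-request histogram below earns its keep.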
## Set Up OpenTelemetry and Prometheus Metrics
Wire up OpenTelemetry tracing and Prometheus counters in one place. The tracing gives you per-request spans with cost attributes. The Prometheus metrics give you aggregated totals for dashboards and alerts.
```python
# telemetry.py
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from prometheus_client import Counter, Histogram, Gauge

# --- OpenTelemetry Tracing ---
resource = Resource.create({"service.name": "inference-cost-tracker"})
provider = TracerProvider(resource=resource)
otlp_exporter = OTLPSpanExporter(
    endpoint="http://localhost:4317",
    insecure=True,
)
provider.add_span_processor(BatchSpanProcessor(otlp_exporter))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("inference-cost-tracker")

# --- Prometheus Metrics ---
REQUEST_COST = Counter(
    "inference_cost_usd_total",
    "Cumulative inference cost in USD",
    ["model", "tenant"],
)
INPUT_TOKENS = Counter(
    "inference_input_tokens_total",
    "Total input tokens processed",
    ["model", "tenant"],
)
OUTPUT_TOKENS = Counter(
    "inference_output_tokens_total",
    "Total output tokens generated",
    ["model", "tenant"],
)
REQUEST_LATENCY = Histogram(
    "inference_request_duration_seconds",
    "End-to-end inference latency",
    ["model"],
    buckets=[0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0, 30.0],
)
COST_PER_REQUEST = Histogram(
    "inference_cost_per_request_usd",
    "Cost distribution per individual request",
    ["model"],
    buckets=[0.0001, 0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1.0],
)
COST_ALERT_THRESHOLD = Gauge(
    "inference_cost_alert_threshold_usd",
    "Current cost alert threshold per tenant per hour",
    ["tenant"],
)
```
The `COST_PER_REQUEST` histogram is the most useful metric here. It tells you what a typical request costs for each model, which is the number product managers actually care about.
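Prometheus histograms store cumulative bucket counts, and `histogram_quantile` interpolates a quantile from them. A simplified sketch of that interpolation, with made-up bucket counts matching the cost buckets above:

```python
# Simplified version of Prometheus histogram_quantile interpolation.
# Buckets are (le_upper_bound, cumulative_count); the counts are hypothetical.
buckets = [(0.0001, 40), (0.001, 90), (0.005, 99), (0.01, 100)]


def approx_quantile(q: float, buckets: list[tuple[float, int]]) -> float:
    target = q * buckets[-1][1]  # rank of the desired quantile
    prev_le, prev_count = 0.0, 0
    for le, count in buckets:
        if count >= target:
            # Linear interpolation inside the bucket, as Prometheus does
            return prev_le + (le - prev_le) * (target - prev_count) / (count - prev_count)
        prev_le, prev_count = le, count
    return buckets[-1][0]


print(round(approx_quantile(0.95, buckets), 6))  # 0.003222
```

This also explains why bucket boundaries matter: the P95 estimate can only be as precise as the bucket it lands in, so pick boundaries around the costs you actually expect.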
## Build the Instrumented FastAPI Endpoint
This is the full serving layer. The `lifespan` context manager handles startup and shutdown – do not use the `@app.on_event` decorator, which is deprecated in current FastAPI versions.
```python
# server.py
import time
from contextlib import asynccontextmanager

import tiktoken
from fastapi import FastAPI
from fastapi.responses import Response
from pydantic import BaseModel
from prometheus_client import generate_latest, CONTENT_TYPE_LATEST

from pricing import MODEL_PRICING, compute_cost
from telemetry import (
    tracer,
    provider,
    REQUEST_COST,
    INPUT_TOKENS,
    OUTPUT_TOKENS,
    REQUEST_LATENCY,
    COST_PER_REQUEST,
    COST_ALERT_THRESHOLD,
)

# Per-tenant hourly cost thresholds (USD). Set these based on your budget.
TENANT_HOURLY_LIMITS: dict[str, float] = {
    "team-search": 50.0,
    "team-chat": 100.0,
    "team-internal": 10.0,
    "default": 25.0,
}


@asynccontextmanager
async def lifespan(app: FastAPI):
    # Initialize alert thresholds on startup
    for tenant, limit in TENANT_HOURLY_LIMITS.items():
        COST_ALERT_THRESHOLD.labels(tenant=tenant).set(limit)
    yield
    # Flush remaining spans on shutdown
    provider.shutdown()


app = FastAPI(title="Inference Cost Tracker", lifespan=lifespan)


class InferenceRequest(BaseModel):
    prompt: str
    model: str = "gpt-4o-mini"
    tenant: str = "default"
    max_tokens: int = 512


class InferenceResponse(BaseModel):
    text: str
    model: str
    input_tokens: int
    output_tokens: int
    cost_usd: float


def count_tokens(text: str, model: str) -> int:
    """Count tokens using tiktoken. Falls back to cl100k_base for unknown models."""
    try:
        encoding = tiktoken.encoding_for_model(model)
    except KeyError:
        # For non-OpenAI models, approximate with cl100k_base
        encoding = tiktoken.get_encoding("cl100k_base")
    return len(encoding.encode(text))


@app.post("/v1/inference", response_model=InferenceResponse)
async def run_inference(req: InferenceRequest):
    start = time.perf_counter()
    with tracer.start_as_current_span("inference_request") as span:
        span.set_attribute("model", req.model)
        span.set_attribute("tenant", req.tenant)
        span.set_attribute("max_tokens", req.max_tokens)

        # Count input tokens
        input_token_count = count_tokens(req.prompt, req.model)
        span.set_attribute("input_tokens", input_token_count)

        # --- Your actual model call goes here ---
        # Replace this block with your real inference client.
        # Example: response = await openai_client.chat.completions.create(...)
        generated_text = f"Response to: {req.prompt[:50]}..."
        output_token_count = count_tokens(generated_text, req.model)
        # --- End model call ---

        span.set_attribute("output_tokens", output_token_count)

        # Compute cost
        try:
            request_cost = compute_cost(req.model, input_token_count, output_token_count)
        except KeyError:
            span.set_attribute("cost.error", f"unknown model: {req.model}")
            request_cost = 0.0
        span.set_attribute("cost_usd", request_cost)

        # Record Prometheus metrics
        REQUEST_COST.labels(model=req.model, tenant=req.tenant).inc(request_cost)
        INPUT_TOKENS.labels(model=req.model, tenant=req.tenant).inc(input_token_count)
        OUTPUT_TOKENS.labels(model=req.model, tenant=req.tenant).inc(output_token_count)
        COST_PER_REQUEST.labels(model=req.model).observe(request_cost)

        latency = time.perf_counter() - start
        REQUEST_LATENCY.labels(model=req.model).observe(latency)
        span.set_attribute("latency_seconds", latency)

    return InferenceResponse(
        text=generated_text,
        model=req.model,
        input_tokens=input_token_count,
        output_tokens=output_token_count,
        cost_usd=request_cost,
    )


@app.get("/metrics")
async def metrics():
    return Response(
        content=generate_latest(),
        media_type=CONTENT_TYPE_LATEST,
    )


@app.get("/health")
async def health():
    return {"status": "ok", "models": list(MODEL_PRICING.keys())}
```
Run it with `uvicorn server:app --host 0.0.0.0 --port 8000`. Hit the `/metrics` endpoint to see Prometheus-formatted output. Point your Prometheus `scrape_configs` at port 8000:
```yaml
# prometheus.yml (snippet)
scrape_configs:
  - job_name: "inference-cost-tracker"
    scrape_interval: 15s
    static_configs:
      - targets: ["localhost:8000"]
```
## Build Cost Alerting with Prometheus Rules
Raw metrics are useless without alerts. This Prometheus alerting rule fires when any tenant’s hourly inference spend exceeds their configured threshold.
```yaml
# alert_rules.yml
groups:
  - name: inference_cost_alerts
    interval: 60s
    rules:
      - alert: InferenceCostHigh
        expr: >
          sum by (tenant) (
            rate(inference_cost_usd_total[1h])
          ) * 3600
          >
          inference_cost_alert_threshold_usd
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Tenant {{ $labels.tenant }} inference cost exceeds hourly limit"
          description: >
            Tenant {{ $labels.tenant }} is spending ${{ $value | printf "%.2f" }}/hour
            on inference, which exceeds the configured threshold.

      - alert: InferenceCostPerRequestSpike
        expr: >
          histogram_quantile(0.95,
            rate(inference_cost_per_request_usd_bucket[15m])
          ) > 0.10
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "P95 inference cost per request exceeds $0.10"
          description: >
            The 95th percentile cost per request is ${{ $value | printf "%.4f" }}.
            Check for unexpectedly long prompts or misconfigured model routing.
```
The first rule estimates hourly spend using `rate()` over a 1-hour window and compares it against the per-tenant gauge you set at startup. The second rule catches situations where individual requests get abnormally expensive – which usually means someone is sending 100K-token prompts or your model router sent traffic to gpt-4o instead of the mini variant.
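The `* 3600` in the first rule is worth spelling out: `rate()` returns a per-second increase, so multiplying by 3600 converts it back to USD per hour. With hypothetical numbers:

```python
# rate(inference_cost_usd_total[1h]) yields a per-second rate of increase.
counter_increase_usd = 45.0  # hypothetical: the counter grew $45 over the 1h window
window_seconds = 3600.0

per_second = counter_increase_usd / window_seconds  # roughly what rate() returns
estimated_hourly_spend = per_second * 3600          # back to USD/hour

print(round(estimated_hourly_spend, 2))  # 45.0
```

The round trip looks redundant here, but `rate()` also smooths over counter resets and scrape jitter, which is why the rule uses it rather than a raw `increase()`-style delta.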
Add the rules file to your Prometheus config:
```yaml
# prometheus.yml (add this)
rule_files:
  - "alert_rules.yml"
```
## Test the Pipeline End to End
Send some requests and verify the metrics show up correctly.
```bash
# Send a few test requests
curl -s -X POST http://localhost:8000/v1/inference \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Explain transformers in 3 sentences", "model": "gpt-4o-mini", "tenant": "team-search"}' \
  | python3 -m json.tool

# Expected output:
# {
#     "text": "Response to: Explain transformers in 3 sentences...",
#     "model": "gpt-4o-mini",
#     "input_tokens": 6,
#     "output_tokens": 11,
#     "cost_usd": 7.5e-06
# }

# Check the Prometheus metrics endpoint
curl -s http://localhost:8000/metrics | grep inference_cost

# Expected output (values vary):
# inference_cost_usd_total{model="gpt-4o-mini",tenant="team-search"} 7.5e-06
# inference_cost_per_request_usd_bucket{le="0.0001",model="gpt-4o-mini"} 1.0
# inference_input_tokens_total{model="gpt-4o-mini",tenant="team-search"} 6.0
# inference_output_tokens_total{model="gpt-4o-mini",tenant="team-search"} 11.0
```
To simulate a cost spike and verify alerting, send a batch of requests with the expensive model:
```bash
# Blast 50 requests to simulate load
for i in $(seq 1 50); do
  curl -s -X POST http://localhost:8000/v1/inference \
    -H "Content-Type: application/json" \
    -d "{\"prompt\": \"Write a detailed essay about request number $i with extensive analysis and thorough coverage of all relevant topics\", \"model\": \"gpt-4o\", \"tenant\": \"team-chat\"}" > /dev/null
done

# Check cost accumulation
curl -s http://localhost:8000/metrics | grep 'inference_cost_usd_total{model="gpt-4o"'
```
## Common Errors and Fixes
`KeyError: 'gpt-4-turbo'` in `compute_cost`

You sent a request with a model name that is not in your `MODEL_PRICING` dictionary. The endpoint catches this and sets the cost to 0.0, but the span records the error. Fix it by adding the model to `pricing.py`:
```python
MODEL_PRICING["gpt-4-turbo"] = ModelPricing(
    input_cost_per_1k=0.01,
    output_cost_per_1k=0.03,
)
```
`ConnectionRefusedError: [Errno 111] Connection refused` from `OTLPSpanExporter`

The OpenTelemetry collector is not running at `localhost:4317`. Either start an OTel collector or switch to console export for local development:
```python
# Replace OTLPSpanExporter with ConsoleSpanExporter for debugging
from opentelemetry.sdk.trace.export import ConsoleSpanExporter

provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
```
This does not block requests – the `BatchSpanProcessor` drops spans silently when the exporter fails. But your traces will be missing, so check this early.
`ValueError: Duplicated timeseries in CollectorRegistry`

This happens when the metrics module is imported more than once, typically during hot reload. Uvicorn with `--reload` re-imports everything, creating duplicate metric registrations in the default global registry. Fix it by using a custom registry:
```python
# telemetry.py -- register metrics on a dedicated registry
from prometheus_client import CollectorRegistry, Counter

REGISTRY = CollectorRegistry()

REQUEST_COST = Counter(
    "inference_cost_usd_total",
    "Cumulative inference cost in USD",
    ["model", "tenant"],
    registry=REGISTRY,
)
# ... register all metrics to REGISTRY


# In server.py, pass the registry to generate_latest:
@app.get("/metrics")
async def metrics():
    return Response(
        content=generate_latest(REGISTRY),
        media_type=CONTENT_TYPE_LATEST,
    )
```
`KeyError` from `tiktoken.model.MODEL_TO_ENCODING` for open-source models

Tiktoken only knows about OpenAI model names. The `count_tokens` function already handles this by falling back to `cl100k_base`, but if you need accurate counts for Llama or Mixtral, load the actual tokenizer:
```python
from transformers import AutoTokenizer

_tokenizer_cache: dict[str, AutoTokenizer] = {}


def count_tokens_hf(text: str, model_id: str) -> int:
    # Cache tokenizers: AutoTokenizer.from_pretrained is expensive on first load
    if model_id not in _tokenizer_cache:
        _tokenizer_cache[model_id] = AutoTokenizer.from_pretrained(model_id)
    return len(_tokenizer_cache[model_id].encode(text))
```
`rate()` returning NaN in Prometheus alert rules

Prometheus needs at least two data points to compute a rate. If your scrape interval is 15 seconds and you just started the server, wait at least 30 seconds before `rate()` returns meaningful values. For alerting rules with `[1h]` windows, you need an hour of data before the expression evaluates correctly.