Most teams discover their inference bill is out of control weeks after the damage is done. The monthly invoice arrives, someone does the math, and suddenly that GPT-4o endpoint your intern wired up is burning $400/day on a feature nobody uses. The fix is per-request cost tracking baked into your serving layer from day one.

Here is a minimal but complete setup: a FastAPI model endpoint instrumented with OpenTelemetry for distributed tracing and prometheus_client for cost metrics. Every request records input tokens, output tokens, the computed cost, and the user or tenant that triggered it.

# requirements.txt
fastapi==0.115.0
uvicorn==0.32.0
opentelemetry-api==1.28.0
opentelemetry-sdk==1.28.0
opentelemetry-exporter-otlp==1.28.0
prometheus_client==0.21.0
tiktoken==0.8.0

Install with pip install -r requirements.txt. The tiktoken library handles token counting for OpenAI-compatible models. If you are serving open-source models, swap in your tokenizer.

Define Model Pricing

Hard-coding prices in a config dictionary beats calling pricing APIs at request time. Update this when providers change rates, which happens a few times a year at most.

# pricing.py

from dataclasses import dataclass


@dataclass(frozen=True)
class ModelPricing:
    input_cost_per_1k: float   # USD per 1,000 input tokens
    output_cost_per_1k: float  # USD per 1,000 output tokens


# Prices as of early 2026. Check provider docs for current rates.
MODEL_PRICING: dict[str, ModelPricing] = {
    "gpt-4o": ModelPricing(
        input_cost_per_1k=0.0025,
        output_cost_per_1k=0.01,
    ),
    "gpt-4o-mini": ModelPricing(
        input_cost_per_1k=0.00015,
        output_cost_per_1k=0.0006,
    ),
    "claude-3.5-sonnet": ModelPricing(
        input_cost_per_1k=0.003,
        output_cost_per_1k=0.015,
    ),
    "claude-3.5-haiku": ModelPricing(
        input_cost_per_1k=0.0008,
        output_cost_per_1k=0.004,
    ),
    "llama-3.1-70b": ModelPricing(
        input_cost_per_1k=0.00035,
        output_cost_per_1k=0.0004,
    ),
    "mixtral-8x22b": ModelPricing(
        input_cost_per_1k=0.0009,
        output_cost_per_1k=0.0009,
    ),
}


def compute_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return USD cost for a single request. Raises KeyError for unknown models."""
    pricing = MODEL_PRICING[model]
    input_cost = (input_tokens / 1000) * pricing.input_cost_per_1k
    output_cost = (output_tokens / 1000) * pricing.output_cost_per_1k
    return round(input_cost + output_cost, 8)

The frozen=True on the dataclass prevents accidental mutation. The compute_cost function intentionally raises KeyError for unknown models: you want that to blow up loudly rather than silently report zero cost.
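
To sanity-check the arithmetic, here is a self-contained worked example. It restates compute_cost with just the gpt-4o rates from the table above so it runs on its own:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelPricing:
    input_cost_per_1k: float   # USD per 1,000 input tokens
    output_cost_per_1k: float  # USD per 1,000 output tokens


# One entry, copied from the pricing table above
MODEL_PRICING = {
    "gpt-4o": ModelPricing(input_cost_per_1k=0.0025, output_cost_per_1k=0.01),
}


def compute_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    pricing = MODEL_PRICING[model]
    input_cost = (input_tokens / 1000) * pricing.input_cost_per_1k
    output_cost = (output_tokens / 1000) * pricing.output_cost_per_1k
    return round(input_cost + output_cost, 8)


# A 1,000-token prompt with a 500-token completion:
# (1000/1000) * 0.0025 + (500/1000) * 0.01 = 0.0025 + 0.005 = 0.0075
print(compute_cost("gpt-4o", 1000, 500))  # 0.0075
```

At three-quarters of a cent per request, 50,000 requests a day is $375/day, which is exactly how bills like the one in the intro happen.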

Set Up OpenTelemetry and Prometheus Metrics

Wire up OpenTelemetry tracing and Prometheus counters in one place. The tracing gives you per-request spans with cost attributes. The Prometheus metrics give you aggregated totals for dashboards and alerts.

# telemetry.py

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from prometheus_client import Counter, Histogram, Gauge

# --- OpenTelemetry Tracing ---
resource = Resource.create({"service.name": "inference-cost-tracker"})
provider = TracerProvider(resource=resource)

otlp_exporter = OTLPSpanExporter(
    endpoint="http://localhost:4317",
    insecure=True,
)
provider.add_span_processor(BatchSpanProcessor(otlp_exporter))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("inference-cost-tracker")

# --- Prometheus Metrics ---
REQUEST_COST = Counter(
    "inference_cost_usd_total",
    "Cumulative inference cost in USD",
    ["model", "tenant"],
)
INPUT_TOKENS = Counter(
    "inference_input_tokens_total",
    "Total input tokens processed",
    ["model", "tenant"],
)
OUTPUT_TOKENS = Counter(
    "inference_output_tokens_total",
    "Total output tokens generated",
    ["model", "tenant"],
)
REQUEST_LATENCY = Histogram(
    "inference_request_duration_seconds",
    "End-to-end inference latency",
    ["model"],
    buckets=[0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0, 30.0],
)
COST_PER_REQUEST = Histogram(
    "inference_cost_per_request_usd",
    "Cost distribution per individual request",
    ["model"],
    buckets=[0.0001, 0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1.0],
)
COST_ALERT_THRESHOLD = Gauge(
    "inference_cost_alert_threshold_usd",
    "Current cost alert threshold per tenant per hour",
    ["tenant"],
)

The COST_PER_REQUEST histogram is the most useful metric here. It tells you what a typical request costs for each model, which is the number product managers actually care about.
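
For a dashboard panel, the histogram can be queried directly. A PromQL sketch (assuming the metric names defined above) for the median per-request cost by model over the last five minutes:

```promql
histogram_quantile(0.5,
  sum by (model, le) (
    rate(inference_cost_per_request_usd_bucket[5m])
  )
)
```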

Build the Instrumented FastAPI Endpoint

This is the full serving layer. The lifespan context manager handles startup and shutdown; do not use the @app.on_event decorator, which is deprecated and slated for removal in future FastAPI versions.

# server.py

import time
from contextlib import asynccontextmanager

import tiktoken
from fastapi import FastAPI, Request
from fastapi.responses import Response
from pydantic import BaseModel
from prometheus_client import generate_latest, CONTENT_TYPE_LATEST

from pricing import MODEL_PRICING, compute_cost
from telemetry import (
    tracer,
    provider,
    REQUEST_COST,
    INPUT_TOKENS,
    OUTPUT_TOKENS,
    REQUEST_LATENCY,
    COST_PER_REQUEST,
    COST_ALERT_THRESHOLD,
)

# Per-tenant hourly cost thresholds (USD). Set these based on your budget.
TENANT_HOURLY_LIMITS: dict[str, float] = {
    "team-search": 50.0,
    "team-chat": 100.0,
    "team-internal": 10.0,
    "default": 25.0,
}


@asynccontextmanager
async def lifespan(app: FastAPI):
    # Initialize alert thresholds on startup
    for tenant, limit in TENANT_HOURLY_LIMITS.items():
        COST_ALERT_THRESHOLD.labels(tenant=tenant).set(limit)
    yield
    # Flush remaining spans on shutdown
    provider.shutdown()


app = FastAPI(title="Inference Cost Tracker", lifespan=lifespan)


class InferenceRequest(BaseModel):
    prompt: str
    model: str = "gpt-4o-mini"
    tenant: str = "default"
    max_tokens: int = 512


class InferenceResponse(BaseModel):
    text: str
    model: str
    input_tokens: int
    output_tokens: int
    cost_usd: float


def count_tokens(text: str, model: str) -> int:
    """Count tokens using tiktoken. Falls back to whitespace split for unknown models."""
    try:
        encoding = tiktoken.encoding_for_model(model)
        return len(encoding.encode(text))
    except KeyError:
        # For non-OpenAI models, approximate with cl100k_base
        encoding = tiktoken.get_encoding("cl100k_base")
        return len(encoding.encode(text))


@app.post("/v1/inference", response_model=InferenceResponse)
async def run_inference(req: InferenceRequest):
    start = time.perf_counter()

    with tracer.start_as_current_span("inference_request") as span:
        span.set_attribute("model", req.model)
        span.set_attribute("tenant", req.tenant)
        span.set_attribute("max_tokens", req.max_tokens)

        # Count input tokens
        input_token_count = count_tokens(req.prompt, req.model)
        span.set_attribute("input_tokens", input_token_count)

        # --- Your actual model call goes here ---
        # Replace this block with your real inference client.
        # Example: response = await openai_client.chat.completions.create(...)
        generated_text = f"Response to: {req.prompt[:50]}..."
        output_token_count = count_tokens(generated_text, req.model)
        # --- End model call ---

        span.set_attribute("output_tokens", output_token_count)

        # Compute cost
        try:
            request_cost = compute_cost(req.model, input_token_count, output_token_count)
        except KeyError:
            span.set_attribute("cost.error", f"unknown model: {req.model}")
            request_cost = 0.0

        span.set_attribute("cost_usd", request_cost)

        # Record Prometheus metrics
        REQUEST_COST.labels(model=req.model, tenant=req.tenant).inc(request_cost)
        INPUT_TOKENS.labels(model=req.model, tenant=req.tenant).inc(input_token_count)
        OUTPUT_TOKENS.labels(model=req.model, tenant=req.tenant).inc(output_token_count)
        COST_PER_REQUEST.labels(model=req.model).observe(request_cost)

        latency = time.perf_counter() - start
        REQUEST_LATENCY.labels(model=req.model).observe(latency)

        span.set_attribute("latency_seconds", latency)

    return InferenceResponse(
        text=generated_text,
        model=req.model,
        input_tokens=input_token_count,
        output_tokens=output_token_count,
        cost_usd=request_cost,
    )


@app.get("/metrics")
async def metrics():
    return Response(
        content=generate_latest(),
        media_type=CONTENT_TYPE_LATEST,
    )


@app.get("/health")
async def health():
    return {"status": "ok", "models": list(MODEL_PRICING.keys())}

Run it with uvicorn server:app --host 0.0.0.0 --port 8000. Hit the /metrics endpoint to see Prometheus-formatted output. Point your Prometheus scrape_configs at port 8000:

# prometheus.yml (snippet)
scrape_configs:
  - job_name: "inference-cost-tracker"
    scrape_interval: 15s
    static_configs:
      - targets: ["localhost:8000"]

Build Cost Alerting with Prometheus Rules

Raw metrics are useless without alerts. This Prometheus alerting rule fires when any tenant’s hourly inference spend exceeds their configured threshold.

# alert_rules.yml
groups:
  - name: inference_cost_alerts
    interval: 60s
    rules:
      - alert: InferenceCostHigh
        expr: >
          sum by (tenant) (
            rate(inference_cost_usd_total[1h])
          ) * 3600
          >
          max by (tenant) (inference_cost_alert_threshold_usd)
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Tenant {{ $labels.tenant }} inference cost exceeds hourly limit"
          description: >
            Tenant {{ $labels.tenant }} is spending ${{ $value | printf "%.2f" }}/hour
            on inference, which exceeds the configured threshold.

      - alert: InferenceCostPerRequestSpike
        expr: >
          histogram_quantile(0.95,
            rate(inference_cost_per_request_usd_bucket[15m])
          ) > 0.10
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "P95 inference cost per request exceeds $0.10"
          description: >
            The 95th percentile cost per request is ${{ $value | printf "%.4f" }}.
            Check for unexpectedly long prompts or misconfigured model routing.

The first rule estimates hourly spend using rate() over a 1-hour window and compares against the per-tenant gauge you set at startup. The second rule catches situations where individual requests get abnormally expensive – which usually means someone is sending 100K-token prompts or your model router sent traffic to GPT-4o instead of the mini variant.
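
Before the alerts are wired up, you can sanity-check spend with an ad-hoc query in the Prometheus UI. This sketch totals each tenant's spend over the last 24 hours, using the counter defined earlier:

```promql
sum by (tenant) (increase(inference_cost_usd_total[24h]))
```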

Add the rules file to your Prometheus config:

# prometheus.yml (add this)
rule_files:
  - "alert_rules.yml"

Test the Pipeline End to End

Send some requests and verify the metrics show up correctly.

# Send a few test requests
curl -s -X POST http://localhost:8000/v1/inference \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Explain transformers in 3 sentences", "model": "gpt-4o-mini", "tenant": "team-search"}' | python3 -m json.tool

# Expected output:
# {
#     "text": "Response to: Explain transformers in 3 sentences...",
#     "model": "gpt-4o-mini",
#     "input_tokens": 6,
#     "output_tokens": 11,
#     "cost_usd": 7.5e-06
# }

# Check the Prometheus metrics endpoint
curl -s http://localhost:8000/metrics | grep inference_cost

# Expected output (values vary):
# inference_cost_usd_total{model="gpt-4o-mini",tenant="team-search"} 7.5e-06
# inference_cost_per_request_usd_bucket{le="0.0001",model="gpt-4o-mini"} 1.0
# inference_input_tokens_total{model="gpt-4o-mini",tenant="team-search"} 6.0
# inference_output_tokens_total{model="gpt-4o-mini",tenant="team-search"} 11.0

To simulate a cost spike and verify alerting, send a batch of requests with the expensive model:

# Blast 50 requests to simulate load
for i in $(seq 1 50); do
  curl -s -X POST http://localhost:8000/v1/inference \
    -H "Content-Type: application/json" \
    -d "{\"prompt\": \"Write a detailed essay about request number $i with extensive analysis and thorough coverage of all relevant topics\", \"model\": \"gpt-4o\", \"tenant\": \"team-chat\"}" > /dev/null
done

# Check cost accumulation
curl -s http://localhost:8000/metrics | grep 'inference_cost_usd_total{model="gpt-4o"'

Common Errors and Fixes

KeyError: 'gpt-4-turbo' in compute_cost

You sent a request with a model name that is not in your MODEL_PRICING dictionary. The endpoint catches this and sets cost to 0.0, but the span records the error. Fix it by adding the model to pricing.py:

MODEL_PRICING["gpt-4-turbo"] = ModelPricing(
    input_cost_per_1k=0.01,
    output_cost_per_1k=0.03,
)

ConnectionRefusedError: [Errno 111] Connection refused from OTLPSpanExporter

The OpenTelemetry collector is not running at localhost:4317. Either start an OTel collector or switch to console export for local development:

# Replace OTLPSpanExporter with ConsoleSpanExporter for debugging
from opentelemetry.sdk.trace.export import ConsoleSpanExporter

provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))

This does not block requests – the BatchSpanProcessor drops spans silently when the exporter fails. But your traces will be missing, so check this early.

ValueError: Duplicated timeseries in CollectorRegistry

This happens when the metrics are registered more than once against the same registry, typically because the module is importable under two different paths or gets re-imported by a test runner. Fix it by using a custom registry:

from prometheus_client import CollectorRegistry

REGISTRY = CollectorRegistry()

REQUEST_COST = Counter(
    "inference_cost_usd_total",
    "Cumulative inference cost in USD",
    ["model", "tenant"],
    registry=REGISTRY,
)
# ... register all metrics to REGISTRY

# In the /metrics endpoint:
@app.get("/metrics")
async def metrics():
    return Response(
        content=generate_latest(REGISTRY),
        media_type=CONTENT_TYPE_LATEST,
    )

tiktoken.model.MODEL_TO_ENCODING KeyError for open-source models

Tiktoken only knows about OpenAI model names. The count_tokens function already handles this by falling back to cl100k_base, but if you need accurate counts for Llama or Mixtral, load the actual tokenizer:

from transformers import AutoTokenizer

_tokenizer_cache: dict[str, AutoTokenizer] = {}

def count_tokens_hf(text: str, model_id: str) -> int:
    if model_id not in _tokenizer_cache:
        _tokenizer_cache[model_id] = AutoTokenizer.from_pretrained(model_id)
    return len(_tokenizer_cache[model_id].encode(text))

rate() returning NaN in Prometheus alert rules

Prometheus needs at least two samples inside the range window before rate() returns anything. With a 15-second scrape interval, wait at least 30 seconds after starting the server before expecting values. Rules with [1h] windows begin evaluating as soon as two samples exist, but the extrapolated rate is skewed low until a full hour of data has accumulated.
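
A quick way to check whether a series has enough samples for rate() to produce output is a sketch like this, which returns results only for series with at least two samples in the window:

```promql
count_over_time(inference_cost_usd_total[5m]) >= 2
```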