The default Kubernetes Horizontal Pod Autoscaler (HPA) scales on CPU and memory. That works fine for web servers, but ML model serving has a different bottleneck profile. A pod can sit at 30% CPU while inference latency spikes because the GPU is saturated, the request queue is backing up, or batch sizes are too large. You need to scale on metrics that actually reflect serving performance – inference latency, queue depth, and GPU utilization.
This guide builds a complete custom metrics autoscaling pipeline: a FastAPI model server that exposes Prometheus metrics, a prometheus-adapter that makes those metrics available to the Kubernetes API, and an HPA that scales pods based on real inference behavior.
Exposing Custom Metrics from Your Model Server#
The model server needs to export three Prometheus metrics: a histogram for inference latency, a gauge for current queue depth, and a gauge for GPU utilization. The prometheus_client library handles the Prometheus exposition format, and you serve the /metrics endpoint alongside your prediction API.
```python
# server.py
import asyncio
import time
from collections.abc import AsyncGenerator
from contextlib import asynccontextmanager

import torch
from fastapi import FastAPI
from prometheus_client import (
    CONTENT_TYPE_LATEST,
    CollectorRegistry,
    Gauge,
    Histogram,
    generate_latest,
)
from pydantic import BaseModel
from starlette.responses import Response

# Create a custom registry to avoid default process/platform metrics clutter
registry = CollectorRegistry()

INFERENCE_LATENCY = Histogram(
    "inference_latency_seconds",
    "Time spent on model inference",
    buckets=[0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0],
    registry=registry,
)
REQUEST_QUEUE_DEPTH = Gauge(
    "request_queue_depth",
    "Number of inference requests currently waiting or in-flight",
    registry=registry,
)
GPU_UTILIZATION = Gauge(
    "gpu_utilization_percent",
    "Current GPU utilization percentage",
    registry=registry,
)

model_state: dict = {}


def get_gpu_utilization() -> float:
    """Read GPU utilization via pynvml; fall back to 0 if unavailable."""
    try:
        import pynvml

        pynvml.nvmlInit()
        handle = pynvml.nvmlDeviceGetHandleByIndex(0)
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        pynvml.nvmlShutdown()
        return float(util.gpu)
    except Exception:
        return 0.0


async def update_gpu_metrics() -> None:
    """Background task that polls GPU utilization every 5 seconds."""
    while True:
        GPU_UTILIZATION.set(get_gpu_utilization())
        await asyncio.sleep(5)


@asynccontextmanager
async def lifespan(app: FastAPI) -> AsyncGenerator[None, None]:
    # Load model at startup
    print("Loading model...")
    model_state["model"] = torch.nn.Linear(768, 2)
    model_state["model"].eval()
    # Start background GPU metrics polling
    task = asyncio.create_task(update_gpu_metrics())
    yield
    task.cancel()
    model_state.clear()
    print("Model unloaded")


app = FastAPI(lifespan=lifespan)


class PredictRequest(BaseModel):
    features: list[float]


class PredictResponse(BaseModel):
    prediction: list[float]
    latency_ms: float


@app.post("/predict", response_model=PredictResponse)
async def predict(req: PredictRequest):
    REQUEST_QUEUE_DEPTH.inc()
    try:
        start = time.perf_counter()
        tensor_input = torch.tensor([req.features[:768]], dtype=torch.float32)
        # Pad if input is shorter than expected
        if tensor_input.shape[1] < 768:
            padding = torch.zeros(1, 768 - tensor_input.shape[1])
            tensor_input = torch.cat([tensor_input, padding], dim=1)
        with torch.no_grad():
            output = model_state["model"](tensor_input)
        elapsed = time.perf_counter() - start
        INFERENCE_LATENCY.observe(elapsed)
        return PredictResponse(
            prediction=output[0].tolist(),
            latency_ms=round(elapsed * 1000, 2),
        )
    finally:
        REQUEST_QUEUE_DEPTH.dec()


@app.get("/metrics")
async def metrics():
    return Response(
        content=generate_latest(registry),
        media_type=CONTENT_TYPE_LATEST,
    )


@app.get("/health")
async def health():
    if "model" not in model_state:
        return Response(status_code=503, content="model not loaded")
    return {"status": "healthy"}
```
Install the dependencies:
```bash
pip install fastapi uvicorn prometheus_client torch pynvml pydantic
```
Run locally to verify metrics work:
```bash
uvicorn server:app --host 0.0.0.0 --port 8080

# In another terminal:
curl -s http://localhost:8080/metrics | grep inference_latency
```
You should see the histogram buckets and counts. After a few /predict requests, the inference_latency_seconds histogram populates and request_queue_depth tracks concurrent requests.
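If you want to script that verification, the exposition format is plain text and easy to parse. This sketch extracts a single sample value from a metrics payload; the `sample` text below is illustrative, standing in for what you'd fetch from `http://localhost:8080/metrics` after a few `/predict` calls:

```python
def read_sample(metrics_text: str, name: str) -> float:
    """Return the value of the first sample whose name matches exactly."""
    for line in metrics_text.splitlines():
        if line.startswith("#"):
            continue  # skip HELP/TYPE comment lines
        parts = line.split()
        if len(parts) == 2 and parts[0] == name:
            return float(parts[1])
    raise KeyError(name)


# Illustrative payload, not real server output:
sample = """\
# HELP inference_latency_seconds Time spent on model inference
# TYPE inference_latency_seconds histogram
inference_latency_seconds_count 3.0
inference_latency_seconds_sum 0.045
request_queue_depth 0.0
"""

print(read_sample(sample, "inference_latency_seconds_count"))  # 3.0
```

A check like `read_sample(body, "inference_latency_seconds_count") > 0` after a warm-up request makes a reasonable smoke test in CI.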
Deploying Prometheus and the Custom Metrics Adapter#
Your cluster needs three things: Prometheus scraping the model server pods, the prometheus-adapter translating those metrics into the Kubernetes custom metrics API, and a ServiceMonitor telling Prometheus where to scrape.
First, deploy your model server. This Deployment and Service expose the metrics port for Prometheus:
```yaml
# model-server-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-server
  labels:
    app: model-server
spec:
  replicas: 2
  selector:
    matchLabels:
      app: model-server
  template:
    metadata:
      labels:
        app: model-server
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8080"
        prometheus.io/path: "/metrics"
    spec:
      containers:
        - name: model-server
          image: your-registry/model-server:latest
          ports:
            - containerPort: 8080
              name: http
          resources:
            requests:
              cpu: "2"
              memory: "4Gi"
              nvidia.com/gpu: 1
            limits:
              nvidia.com/gpu: 1
          readinessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 60
            periodSeconds: 30
---
apiVersion: v1
kind: Service
metadata:
  name: model-server
  labels:
    app: model-server
spec:
  selector:
    app: model-server
  ports:
    - port: 80
      targetPort: 8080
      name: http
```
If you’re using the Prometheus Operator (installed via kube-prometheus-stack), create a ServiceMonitor:
```yaml
# servicemonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: model-server-monitor
  labels:
    release: prometheus  # must match your Prometheus Operator's label selector
spec:
  selector:
    matchLabels:
      app: model-server
  endpoints:
    - port: http
      path: /metrics
      interval: 15s
```
Now install the prometheus-adapter. This bridges Prometheus metrics into the Kubernetes custom metrics API (custom.metrics.k8s.io), which the HPA reads from:
```bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prometheus-adapter prometheus-community/prometheus-adapter \
  --namespace monitoring \
  --set prometheus.url=http://prometheus-kube-prometheus-prometheus.monitoring.svc \
  --set prometheus.port=9090 \
  -f adapter-values.yaml
```
The adapter-values.yaml file configures which Prometheus metrics get exposed to Kubernetes and how they’re named:
```yaml
# adapter-values.yaml
rules:
  custom:
    - seriesQuery: 'inference_latency_seconds_sum{namespace!="",pod!=""}'
      resources:
        overrides:
          namespace: {resource: "namespace"}
          pod: {resource: "pod"}
      name:
        matches: "^(.*)_sum$"
        as: "inference_latency_seconds_avg"
      metricsQuery: 'rate(inference_latency_seconds_sum{<<.LabelMatchers>>}[2m]) / rate(inference_latency_seconds_count{<<.LabelMatchers>>}[2m])'
    - seriesQuery: 'request_queue_depth{namespace!="",pod!=""}'
      resources:
        overrides:
          namespace: {resource: "namespace"}
          pod: {resource: "pod"}
      name:
        matches: "^(.*)$"
        as: "${1}"
      metricsQuery: '<<.Series>>{<<.LabelMatchers>>}'
    - seriesQuery: 'gpu_utilization_percent{namespace!="",pod!=""}'
      resources:
        overrides:
          namespace: {resource: "namespace"}
          pod: {resource: "pod"}
      name:
        matches: "^(.*)$"
        as: "${1}"
      metricsQuery: '<<.Series>>{<<.LabelMatchers>>}'
```
The first rule computes a rolling average inference latency from the histogram’s _sum and _count series. The other two pass through the gauge values directly.
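The `rate(sum)/rate(count)` expression is just the change in total latency divided by the number of requests completed over the window, which is the mean per-request latency during that window. A minimal numeric sketch (the scrape values are made up):

```python
def avg_latency(sum_t0: float, sum_t1: float,
                count_t0: float, count_t1: float) -> float:
    """Average per-request latency over one window, computed from two
    scrapes of a histogram's _sum and _count series."""
    delta_sum = sum_t1 - sum_t0        # seconds of latency accrued in the window
    delta_count = count_t1 - count_t0  # requests completed in the window
    if delta_count == 0:
        return 0.0  # no traffic; Prometheus would return no data points here
    return delta_sum / delta_count


# 150 requests completed, adding 33 s of total latency -> 0.22 s average
print(avg_latency(sum_t0=10.0, sum_t1=43.0, count_t0=50, count_t1=200))
```

This is also why the counters-only-grow property (noted again in the troubleshooting section) is harmless: the differences, not the absolute values, carry the signal.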
Verify the custom metrics API is working:
```bash
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1" | jq '.resources[].name'
```
You should see pods/inference_latency_seconds_avg, pods/request_queue_depth, and pods/gpu_utilization_percent in the output.
Configuring the Horizontal Pod Autoscaler#
With custom metrics available in the Kubernetes API, the HPA can target them directly. This config scales when average inference latency exceeds 200ms or when any pod’s queue depth goes above 10:
```yaml
# hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: model-server-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: model-server
  minReplicas: 2
  maxReplicas: 10
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 30
      policies:
        - type: Pods
          value: 2
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Pods
          value: 1
          periodSeconds: 120
  metrics:
    - type: Pods
      pods:
        metric:
          name: inference_latency_seconds_avg
        target:
          type: AverageValue
          averageValue: "200m"  # 0.2 seconds, i.e. 200 ms ("m" is the Kubernetes milli suffix)
    - type: Pods
      pods:
        metric:
          name: request_queue_depth
        target:
          type: AverageValue
          averageValue: "10"
```
The behavior section matters. ML pods are slow to start – model loading, GPU warmup, and health checks can take 30-60 seconds. Setting a longer stabilizationWindowSeconds for scale-down prevents thrashing. For scale-up, the 30-second window lets the HPA react quickly when latency spikes.
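Two pieces of the HPA algorithm are worth internalizing here. A simplified sketch (not the exact controller code, which also handles tolerances and readiness): the core recommendation is `ceil(currentReplicas * currentValue / targetValue)` per metric, and during scale-down the controller acts on the highest recommendation seen inside the stabilization window, so a brief dip can't shed pods:

```python
import math


def desired_replicas(current: int, metric_value: float, target: float) -> int:
    """Core HPA formula: scale the replica count by the metric/target ratio."""
    return math.ceil(current * metric_value / target)


def scale_down_choice(recommendations: list[int]) -> int:
    """During scale-down, the HPA uses the *highest* recommendation observed
    inside stabilizationWindowSeconds, so one low sample can't trigger a drop."""
    return max(recommendations)


# Latency at 0.4 s against the 0.2 s target doubles the replica count:
print(desired_replicas(current=4, metric_value=0.4, target=0.2))  # 8

# Recommendations sampled over the 5-minute window; the brief dip to 3 is ignored:
print(scale_down_choice([6, 5, 3, 6, 6]))  # 6
```

With multiple metrics configured, as in the HPA above, the controller computes a recommendation per metric and takes the maximum, so whichever signal is worst drives the scale-up.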
Apply everything:
```bash
kubectl apply -f model-server-deployment.yaml
kubectl apply -f servicemonitor.yaml
kubectl apply -f hpa.yaml
```
Check the HPA status:
```bash
kubectl get hpa model-server-hpa -w
```
You’ll see the current and target values for each metric. If it shows <unknown> for the custom metrics, the prometheus-adapter isn’t finding the metrics yet – check the adapter logs and verify the ServiceMonitor is scraping correctly.
Testing the Autoscaler Under Load#
Use hey to blast the prediction endpoint and watch pods scale up. Install it first:
```bash
go install github.com/rakyll/hey@latest
```
Generate sustained load with 50 concurrent workers for 2 minutes:
```bash
hey -z 120s -c 50 -m POST \
  -H "Content-Type: application/json" \
  -d '{"features": [0.1, 0.2, 0.3, 0.4, 0.5]}' \
  http://model-server.default.svc.cluster.local/predict
```
If you’re testing from outside the cluster, port-forward first:
```bash
kubectl port-forward svc/model-server 8080:80 &
hey -z 120s -c 50 -m POST \
  -H "Content-Type: application/json" \
  -d '{"features": [0.1, 0.2, 0.3, 0.4, 0.5]}' \
  http://localhost:8080/predict
```
In another terminal, watch the autoscaler respond:
```bash
kubectl get hpa model-server-hpa -w
```
You should see inference_latency_seconds_avg climb past 200m and request_queue_depth exceed 10. Within a minute, the HPA will start adding pods. After the load stops, it’ll wait the 5-minute stabilization window before scaling back down.
To see the scaling events:
```bash
kubectl describe hpa model-server-hpa
```
Look for the Events section at the bottom. It shows every scale-up and scale-down decision with timestamps and the metric values that triggered it.
Common Errors and Fixes#
HPA shows <unknown> for custom metrics. The prometheus-adapter can’t find the metrics. Check three things: (1) the ServiceMonitor labels match your Prometheus Operator’s selector (release: prometheus is the default for kube-prometheus-stack), (2) the adapter’s seriesQuery matches the actual metric name in Prometheus, and (3) the adapter pod is running (kubectl get pods -n monitoring | grep adapter).
Metrics API returns empty results. Run kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/*/inference_latency_seconds_avg" to query the API directly. If it returns an empty items list, the adapter rules don’t match. Check the adapter logs: kubectl logs -n monitoring deploy/prometheus-adapter.
Pods scale up but never scale down. The stabilizationWindowSeconds for scale-down is set to 300 seconds (5 minutes) by design. But if pods never scale down even after traffic drops, check that the metrics actually decrease. A histogram sum only grows, so the rate() query in the adapter must divide sum rate by count rate correctly. Verify with a direct Prometheus query.
Scale-up is too slow for traffic spikes. Reduce the stabilizationWindowSeconds for scale-up to 0 and increase the value in the scale-up policy. You can also add a second HPA metric on CPU as a fast-reacting fallback alongside the custom metrics.
GPU utilization reads 0. The pynvml library needs the NVIDIA driver to be visible inside the container. Make sure your container runtime has GPU support configured (nvidia-container-runtime) and the NVIDIA_VISIBLE_DEVICES environment variable is set. On Kubernetes, the nvidia.com/gpu resource limit handles this automatically when using the NVIDIA device plugin.
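One related cleanup worth making once the driver is visible: the `get_gpu_utilization` helper in the server pays `nvmlInit`/`nvmlShutdown` on every 5-second poll. A sketch of an alternative that initializes NVML once and caches the device handle, keeping the same fall-back-to-zero behavior on CPU-only machines (`GpuMonitor` is a hypothetical name, not part of pynvml):

```python
class GpuMonitor:
    """Initialize NVML once and reuse the device handle across polls."""

    def __init__(self, device_index: int = 0):
        self._handle = None
        self._pynvml = None
        try:
            import pynvml

            pynvml.nvmlInit()
            self._pynvml = pynvml
            self._handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
        except Exception:
            pass  # driver or library not visible; utilization() reports 0.0

    def utilization(self) -> float:
        if self._handle is None:
            return 0.0
        try:
            return float(
                self._pynvml.nvmlDeviceGetUtilizationRates(self._handle).gpu
            )
        except Exception:
            return 0.0  # e.g. GPU lost or NVML error mid-flight


monitor = GpuMonitor()
print(monitor.utilization())  # 0.0 on a machine without an NVIDIA GPU
```

In the server, you would construct one `GpuMonitor` in `lifespan` and have the background task call `monitor.utilization()` instead of `get_gpu_utilization()`.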