The default Kubernetes Horizontal Pod Autoscaler (HPA) scales on CPU and memory. That works fine for web servers, but ML model serving has a different bottleneck profile. A pod can sit at 30% CPU while inference latency spikes because the GPU is saturated, the request queue is backing up, or batch sizes are too large. You need to scale on metrics that actually reflect serving performance – inference latency, queue depth, and GPU utilization.
This guide builds a complete custom metrics autoscaling pipeline: a FastAPI model server that exposes Prometheus metrics, a prometheus-adapter that makes those metrics available to the Kubernetes API, and an HPA that scales pods based on real inference behavior.
Exposing Custom Metrics from Your Model Server#
The model server needs to export three Prometheus metrics: a histogram for inference latency, a gauge for current queue depth, and a gauge for GPU utilization. The prometheus_client library handles the Prometheus exposition format, and you serve the /metrics endpoint alongside your prediction API.
```python
# server.py
import asyncio
import time
from collections.abc import AsyncGenerator
from contextlib import asynccontextmanager

import torch
from fastapi import FastAPI
from prometheus_client import (
    CONTENT_TYPE_LATEST,
    CollectorRegistry,
    Gauge,
    Histogram,
    generate_latest,
)
from pydantic import BaseModel
from starlette.responses import Response

# Create a custom registry to avoid default process/platform metrics clutter
registry = CollectorRegistry()

INFERENCE_LATENCY = Histogram(
    "inference_latency_seconds",
    "Time spent on model inference",
    buckets=[0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0],
    registry=registry,
)
REQUEST_QUEUE_DEPTH = Gauge(
    "request_queue_depth",
    "Number of inference requests currently waiting or in-flight",
    registry=registry,
)
GPU_UTILIZATION = Gauge(
    "gpu_utilization_percent",
    "Current GPU utilization percentage",
    registry=registry,
)

model_state: dict = {}


def get_gpu_utilization() -> float:
    """Read GPU utilization via pynvml; fall back to 0 if unavailable."""
    try:
        import pynvml

        pynvml.nvmlInit()
        handle = pynvml.nvmlDeviceGetHandleByIndex(0)
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        pynvml.nvmlShutdown()
        return float(util.gpu)
    except Exception:
        return 0.0


async def update_gpu_metrics() -> None:
    """Background task that polls GPU utilization every 5 seconds."""
    while True:
        GPU_UTILIZATION.set(get_gpu_utilization())
        await asyncio.sleep(5)


@asynccontextmanager
async def lifespan(app: FastAPI) -> AsyncGenerator[None, None]:
    # Load model at startup
    print("Loading model...")
    model_state["model"] = torch.nn.Linear(768, 2)
    model_state["model"].eval()
    # Start background GPU metrics polling
    task = asyncio.create_task(update_gpu_metrics())
    yield
    task.cancel()
    model_state.clear()
    print("Model unloaded")


app = FastAPI(lifespan=lifespan)


class PredictRequest(BaseModel):
    features: list[float]


class PredictResponse(BaseModel):
    prediction: list[float]
    latency_ms: float


@app.post("/predict", response_model=PredictResponse)
async def predict(req: PredictRequest):
    REQUEST_QUEUE_DEPTH.inc()
    try:
        start = time.perf_counter()
        tensor_input = torch.tensor([req.features[:768]], dtype=torch.float32)
        # Pad if input is shorter than expected
        if tensor_input.shape[1] < 768:
            padding = torch.zeros(1, 768 - tensor_input.shape[1])
            tensor_input = torch.cat([tensor_input, padding], dim=1)
        with torch.no_grad():
            output = model_state["model"](tensor_input)
        elapsed = time.perf_counter() - start
        INFERENCE_LATENCY.observe(elapsed)
        return PredictResponse(
            prediction=output[0].tolist(),
            latency_ms=round(elapsed * 1000, 2),
        )
    finally:
        REQUEST_QUEUE_DEPTH.dec()


@app.get("/metrics")
async def metrics():
    return Response(
        content=generate_latest(registry),
        media_type=CONTENT_TYPE_LATEST,
    )


@app.get("/health")
async def health():
    if "model" not in model_state:
        return Response(status_code=503, content="model not loaded")
    return {"status": "healthy"}
```
Install the dependencies:
```bash
pip install fastapi uvicorn prometheus_client torch pynvml pydantic
```
Run locally to verify metrics work:
```bash
uvicorn server:app --host 0.0.0.0 --port 8080

# In another terminal:
curl -s http://localhost:8080/metrics | grep inference_latency
```
You should see the histogram buckets and counts. After a few /predict requests, the inference_latency_seconds histogram populates and request_queue_depth tracks concurrent requests.
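If you want to script that verification, the exposition format is plain text and easy to parse. This sketch extracts a single sample value from a metrics payload; the `sample` text below is illustrative, standing in for what you'd fetch from `http://localhost:8080/metrics` after a few `/predict` calls:

```python
def read_sample(metrics_text: str, name: str) -> float:
    """Return the value of the first sample whose name matches exactly."""
    for line in metrics_text.splitlines():
        if line.startswith("#"):
            continue  # skip HELP/TYPE comment lines
        parts = line.split()
        if len(parts) == 2 and parts[0] == name:
            return float(parts[1])
    raise KeyError(name)


# Illustrative payload, not real server output:
sample = """\
# HELP inference_latency_seconds Time spent on model inference
# TYPE inference_latency_seconds histogram
inference_latency_seconds_count 3.0
inference_latency_seconds_sum 0.045
request_queue_depth 0.0
"""

print(read_sample(sample, "inference_latency_seconds_count"))  # 3.0
```

A check like `read_sample(body, "inference_latency_seconds_count") > 0` after a warm-up request makes a reasonable smoke test in CI.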
Deploying Prometheus and the Custom Metrics Adapter#
Your cluster needs three things: Prometheus scraping the model server pods, the prometheus-adapter translating those metrics into the Kubernetes custom metrics API, and a ServiceMonitor telling Prometheus where to scrape.
First, deploy your model server. This Deployment and Service expose the metrics port for Prometheus:
```yaml
# model-server-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-server
  labels:
    app: model-server
spec:
  replicas: 2
  selector:
    matchLabels:
      app: model-server
  template:
    metadata:
      labels:
        app: model-server
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8080"
        prometheus.io/path: "/metrics"
    spec:
      containers:
        - name: model-server
          image: your-registry/model-server:latest
          ports:
            - containerPort: 8080
              name: http
          resources:
            requests:
              cpu: "2"
              memory: "4Gi"
              nvidia.com/gpu: 1
            limits:
              nvidia.com/gpu: 1
          readinessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 60
            periodSeconds: 30
---
apiVersion: v1
kind: Service
metadata:
  name: model-server
  labels:
    app: model-server
spec:
  selector:
    app: model-server
  ports:
    - port: 80
      targetPort: 8080
      name: http
```
If you’re using the Prometheus Operator (installed via kube-prometheus-stack), create a ServiceMonitor:
```yaml
# servicemonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: model-server-monitor
  labels:
    release: prometheus  # must match your Prometheus Operator's label selector
spec:
  selector:
    matchLabels:
      app: model-server
  endpoints:
    - port: http
      path: /metrics
      interval: 15s
```
Now install the prometheus-adapter. This bridges Prometheus metrics into the Kubernetes custom metrics API (custom.metrics.k8s.io), which the HPA reads from:
```bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prometheus-adapter prometheus-community/prometheus-adapter \
  --namespace monitoring \
  --set prometheus.url=http://prometheus-kube-prometheus-prometheus.monitoring.svc \
  --set prometheus.port=9090 \
  -f adapter-values.yaml
```
The adapter-values.yaml file configures which Prometheus metrics get exposed to Kubernetes and how they’re named:
```yaml
# adapter-values.yaml
rules:
  custom:
    - seriesQuery: 'inference_latency_seconds_sum{namespace!="",pod!=""}'
      resources:
        overrides:
          namespace: {resource: "namespace"}
          pod: {resource: "pod"}
      name:
        matches: "^(.*)_sum$"
        as: "inference_latency_seconds_avg"
      metricsQuery: 'rate(inference_latency_seconds_sum{<<.LabelMatchers>>}[2m]) / rate(inference_latency_seconds_count{<<.LabelMatchers>>}[2m])'
    - seriesQuery: 'request_queue_depth{namespace!="",pod!=""}'
      resources:
        overrides:
          namespace: {resource: "namespace"}
          pod: {resource: "pod"}
      name:
        matches: "^(.*)$"
        as: "${1}"
      metricsQuery: '<<.Series>>{<<.LabelMatchers>>}'
    - seriesQuery: 'gpu_utilization_percent{namespace!="",pod!=""}'
      resources:
        overrides:
          namespace: {resource: "namespace"}
          pod: {resource: "pod"}
      name:
        matches: "^(.*)$"
        as: "${1}"
      metricsQuery: '<<.Series>>{<<.LabelMatchers>>}'
```
The first rule computes a rolling average inference latency from the histogram’s _sum and _count series. The other two pass through the gauge values directly.
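The `rate(sum)/rate(count)` expression is just the change in total latency divided by the number of requests completed over the window, which is the mean per-request latency during that window. A minimal numeric sketch (the scrape values are made up):

```python
def avg_latency(sum_t0: float, sum_t1: float,
                count_t0: float, count_t1: float) -> float:
    """Average per-request latency over one window, computed from two
    scrapes of a histogram's _sum and _count series."""
    delta_sum = sum_t1 - sum_t0        # seconds of latency accrued in the window
    delta_count = count_t1 - count_t0  # requests completed in the window
    if delta_count == 0:
        return 0.0  # no traffic; Prometheus would return no data points here
    return delta_sum / delta_count


# 150 requests completed, adding 33 s of total latency -> 0.22 s average
print(avg_latency(sum_t0=10.0, sum_t1=43.0, count_t0=50, count_t1=200))
```

This is also why the counters-only-grow property (noted again in the troubleshooting section) is harmless: the differences, not the absolute values, carry the signal.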
Verify the custom metrics API is working:
```bash
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1" | jq '.resources[].name'
```
You should see pods/inference_latency_seconds_avg, pods/request_queue_depth, and pods/gpu_utilization_percent in the output.
Configuring the Horizontal Pod Autoscaler#
With custom metrics available in the Kubernetes API, the HPA can target them directly. This config scales when average inference latency exceeds 200ms or when any pod’s queue depth goes above 10:
```yaml
# hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: model-server-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: model-server
  minReplicas: 2
  maxReplicas: 10
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 30
      policies:
        - type: Pods
          value: 2
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Pods
          value: 1
          periodSeconds: 120
  metrics:
    - type: Pods
      pods:
        metric:
          name: inference_latency_seconds_avg
        target:
          type: AverageValue
          averageValue: "200m"  # 0.2 seconds, i.e. 200 ms ("m" is the Kubernetes milli suffix)
    - type: Pods
      pods:
        metric:
          name: request_queue_depth
        target:
          type: AverageValue
          averageValue: "10"
```
The behavior section matters. ML pods are slow to start – model loading, GPU warmup, and health checks can take 30-60 seconds. Setting a longer stabilizationWindowSeconds for scale-down prevents thrashing. For scale-up, the 30-second window lets the HPA react quickly when latency spikes.
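Two pieces of the HPA algorithm are worth internalizing here. A simplified sketch (not the exact controller code, which also handles tolerances and readiness): the core recommendation is `ceil(currentReplicas * currentValue / targetValue)` per metric, and during scale-down the controller acts on the highest recommendation seen inside the stabilization window, so a brief dip can't shed pods:

```python
import math


def desired_replicas(current: int, metric_value: float, target: float) -> int:
    """Core HPA formula: scale the replica count by the metric/target ratio."""
    return math.ceil(current * metric_value / target)


def scale_down_choice(recommendations: list[int]) -> int:
    """During scale-down, the HPA uses the *highest* recommendation observed
    inside stabilizationWindowSeconds, so one low sample can't trigger a drop."""
    return max(recommendations)


# Latency at 0.4 s against the 0.2 s target doubles the replica count:
print(desired_replicas(current=4, metric_value=0.4, target=0.2))  # 8

# Recommendations sampled over the 5-minute window; the brief dip to 3 is ignored:
print(scale_down_choice([6, 5, 3, 6, 6]))  # 6
```

With multiple metrics configured, as in the HPA above, the controller computes a recommendation per metric and takes the maximum, so whichever signal is worst drives the scale-up.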
Apply everything:
```bash
kubectl apply -f model-server-deployment.yaml
kubectl apply -f servicemonitor.yaml
kubectl apply -f hpa.yaml
```
Check the HPA status:
```bash
kubectl get hpa model-server-hpa -w
```
You’ll see the current and target values for each metric. If it shows <unknown> for the custom metrics, the prometheus-adapter isn’t finding the metrics yet – check the adapter logs and verify the ServiceMonitor is scraping correctly.
Testing the Autoscaler Under Load#
Use hey to blast the prediction endpoint and watch pods scale up. Install it first:
```bash
go install github.com/rakyll/hey@latest
```
Generate sustained load with 50 concurrent workers for 2 minutes:
```bash
hey -z 120s -c 50 -m POST \
  -H "Content-Type: application/json" \
  -d '{"features": [0.1, 0.2, 0.3, 0.4, 0.5]}' \
  http://model-server.default.svc.cluster.local/predict
```
If you’re testing from outside the cluster, port-forward first:
```bash
kubectl port-forward svc/model-server 8080:80 &
hey -z 120s -c 50 -m POST \
  -H "Content-Type: application/json" \
  -d '{"features": [0.1, 0.2, 0.3, 0.4, 0.5]}' \
  http://localhost:8080/predict
```
In another terminal, watch the autoscaler respond:
```bash
kubectl get hpa model-server-hpa -w
```
You should see inference_latency_seconds_avg climb past 200m and request_queue_depth exceed 10. Within a minute, the HPA will start adding pods. After the load stops, it’ll wait the 5-minute stabilization window before scaling back down.
To see the scaling events:
```bash
kubectl describe hpa model-server-hpa
```
Look for the Events section at the bottom. It shows every scale-up and scale-down decision with timestamps and the metric values that triggered it.
Common Errors and Fixes#
HPA shows <unknown> for custom metrics. The prometheus-adapter can’t find the metrics. Check three things: (1) the ServiceMonitor labels match your Prometheus Operator’s selector (release: prometheus is the default for kube-prometheus-stack), (2) the adapter’s seriesQuery matches the actual metric name in Prometheus, and (3) the adapter pod is running (kubectl get pods -n monitoring | grep adapter).
Metrics API returns empty results. Run kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/*/inference_latency_seconds_avg" to query the API directly. If it returns an empty items list, the adapter rules don’t match. Check the adapter logs: kubectl logs -n monitoring deploy/prometheus-adapter.
Pods scale up but never scale down. The stabilizationWindowSeconds for scale-down is set to 300 seconds (5 minutes) by design. But if pods never scale down even after traffic drops, check that the metrics actually decrease. A histogram sum only grows, so the rate() query in the adapter must divide sum rate by count rate correctly. Verify with a direct Prometheus query.
Scale-up is too slow for traffic spikes. Reduce the stabilizationWindowSeconds for scale-up to 0 and increase the value in the scale-up policy. You can also add a second HPA metric on CPU as a fast-reacting fallback alongside the custom metrics.
GPU utilization reads 0. The pynvml library needs the NVIDIA driver to be visible inside the container. Make sure your container runtime has GPU support configured (nvidia-container-runtime) and the NVIDIA_VISIBLE_DEVICES environment variable is set. On Kubernetes, the nvidia.com/gpu resource limit handles this automatically when using the NVIDIA device plugin.
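One related cleanup worth making once the driver is visible: the `get_gpu_utilization` helper in the server pays `nvmlInit`/`nvmlShutdown` on every 5-second poll. A sketch of an alternative that initializes NVML once and caches the device handle, keeping the same fall-back-to-zero behavior on CPU-only machines (`GpuMonitor` is a hypothetical name, not part of pynvml):

```python
class GpuMonitor:
    """Initialize NVML once and reuse the device handle across polls."""

    def __init__(self, device_index: int = 0):
        self._handle = None
        self._pynvml = None
        try:
            import pynvml

            pynvml.nvmlInit()
            self._pynvml = pynvml
            self._handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
        except Exception:
            pass  # driver or library not visible; utilization() reports 0.0

    def utilization(self) -> float:
        if self._handle is None:
            return 0.0
        try:
            return float(
                self._pynvml.nvmlDeviceGetUtilizationRates(self._handle).gpu
            )
        except Exception:
            return 0.0  # e.g. GPU lost or NVML error mid-flight


monitor = GpuMonitor()
print(monitor.utilization())  # 0.0 on a machine without an NVIDIA GPU
```

In the server, you would construct one `GpuMonitor` in `lifespan` and have the background task call `monitor.utilization()` instead of `get_gpu_utilization()`.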