The Quick Version
LLM inference is expensive. Running GPU pods 24/7 when traffic is bursty wastes money. KEDA (Kubernetes Event-Driven Autoscaling) scales your LLM serving pods based on actual demand — queue depth, request rate, or custom metrics from your inference server.
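The deployment manifest didn't survive in this copy; here is a minimal sketch, assuming the vllm/vllm-openai image and one NVIDIA GPU per pod (names like vllm-server, the image tag, and resource sizes are illustrative):

```yaml
# vllm-deployment.yaml — a single vLLM pod serving Llama 3.1 8B,
# plus a Service so Prometheus and clients can reach it.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-server
  labels:
    app: vllm-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-server
  template:
    metadata:
      labels:
        app: vllm-server
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        args:
        - --model=meta-llama/Llama-3.1-8B-Instruct
        ports:
        - containerPort: 8000
          name: http
        resources:
          limits:
            nvidia.com/gpu: "1"
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-server
  labels:
    app: vllm-server
spec:
  selector:
    app: vllm-server
  ports:
  - name: http
    port: 8000
    targetPort: http
```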
Apply it with kubectl apply -f vllm-deployment.yaml. This gives you a single vLLM pod serving Llama 3.1 8B. Now let’s make it autoscale.
Scaling on Prometheus Metrics
vLLM exposes Prometheus metrics at /metrics. KEDA can read these and scale based on request queue depth — the number of requests waiting to be processed.
First, set up a ServiceMonitor so Prometheus scrapes vLLM:
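A sketch of the ServiceMonitor, assuming the Prometheus Operator is installed and the vLLM Service carries an app: vllm-server label with a port named http:

```yaml
# servicemonitor.yaml — tells the Prometheus Operator to scrape
# vLLM's /metrics endpoint every 15 seconds.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: vllm-server
spec:
  selector:
    matchLabels:
      app: vllm-server
  endpoints:
  - port: http
    path: /metrics
    interval: 15s
```

The matchLabels selector must match the labels on the Service, not the pods.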
Then create a KEDA ScaledObject that watches the pending requests metric:
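A sketch of the ScaledObject, assuming Prometheus is reachable in-cluster at the address shown (adjust serverAddress to your install); the query uses vLLM's vllm:num_requests_waiting gauge:

```yaml
# scaledobject.yaml — scale the vLLM deployment on pending-request depth.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-scaler
spec:
  scaleTargetRef:
    name: vllm-server
  minReplicaCount: 1
  maxReplicaCount: 4
  cooldownPeriod: 300   # seconds all triggers must stay inactive before scaling to zero
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.monitoring.svc:9090
      query: sum(vllm:num_requests_waiting)
      threshold: "5"
```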
Apply both files. When the pending request queue exceeds the threshold of 5, KEDA adds pods. When the queue drains, the HPA that KEDA manages scales back down after its stabilization window (300 seconds by default); cooldownPeriod applies only when scaling all the way to zero.
Scaling to Zero with Activation
For development or low-traffic endpoints, you can scale to zero to save GPU costs entirely:
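The scale-to-zero variant is the same ScaledObject with minReplicaCount: 0 and an activation threshold; a sketch, reusing the Prometheus trigger from above:

```yaml
# Scale-to-zero ScaledObject: any waiting request wakes the deployment.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-scaler-dev
spec:
  scaleTargetRef:
    name: vllm-server
  minReplicaCount: 0
  maxReplicaCount: 2
  cooldownPeriod: 300
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.monitoring.svc:9090
      query: sum(vllm:num_requests_waiting)
      threshold: "5"
      activationThreshold: "0"   # scale from 0→1 as soon as the metric exceeds 0
```

activationThreshold controls the 0→1 transition; the regular threshold then governs scaling between 1 and maxReplicaCount.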
The catch: cold starts for LLM pods are slow. Loading a 7B-8B model into GPU memory takes 30-60 seconds, longer if the container image isn't cached on the node. Pair scale-to-zero with a request queue (like RabbitMQ or Redis) that buffers requests while the pod spins up.
Queue-Based Scaling with Redis
For async inference workloads, scale based on queue length instead of live metrics. This works well when clients submit requests and poll for results.
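A sketch of a Redis-list trigger, assuming a worker Deployment named inference-worker that drains a list called inference-queue (both names, and the Redis address, are placeholders):

```yaml
# redis-scaledobject.yaml — one replica per ~10 queued jobs.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: inference-worker-scaler
spec:
  scaleTargetRef:
    name: inference-worker   # the worker deployment, not the vLLM server itself
  minReplicaCount: 0
  maxReplicaCount: 4
  triggers:
  - type: redis
    metadata:
      address: redis.default.svc:6379
      listName: inference-queue
      listLength: "10"       # target queue length per replica
```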
On the application side, push requests to Redis and have workers pull from it:
Queue-based scaling is more predictable than metric-based scaling: if each pod works through roughly N requests per minute, KEDA can compute how many pods a given backlog needs (queue length divided by the listLength target).
GPU-Aware Scheduling
Kubernetes doesn’t know that loading an LLM takes 60 seconds. Without proper configuration, the scheduler might route traffic to pods that aren’t ready, causing timeouts.
Add readiness probes and startup probes:
The startupProbe gives the pod up to 150 seconds (a 30-second initial delay plus 12 attempts at a 10-second period) to load the model before Kubernetes considers it failed. The readinessProbe ensures traffic only routes to pods that have the model loaded and are ready to serve.
Common Errors and Fixes
Pods stuck in Pending state
The cluster has no available GPU nodes. Check kubectl describe pod <name> — look for “Insufficient nvidia.com/gpu”. Either add GPU nodes to your cluster or reduce maxReplicaCount.
KEDA doesn’t scale up
Verify the Prometheus query returns data: curl "http://prometheus:9090/api/v1/query?query=sum(vllm:num_requests_waiting)". If empty, check that the ServiceMonitor is scraping correctly.
Scale-down is too aggressive
Increase cooldownPeriod to 600-900 seconds (it governs scaling to zero), and for scale-down between non-zero replica counts, raise the HPA's stabilization window via the ScaledObject's advanced.horizontalPodAutoscalerConfig.behavior.scaleDown settings. LLM pods are expensive to restart, so it's cheaper to keep a warm pod idle for 10 minutes than to reload the model every time traffic dips.
Out of memory during scale-up
When multiple pods start simultaneously, they all try to pull the image and load the model into GPU memory at once. Keep maxReplicaCount conservative and throttle the HPA's scale-up rate through the ScaledObject's advanced.horizontalPodAutoscalerConfig.behavior.scaleUp policies. (Note that a Deployment's maxSurge setting only applies to rolling updates, not to autoscaling.)
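One way to throttle concurrent pod creation is KEDA's HPA behavior passthrough on the ScaledObject; a sketch with illustrative values:

```yaml
# Add under the ScaledObject's spec: limit scale-up to one pod per two minutes.
spec:
  advanced:
    horizontalPodAutoscalerConfig:
      behavior:
        scaleUp:
          policies:
          - type: Pods
            value: 1
            periodSeconds: 120
```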
Uneven request distribution across pods
Kubernetes' default load balancing doesn't account for LLM request duration: long generation requests make some pods far busier than others. Make sure sessionAffinity is None (the default), and consider a least-requests load balancer or a routing layer that forwards to the pod with the fewest pending requests.
Related Guides
- How to Serve LLMs in Production with vLLM
- How to Implement Canary Deployments for ML Models
- How to Load Test and Benchmark LLM APIs with Locust
- How to Serve LLMs in Production with SGLang
- How to Route LLM Traffic by Cost and Complexity Using Intelligent Model Routing
- How to Build a Model Batch Inference Pipeline with Ray and Parquet
- How to A/B Test LLM Prompts and Models in Production
- How to Monitor LLM Apps with LangSmith
- How to Build a Model Rollback Pipeline with Health Checks
- How to Build a Model Serving Pipeline with Ray Serve and FastAPI