The Quick Version
LLM inference is expensive. Running GPU pods 24/7 when traffic is bursty wastes money. KEDA (Kubernetes Event-Driven Autoscaling) scales your LLM serving pods based on actual demand — queue depth, request rate, or custom metrics from your inference server.
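The deployment manifest didn't survive in this copy; here is a minimal sketch, assuming the vllm/vllm-openai image and one NVIDIA GPU per pod (names like vllm-server, the image tag, and resource sizes are illustrative):

```yaml
# vllm-deployment.yaml — a single vLLM pod serving Llama 3.1 8B,
# plus a Service so Prometheus and clients can reach it.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-server
  labels:
    app: vllm-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-server
  template:
    metadata:
      labels:
        app: vllm-server
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        args:
        - --model=meta-llama/Llama-3.1-8B-Instruct
        ports:
        - containerPort: 8000
          name: http
        resources:
          limits:
            nvidia.com/gpu: "1"
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-server
  labels:
    app: vllm-server
spec:
  selector:
    app: vllm-server
  ports:
  - name: http
    port: 8000
    targetPort: http
```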
Apply it with kubectl apply -f vllm-deployment.yaml. This gives you a single vLLM pod serving Llama 3.1 8B. Now let’s make it autoscale.
Scaling on Prometheus Metrics
vLLM exposes Prometheus metrics at /metrics. KEDA can read these and scale based on request queue depth — the number of requests waiting to be processed.
First, set up a ServiceMonitor so Prometheus scrapes vLLM:
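A sketch of the ServiceMonitor, assuming the Prometheus Operator is installed and the vLLM Service carries an app: vllm-server label with a port named http:

```yaml
# servicemonitor.yaml — tells the Prometheus Operator to scrape
# vLLM's /metrics endpoint every 15 seconds.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: vllm-server
spec:
  selector:
    matchLabels:
      app: vllm-server
  endpoints:
  - port: http
    path: /metrics
    interval: 15s
```

The matchLabels selector must match the labels on the Service, not the pods.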
Then create a KEDA ScaledObject that watches the pending requests metric:
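A sketch of the ScaledObject, assuming Prometheus is reachable in-cluster at the address shown (adjust serverAddress to your install); the query uses vLLM's vllm:num_requests_waiting gauge:

```yaml
# scaledobject.yaml — scale the vLLM deployment on pending-request depth.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-scaler
spec:
  scaleTargetRef:
    name: vllm-server
  minReplicaCount: 1
  maxReplicaCount: 4
  cooldownPeriod: 300   # seconds all triggers must stay inactive before scaling to zero
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.monitoring.svc:9090
      query: sum(vllm:num_requests_waiting)
      threshold: "5"
```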
Apply both files. When the pending request queue exceeds the threshold of 5, KEDA adds pods. When the queue drains, the HPA that KEDA manages scales back down after its stabilization window (300 seconds by default); cooldownPeriod applies only when scaling all the way to zero.
Scaling to Zero with Activation
For development or low-traffic endpoints, you can scale to zero to save GPU costs entirely:
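The scale-to-zero variant is the same ScaledObject with minReplicaCount: 0 and an activation threshold; a sketch, reusing the Prometheus trigger from above:

```yaml
# Scale-to-zero ScaledObject: any waiting request wakes the deployment.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-scaler-dev
spec:
  scaleTargetRef:
    name: vllm-server
  minReplicaCount: 0
  maxReplicaCount: 2
  cooldownPeriod: 300
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.monitoring.svc:9090
      query: sum(vllm:num_requests_waiting)
      threshold: "5"
      activationThreshold: "0"   # scale from 0→1 as soon as the metric exceeds 0
```

activationThreshold controls the 0→1 transition; the regular threshold then governs scaling between 1 and maxReplicaCount.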
The catch: cold starts for LLM pods are slow. Loading a 7B-8B model into GPU memory takes 30-60 seconds, longer if the container image isn't cached on the node. Pair scale-to-zero with a request queue (like RabbitMQ or Redis) that buffers requests while the pod spins up.
Queue-Based Scaling with Redis
For async inference workloads, scale based on queue length instead of live metrics. This works well when clients submit requests and poll for results.
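A sketch of a Redis-list trigger, assuming a worker Deployment named inference-worker that drains a list called inference-queue (both names, and the Redis address, are placeholders):

```yaml
# redis-scaledobject.yaml — one replica per ~10 queued jobs.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: inference-worker-scaler
spec:
  scaleTargetRef:
    name: inference-worker   # the worker deployment, not the vLLM server itself
  minReplicaCount: 0
  maxReplicaCount: 4
  triggers:
  - type: redis
    metadata:
      address: redis.default.svc:6379
      listName: inference-queue
      listLength: "10"       # target queue length per replica
```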
On the application side, push requests to Redis and have workers pull from it:
Queue-based scaling is more predictable than metric-based scaling: if each pod works through roughly N requests per minute, KEDA can compute how many pods a given backlog needs (queue length divided by the listLength target).
GPU-Aware Scheduling
Kubernetes doesn’t know that loading an LLM takes 60 seconds. Without proper configuration, the scheduler might route traffic to pods that aren’t ready, causing timeouts.
Add readiness probes and startup probes:
The startupProbe gives the pod up to 150 seconds (a 30-second initial delay plus 12 attempts at a 10-second period) to load the model before Kubernetes considers it failed. The readinessProbe ensures traffic only routes to pods that have the model loaded and are ready to serve.
Common Errors and Fixes
Pods stuck in Pending state
The cluster has no available GPU nodes. Check kubectl describe pod <name> — look for “Insufficient nvidia.com/gpu”. Either add GPU nodes to your cluster or reduce maxReplicaCount.
KEDA doesn’t scale up
Verify the Prometheus query returns data: curl "http://prometheus:9090/api/v1/query?query=sum(vllm:num_requests_waiting)". If empty, check that the ServiceMonitor is scraping correctly.
Scale-down is too aggressive
Increase cooldownPeriod to 600-900 seconds (it governs scaling to zero), and for scale-down between non-zero replica counts, raise the HPA's stabilization window via the ScaledObject's advanced.horizontalPodAutoscalerConfig.behavior.scaleDown settings. LLM pods are expensive to restart, so it's cheaper to keep a warm pod idle for 10 minutes than to reload the model every time traffic dips.
Out of memory during scale-up
When multiple pods start simultaneously, they all try to pull the image and load the model into GPU memory at once. Keep maxReplicaCount conservative and throttle the HPA's scale-up rate through the ScaledObject's advanced.horizontalPodAutoscalerConfig.behavior.scaleUp policies. (Note that a Deployment's maxSurge setting only applies to rolling updates, not to autoscaling.)
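One way to throttle concurrent pod creation is KEDA's HPA behavior passthrough on the ScaledObject; a sketch with illustrative values:

```yaml
# Add under the ScaledObject's spec: limit scale-up to one pod per two minutes.
spec:
  advanced:
    horizontalPodAutoscalerConfig:
      behavior:
        scaleUp:
          policies:
          - type: Pods
            value: 1
            periodSeconds: 120
```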
Uneven request distribution across pods
Kubernetes' default load balancing doesn't account for LLM request duration: long generation requests make some pods far busier than others. Make sure sessionAffinity is None (the default), and consider a least-requests load balancer or a routing layer that forwards to the pod with the fewest pending requests.
Related Guides
- How to Serve LLMs in Production with vLLM
- How to Implement Canary Deployments for ML Models
- How to Load Test and Benchmark LLM APIs with Locust
- How to Serve LLMs in Production with SGLang
- How to Route LLM Traffic by Cost and Complexity Using Intelligent Model Routing
- How to Build a Model Batch Inference Pipeline with Ray and Parquet
- How to A/B Test LLM Prompts and Models in Production
- How to Monitor LLM Apps with LangSmith
- How to Build a Model Rollback Pipeline with Health Checks
- How to Build a Model Serving Pipeline with Ray Serve and FastAPI