Why vLLM
Serving LLMs with naive PyTorch inference is painfully slow: requests are handled one at a time and the GPU sits mostly idle between tokens. vLLM fixes this with PagedAttention, a memory-management technique for the attention KV cache that, combined with continuous batching, lets the same hardware serve far more concurrent requests, with reported throughput gains of roughly 10-24x over naive serving.
It exposes an OpenAI-compatible API, so existing code that calls the OpenAI chat completions API (including legacy openai.ChatCompletion.create() calls) keeps working; the only change is pointing the client's base URL at your vLLM server instead of OpenAI.
Quick Start
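If vLLM isn't installed yet, a plain pip install works on a Linux machine with a supported NVIDIA GPU and CUDA drivers already in place:

```bash
pip install vllm
```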
Start serving a model with one command:
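```bash
# The model name is only an example; any Hugging Face model ID you have access to works.
vllm serve meta-llama/Llama-3.1-8B-Instruct
```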
That downloads the model from Hugging Face, loads it onto your GPU, and starts an OpenAI-compatible API server on port 8000. The first run takes a few minutes to download the weights; later starts just load the cached copy.
Call It Like OpenAI
The API is drop-in compatible. Use the standard OpenAI Python client.
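A minimal sketch, assuming the server from the Quick Start is running on localhost:8000 (the api_key value is a placeholder; vLLM ignores it unless you started the server with --api-key):

```python
from openai import OpenAI

# Point the standard OpenAI client at the local vLLM server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # must match the model the server loaded
    messages=[{"role": "user", "content": "Explain PagedAttention in one sentence."}],
)
print(response.choices[0].message.content)
```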
This is the killer feature — every tool and framework that works with the OpenAI API (LangChain, LlamaIndex, AutoGen) works with vLLM instantly. You just change the base_url.
Production Deployment with Docker
Don’t install vLLM directly on your production server. Use Docker with GPU passthrough.
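A typical invocation with the official vllm/vllm-openai image looks like the sketch below; the model name and host cache path are illustrative:

```bash
docker run --gpus all \
  --ipc=host \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-8B-Instruct
```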
The -v mount caches the model weights on the host so restarts don’t re-download. For multi-GPU setups, set --tensor-parallel-size to the number of GPUs.
Key Configuration Options
| Flag | What It Does | Default |
|---|---|---|
| --max-model-len | Maximum context window (tokens) | Model’s max |
| --gpu-memory-utilization | Fraction of GPU memory vLLM may use | 0.9 |
| --tensor-parallel-size | Number of GPUs for tensor parallelism | 1 |
| --max-num-seqs | Max concurrent sequences (requests) in a batch | 256 |
| --quantization | Quantization method (awq, gptq, squeezellm) | None |
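As a sketch, a tuned launch combining several of these flags might look like this (the values are illustrative, not recommendations):

```bash
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.85 \
  --max-num-seqs 128 \
  --tensor-parallel-size 1
```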
Streaming Responses
For chat applications, stream tokens as they’re generated instead of waiting for the full response.
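A minimal sketch with the same OpenAI client as above, just adding stream=True:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# stream=True yields chunks as tokens are generated instead of one final response.
stream = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
```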
Health Checks and Monitoring
vLLM exposes metrics for Prometheus out of the box.
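The metrics are served on the same port as the API (assuming the default localhost:8000):

```bash
# Prometheus-format metrics
curl http://localhost:8000/metrics

# Simple liveness check
curl http://localhost:8000/health
```

A minimal Prometheus scrape config pointing at that endpoint could look like:

```yaml
scrape_configs:
  - job_name: vllm
    static_configs:
      - targets: ["localhost:8000"]
```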
Key metrics to watch:
- vllm:num_requests_running: active requests
- vllm:num_requests_waiting: queued requests
- vllm:gpu_cache_usage_perc: KV cache utilization (if this hits 100%, requests get queued)
Common Issues
Out of GPU memory. Lower --gpu-memory-utilization to 0.8 or reduce --max-model-len. An 8B model needs roughly 16 GB of VRAM for the FP16 weights alone, plus headroom for the KV cache. Using --quantization awq with an AWQ-quantized checkpoint cuts weight memory substantially.
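For example, a reduced-memory launch might look like the sketch below; the checkpoint name is only an example of an AWQ-quantized model and the values are illustrative:

```bash
vllm serve TheBloke/Llama-2-7B-Chat-AWQ \
  --quantization awq \
  --gpu-memory-utilization 0.8 \
  --max-model-len 4096
```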
Slow first request. vLLM compiles CUDA kernels on the first request. This is a one-time cost per server start. Subsequent requests are fast.
Model not found. Make sure you have access to gated models (like Llama). Run huggingface-cli login first and accept the model’s license on the Hugging Face website.
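The login flow prompts for an access token from your Hugging Face account settings:

```bash
# Paste a Hugging Face access token when prompted, then retry the serve command
huggingface-cli login
```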
Choosing the Right Model Size
| Model | VRAM Needed | Throughput | Quality |
|---|---|---|---|
| 7-8B | 16 GB | ~50 tok/s | Good for simple tasks |
| 13B | 28 GB | ~30 tok/s | Better reasoning |
| 70B | 140 GB (or 2x80GB) | ~15 tok/s | Near GPT-4 quality |
| 70B AWQ | 40 GB | ~25 tok/s | 70B quality, less VRAM |
Start with the smallest model that meets your quality bar, then scale up if needed. Quantized models (AWQ, GPTQ) give you bigger models on less hardware with minimal quality loss.
Related Guides
- How to Serve LLMs in Production with SGLang
- How to Deploy LLMs to Production with Docker and FastAPI
- How to Autoscale LLM Inference on Kubernetes with KEDA
- How to Serve ML Models with BentoML and Build Prediction APIs
- How to Build a Model Serving Pipeline with Ray Serve and FastAPI
- How to Detect Model Drift and Data Drift in Production
- How to Route LLM Traffic by Cost and Complexity Using Intelligent Model Routing
- How to Build a Model Input Validation Pipeline with Pydantic and FastAPI
- How to Version and Deploy Models with MLflow Model Registry
- How to Build Blue-Green Deployments for ML Models