Why vLLM
Serving LLMs with naive PyTorch inference is painfully slow: requests are processed one at a time and the GPU sits idle between tokens. vLLM fixes this with continuous batching and PagedAttention, a memory-management technique for the KV cache that lets you batch far more concurrent requests on the same hardware, with up to 24x the throughput of naive Hugging Face Transformers serving.
It exposes an OpenAI-compatible API, so your existing OpenAI client code (openai.ChatCompletion.create() in the legacy SDK, client.chat.completions.create() in the current one) works with a one-line change: point the base URL at your vLLM server instead of OpenAI.
Quick Start
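Assuming a Linux machine with a recent NVIDIA GPU and matching CUDA drivers, installation is a single pip command:

```bash
pip install vllm
```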
Start serving a model with one command:
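A minimal sketch; the model ID is just an example, swap in whatever Hugging Face model you want to serve:

```bash
# Starts an OpenAI-compatible API server on http://localhost:8000 by default.
vllm serve meta-llama/Llama-3.1-8B-Instruct
```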
That downloads the model from Hugging Face, loads it onto your GPU, and starts an OpenAI-compatible API server. The first run takes a few minutes to download; subsequent starts skip the download and only need to load the weights onto the GPU.
Call It Like OpenAI
The API is drop-in compatible. Use the standard OpenAI Python client.
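A sketch against the server started above; base_url points at the local vLLM instance, and the api_key can be any placeholder unless the server was launched with --api-key:

```python
from openai import OpenAI

# Same client you would use for OpenAI, just with a different base_url.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # must match the model the server is serving
    messages=[{"role": "user", "content": "Explain PagedAttention in one sentence."}],
)
print(response.choices[0].message.content)
```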
This is the killer feature — every tool and framework that works with the OpenAI API (LangChain, LlamaIndex, AutoGen) works with vLLM instantly. You just change the base_url.
Production Deployment with Docker
Don’t install vLLM directly on your production server. Use Docker with GPU passthrough.
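A sketch based on the official vllm/vllm-openai image; it assumes the NVIDIA Container Toolkit is set up for --gpus and reuses the example model from the Quick Start. Everything after the image name is passed straight to the vLLM server as flags.

```bash
docker run --gpus all \
  -p 8000:8000 \
  --ipc=host \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-8B-Instruct
```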
The -v mount caches the model weights on the host so restarts don’t re-download. For multi-GPU setups, set --tensor-parallel-size to the number of GPUs.
Key Configuration Options
| Flag | What It Does | Default |
|---|---|---|
| --max-model-len | Maximum context window | Model’s max |
| --gpu-memory-utilization | GPU memory fraction to use | 0.9 |
| --tensor-parallel-size | Number of GPUs for model parallelism | 1 |
| --max-num-seqs | Max concurrent requests | 256 |
| --quantization | Quantization method (awq, gptq, squeezellm) | None |
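As a hedged example of how these flags combine (the values are illustrative, not tuned recommendations), a two-GPU deployment with a capped context window might look like:

```bash
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.85 \
  --tensor-parallel-size 2 \
  --max-num-seqs 128
```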
Streaming Responses
For chat applications, stream tokens as they’re generated instead of waiting for the full response.
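A sketch with the same openai client as above; stream=True returns an iterator of chunks, and delta.content can be None for some events, so guard before printing:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# stream=True yields incremental chunks instead of one full response.
stream = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
```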
Health Checks and Monitoring
vLLM exposes metrics for Prometheus out of the box.
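Both endpoints live on the same port as the API (8000 by default), so a couple of curl calls are enough to check liveness and see what Prometheus will scrape:

```bash
# Liveness probe: returns HTTP 200 once the server is ready to accept requests
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8000/health

# Prometheus-format metrics (request counts, queue depth, KV cache usage, ...)
curl -s http://localhost:8000/metrics | head
```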
Key metrics to watch:
- vllm:num_requests_running — active requests
- vllm:num_requests_waiting — queued requests
- vllm:gpu_cache_usage_perc — KV cache utilization (if this hits 100%, requests get queued)
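To scrape these with Prometheus, a minimal sketch of a scrape config pointing at the vLLM host (job name, interval, and target are assumptions to adjust for your environment):

```yaml
scrape_configs:
  - job_name: "vllm"
    scrape_interval: 15s
    static_configs:
      - targets: ["localhost:8000"]   # the vLLM API server also serves /metrics
```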
Common Issues
Out of GPU memory. Lower --gpu-memory-utilization to 0.8 or reduce --max-model-len. For 8B models you need at least 16GB VRAM. Use --quantization awq to sharply cut weight memory (the table below shows a 70B model dropping from 140 GB to roughly 40 GB).
Slow first request. vLLM compiles CUDA kernels on the first request. This is a one-time cost per server start. Subsequent requests are fast.
Model not found. Make sure you have access to gated models (like Llama). Run huggingface-cli login first and accept the model’s license on the Hugging Face website.
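A typical sequence, with the token placeholder standing in for your own Hugging Face token:

```bash
huggingface-cli login                  # interactive token prompt
# or, non-interactively, export a token the server picks up at startup:
export HF_TOKEN=hf_xxxxxxxxxxxxxxxx    # placeholder token
```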
Choosing the Right Model Size
| Model | VRAM Needed | Throughput | Quality |
|---|---|---|---|
| 7-8B | 16 GB | ~50 tok/s | Good for simple tasks |
| 13B | 28 GB | ~30 tok/s | Better reasoning |
| 70B | 140 GB (or 2x80GB) | ~15 tok/s | Near GPT-4 quality |
| 70B AWQ | 40 GB | ~25 tok/s | 70B quality, less VRAM |
Start with the smallest model that meets your quality bar, then scale up if needed. Quantized models (AWQ, GPTQ) give you bigger models on less hardware with minimal quality loss.
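As an illustration (the repository name below is a placeholder, not a real checkpoint reference), serving a pre-quantized AWQ model is the same serve command with the quantization flag added:

```bash
# Model ID is a placeholder for whichever AWQ build you use.
vllm serve your-org/llama-3.1-70b-instruct-awq \
  --quantization awq \
  --tensor-parallel-size 2
```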