Why vLLM

Serving LLMs with naive PyTorch inference is painfully slow: requests are processed one at a time and the GPU sits mostly idle between tokens. vLLM fixes this with PagedAttention, a memory-management technique that all but eliminates KV-cache waste so the server can batch far more requests at once, serving roughly 10-24x more concurrent requests on the same hardware.

It exposes an OpenAI-compatible API, so existing code that calls the OpenAI client's chat.completions.create() works with zero changes. Just point it at your vLLM server instead of OpenAI.

Quick Start

pip install vllm

Start serving a model with one command:

vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --max-model-len 4096

That downloads the model from Hugging Face, loads it onto your GPU, and starts an OpenAI-compatible API server. The first run takes a few minutes to download the weights; subsequent starts only need to load them from the local cache.
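
Before wiring up a client library, you can confirm the server is up and see the raw wire format by hitting the OpenAI-compatible endpoints directly. A minimal sketch using the requests library, assuming the default host and port from the command above:

import requests

BASE = "http://localhost:8000"

# List the model(s) the server is currently serving
print(requests.get(f"{BASE}/v1/models", timeout=5).json())

# Send one chat completion over raw HTTP to see the wire format
resp = requests.post(
    f"{BASE}/v1/chat/completions",
    json={
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 32,
    },
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])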

Call It Like OpenAI

The API is drop-in compatible. Use the standard OpenAI Python client.

from openai import OpenAI

# Point at your vLLM server instead of OpenAI
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed",  # vLLM doesn't require an API key by default
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful coding assistant."},
        {"role": "user", "content": "Write a Python function to flatten a nested list."},
    ],
    temperature=0.7,
    max_tokens=512,
)

print(response.choices[0].message.content)

This is the killer feature — every tool and framework that works with the OpenAI API (LangChain, LlamaIndex, AutoGen) works with vLLM instantly. You just change the base_url.
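
For instance, pointing LangChain at the local server is just a constructor argument. A minimal sketch, assuming the langchain-openai integration package (parameter names can vary slightly between versions):

from langchain_openai import ChatOpenAI

# Any OpenAI-compatible framework only needs the base_url swapped to the vLLM server
llm = ChatOpenAI(
    model="meta-llama/Llama-3.1-8B-Instruct",
    base_url="http://localhost:8000/v1",
    api_key="not-needed",
    temperature=0.7,
)

print(llm.invoke("Give me one sentence about PagedAttention.").content)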

Production Deployment with Docker

Don’t install vLLM directly on your production server. Use Docker with GPU passthrough (the --gpus flag requires the NVIDIA Container Toolkit on the host).

FROM vllm/vllm-openai:latest

# The model is downloaded on the first start; mount the Hugging Face cache as a
# volume (see the docker run command below) so restarts don't re-download.
# The base image takes the model as a command-line argument at run time, so
# MODEL_NAME here is only a convenient place to document which model is served.
ENV MODEL_NAME=meta-llama/Llama-3.1-8B-Instruct

# Run with GPU access
docker run -d \
  --gpus all \
  --name vllm-server \
  -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --max-model-len 4096 \
  --tensor-parallel-size 1

The -v mount caches the model weights on the host so restarts don’t re-download. For multi-GPU setups, set --tensor-parallel-size to the number of GPUs.

Key Configuration Options

vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.9 \
  --tensor-parallel-size 2 \
  --max-num-seqs 256 \
  --quantization awq

| Flag | What It Does | Default |
|------|--------------|---------|
| --max-model-len | Maximum context window | Model’s max |
| --gpu-memory-utilization | GPU memory fraction to use | 0.9 |
| --tensor-parallel-size | Number of GPUs for model parallelism | 1 |
| --max-num-seqs | Max concurrent requests | 256 |
| --quantization | Quantization method (awq, gptq, squeezellm) | None |
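
To choose --max-model-len and --gpu-memory-utilization deliberately, it helps to estimate what the weights and the KV cache actually cost. A rough back-of-envelope sketch, assuming fp16 and Llama-3.1-8B's published architecture (32 transformer layers, 8 KV heads, head dimension 128):

# Rough memory budget for serving Llama-3.1-8B in fp16 (all numbers approximate)
params = 8e9
bytes_per_param = 2                       # fp16
weight_gb = params * bytes_per_param / 1e9
print(f"weights: ~{weight_gb:.0f} GB")    # ~16 GB

# KV cache per token = 2 (K and V) * layers * kv_heads * head_dim * bytes
layers, kv_heads, head_dim = 32, 8, 128
kv_per_token = 2 * layers * kv_heads * head_dim * bytes_per_param
print(f"KV cache per token: {kv_per_token / 1024:.0f} KB")   # 128 KB

# One full 4096-token sequence therefore pins roughly half a gigabyte of KV cache
seq_kv_gb = kv_per_token * 4096 / 1e9
print(f"KV cache per 4096-token sequence: ~{seq_kv_gb:.2f} GB")

On a 24 GB card, roughly 16 GB goes to the weights, so only a few gigabytes remain for KV cache; that is why lowering --max-model-len or serving a quantized checkpoint frees up so much concurrency.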

Streaming Responses

For chat applications, stream tokens as they’re generated instead of waiting for the full response.

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

stream = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Explain PagedAttention in 3 sentences."}],
    stream=True,
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
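
The same pattern works in async backends (FastAPI, aiohttp, and similar) via the async client. A minimal sketch using openai's AsyncOpenAI against the same server:

import asyncio

from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

async def main() -> None:
    stream = await client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",
        messages=[{"role": "user", "content": "Explain PagedAttention in 3 sentences."}],
        stream=True,
    )
    # Chunks have the same shape as in the sync client
    async for chunk in stream:
        if chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end="", flush=True)
    print()

asyncio.run(main())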

Health Checks and Monitoring

vLLM exposes metrics for Prometheus out of the box.

# Check if the server is healthy
curl http://localhost:8000/health

# Get Prometheus metrics
curl http://localhost:8000/metrics

# Programmatic health check
import requests

def check_vllm_health(url: str = "http://localhost:8000") -> bool:
    try:
        resp = requests.get(f"{url}/health", timeout=5)
        return resp.status_code == 200
    except requests.RequestException:
        return False

# List available models
resp = requests.get("http://localhost:8000/v1/models")
print(resp.json())

Key metrics to watch (a quick way to scrape them is sketched after this list):

  • vllm:num_requests_running — active requests
  • vllm:num_requests_waiting — queued requests
  • vllm:gpu_cache_usage_perc — KV cache utilization (if this hits 100%, requests get queued)
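
If you are not running a full Prometheus stack yet, you can still keep an eye on those gauges by scraping /metrics directly and filtering for them. A small sketch using requests (exact metric names can shift slightly between vLLM versions):

import requests

WATCHED = (
    "vllm:num_requests_running",
    "vllm:num_requests_waiting",
    "vllm:gpu_cache_usage_perc",
)

def scrape_vllm_metrics(url: str = "http://localhost:8000/metrics") -> dict[str, float]:
    """Return the current values of the watched gauges from the Prometheus text output."""
    values = {}
    for line in requests.get(url, timeout=5).text.splitlines():
        if line.startswith(WATCHED):
            name, _, value = line.rpartition(" ")
            # Strip Prometheus labels, e.g. vllm:num_requests_running{model_name="..."}
            values[name.split("{")[0]] = float(value)
    return values

print(scrape_vllm_metrics())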

Common Issues

Out of GPU memory. Lower --gpu-memory-utilization to 0.8 or reduce --max-model-len. An 8B model needs roughly 16 GB of VRAM for the fp16 weights alone, plus headroom for the KV cache. Serving an AWQ-quantized checkpoint with --quantization awq roughly halves the weight memory.

Slow first request. vLLM does kernel compilation and warm-up work around server start and the first request; this is a one-time cost per server start, and subsequent requests are fast.
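
If that first-hit latency matters for users, one option is to pay the cost up front with a tiny throwaway request as soon as the server comes up. A minimal sketch reusing the client setup from earlier:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

def warm_up() -> None:
    """Send a minimal request so the one-time startup cost is paid before real traffic arrives."""
    client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",
        messages=[{"role": "user", "content": "ping"}],
        max_tokens=1,
    )

warm_up()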

Model not found. Make sure you have access to gated models (like Llama). Run huggingface-cli login first and accept the model’s license on the Hugging Face website.

# Login to Hugging Face for gated models
pip install huggingface_hub
huggingface-cli login

Choosing the Right Model Size

| Model | VRAM Needed | Throughput | Quality |
|-------|-------------|------------|---------|
| 7-8B | 16 GB | ~50 tok/s | Good for simple tasks |
| 13B | 28 GB | ~30 tok/s | Better reasoning |
| 70B | 140 GB (or 2x80GB) | ~15 tok/s | Near GPT-4 quality |
| 70B AWQ | 40 GB | ~25 tok/s | 70B quality, less VRAM |

Start with the smallest model that meets your quality bar, then scale up if needed. Quantized models (AWQ, GPTQ) give you bigger models on less hardware with minimal quality loss.