You have a model. You want it behind an API that handles real traffic. FastAPI gives you async request handling and automatic OpenAPI docs. Docker gives you reproducible deployments with GPU passthrough. Together they’re the fastest path from “it works on my machine” to a production inference endpoint.
This guide builds a complete LLM API server using llama-cpp-python for inference, FastAPI for the HTTP layer, and Docker with NVIDIA CUDA for containerized GPU support. Everything here is runnable: copy the files, build, and you have a working service.
The FastAPI Application
Here’s the full application. It loads a GGUF model at startup, exposes a /v1/completions endpoint for single-shot requests, and a /v1/completions/stream endpoint for server-sent events.
```python
# app/main.py
import os
import time
import asyncio
from contextlib import asynccontextmanager
from typing import AsyncGenerator

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel, Field
from sse_starlette.sse import EventSourceResponse
from starlette.concurrency import iterate_in_threadpool
from llama_cpp import Llama

# --- Configuration ---
MODEL_PATH = os.getenv("MODEL_PATH", "/models/model.gguf")
N_GPU_LAYERS = int(os.getenv("N_GPU_LAYERS", "-1"))  # -1 = offload all layers to GPU
N_CTX = int(os.getenv("N_CTX", "4096"))
MAX_CONCURRENT = int(os.getenv("MAX_CONCURRENT", "4"))

# Semaphore to limit concurrent inference calls
inference_semaphore = asyncio.Semaphore(MAX_CONCURRENT)

# Global model reference
llm: Llama | None = None


@asynccontextmanager
async def lifespan(app: FastAPI):
    """Load the model once at startup, release on shutdown."""
    global llm
    print(f"Loading model from {MODEL_PATH}...")
    start = time.time()
    llm = Llama(
        model_path=MODEL_PATH,
        n_gpu_layers=N_GPU_LAYERS,
        n_ctx=N_CTX,
        verbose=False,
    )
    print(f"Model loaded in {time.time() - start:.1f}s")
    yield
    print("Shutting down, releasing model...")
    llm = None  # drop the reference; /health now reports 503


app = FastAPI(title="LLM Inference API", lifespan=lifespan)


# --- Request / Response Models ---
class CompletionRequest(BaseModel):
    prompt: str = Field(..., min_length=1, max_length=8192)
    max_tokens: int = Field(default=512, ge=1, le=4096)
    temperature: float = Field(default=0.7, ge=0.0, le=2.0)
    stop: list[str] | None = None


class CompletionResponse(BaseModel):
    text: str
    tokens_used: int
    generation_time_ms: float


# --- Endpoints ---
@app.get("/health")
async def health():
    if llm is None:
        raise HTTPException(status_code=503, detail="Model not loaded")
    return {"status": "ok", "model": MODEL_PATH, "max_ctx": N_CTX}


@app.post("/v1/completions", response_model=CompletionResponse)
async def create_completion(req: CompletionRequest):
    async with inference_semaphore:
        start = time.time()
        # llama.cpp is a blocking C++ call; run it off the event loop
        output = await asyncio.to_thread(
            llm,
            req.prompt,
            max_tokens=req.max_tokens,
            temperature=req.temperature,
            stop=req.stop,
        )
        elapsed = (time.time() - start) * 1000
    text = output["choices"][0]["text"]
    tokens = output["usage"]["completion_tokens"]
    return CompletionResponse(text=text, tokens_used=tokens, generation_time_ms=round(elapsed, 1))


@app.post("/v1/completions/stream")
async def create_completion_stream(req: CompletionRequest):
    async def event_generator() -> AsyncGenerator[dict, None]:
        async with inference_semaphore:
            stream = llm(
                req.prompt,
                max_tokens=req.max_tokens,
                temperature=req.temperature,
                stop=req.stop,
                stream=True,
            )
            # The stream is a blocking generator; iterate it in a worker
            # thread so the event loop stays responsive while tokens generate
            async for chunk in iterate_in_threadpool(stream):
                token = chunk["choices"][0]["text"]
                if token:
                    yield {"data": token}
        yield {"data": "[DONE]"}

    return EventSourceResponse(event_generator())
```
A few design decisions worth noting. The model loads once, inside the lifespan context manager, not on every request; loading a 7B GGUF model takes 5-15 seconds, and you don't want that on the request path. The asyncio.Semaphore caps concurrent inference to prevent GPU OOM under load. And asyncio.to_thread offloads the blocking C++ inference call so FastAPI's event loop stays responsive for health checks and other requests.
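The concurrency cap is easy to sanity-check in isolation. This toy sketch (no model involved; a `time.sleep` stands in for the blocking inference call) confirms that the semaphore-plus-`asyncio.to_thread` pattern never runs more than the permitted number of blocking calls at once:

```python
import asyncio
import time

async def main() -> int:
    sem = asyncio.Semaphore(2)  # stand-in for MAX_CONCURRENT=2
    active = 0
    peak = 0

    def blocking_call():
        time.sleep(0.05)  # stands in for the blocking llama.cpp inference

    async def request():
        nonlocal active, peak
        async with sem:
            active += 1
            peak = max(peak, active)
            await asyncio.to_thread(blocking_call)
            active -= 1

    # Fire 8 "requests" at once; the semaphore admits at most 2 at a time
    await asyncio.gather(*(request() for _ in range(8)))
    return peak

peak = asyncio.run(main())
print(peak)  # 2 on a normal run -- never exceeds the semaphore limit
```

The same structure scales to any MAX_CONCURRENT value; only the semaphore's initial count changes.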
Dependencies
```text
# app/requirements.txt
fastapi==0.115.6
uvicorn[standard]==0.34.0
sse-starlette==2.2.1
llama-cpp-python==0.3.6
pydantic==2.10.4
```
The Dockerfile
Use a multi-stage build with NVIDIA’s CUDA base image. The build stage compiles llama-cpp-python with CUDA support, and the runtime stage keeps the final image lean.
```dockerfile
# Dockerfile
# === Build stage ===
FROM nvidia/cuda:12.4.1-devel-ubuntu22.04 AS builder

RUN apt-get update && apt-get install -y --no-install-recommends \
    python3 python3-pip python3-dev build-essential cmake \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /build
COPY app/requirements.txt .

# Build llama-cpp-python with CUDA support
ENV CMAKE_ARGS="-DGGML_CUDA=on"
ENV FORCE_CMAKE=1
RUN pip3 install --no-cache-dir --prefix=/install -r requirements.txt

# === Runtime stage ===
FROM nvidia/cuda:12.4.1-runtime-ubuntu22.04

RUN apt-get update && apt-get install -y --no-install-recommends \
    python3 python3-pip \
    && rm -rf /var/lib/apt/lists/*

# Copy installed Python packages from builder
COPY --from=builder /install /usr/local

WORKDIR /app
COPY app/ .

# Create model directory
RUN mkdir -p /models

# Non-root user for production
RUN useradd -m -s /bin/bash appuser && chown -R appuser:appuser /app /models
USER appuser

EXPOSE 8000

# Use exec form so signals propagate correctly for graceful shutdown
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000", "--workers", "1"]
```
The devel image is only used for compilation: it includes the full CUDA toolkit and headers needed to build llama-cpp-python. The runtime image carries only the CUDA runtime libraries, which cuts the final image from roughly 8 GB to about 4 GB.
One worker is intentional. LLM inference is GPU-bound, and multiple Uvicorn workers would each try to load the model into GPU memory separately. If you need horizontal scaling, run multiple containers behind a load balancer instead.
Docker Compose with GPU Support
```yaml
# docker-compose.yml
services:
  llm-api:
    build: .
    ports:
      - "8000:8000"
    volumes:
      - ./models:/models:ro
    environment:
      - MODEL_PATH=/models/mistral-7b-instruct-v0.3.Q4_K_M.gguf
      - N_GPU_LAYERS=-1
      - N_CTX=4096
      - MAX_CONCURRENT=4
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    healthcheck:
      test: ["CMD", "python3", "-c", "import urllib.request; urllib.request.urlopen('http://localhost:8000/health')"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 60s
    restart: unless-stopped
```
The start_period of 60 seconds gives the model time to load before Docker starts counting failed health checks. Without this, Docker restarts the container before the model finishes loading, and you end up in an infinite restart loop.
Download a model and start the service:
```shell
# Download a quantized model (about 4 GB). GGUF hosting repos move around;
# if this URL 404s, grab any Mistral-7B-Instruct-v0.3 Q4_K_M GGUF and keep
# the filename so it matches MODEL_PATH in docker-compose.yml.
mkdir -p models
wget -O models/mistral-7b-instruct-v0.3.Q4_K_M.gguf \
  https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.3-GGUF/resolve/main/mistral-7b-instruct-v0.3.Q4_K_M.gguf

# Build and start
docker compose up --build -d

# Watch the logs until you see "Model loaded"
docker compose logs -f llm-api
```
Production Hardening
The bare FastAPI app works, but production traffic needs a few more layers. Add rate limiting, request timeouts, and structured logging with middleware.
```python
# app/middleware.py
import os
import time
import asyncio
import logging
from collections import defaultdict
from fastapi import Request
from fastapi.responses import JSONResponse
from starlette.middleware.base import BaseHTTPMiddleware

logger = logging.getLogger("llm-api")

# Simple in-memory rate limiter (use Redis for multi-instance)
request_counts: dict[str, list[float]] = defaultdict(list)
RATE_LIMIT = int(os.getenv("RATE_LIMIT", "10"))  # requests per minute
REQUEST_TIMEOUT = int(os.getenv("REQUEST_TIMEOUT", "120"))  # seconds


class RateLimitMiddleware(BaseHTTPMiddleware):
    async def dispatch(self, request: Request, call_next):
        client_ip = request.client.host if request.client else "unknown"
        now = time.time()
        # Drop entries older than the 60-second window, then check the rate
        request_counts[client_ip] = [
            t for t in request_counts[client_ip] if now - t < 60
        ]
        if len(request_counts[client_ip]) >= RATE_LIMIT:
            # Return a response directly: an HTTPException raised inside
            # BaseHTTPMiddleware is not caught by FastAPI's exception
            # handlers and would surface as a 500
            return JSONResponse(
                status_code=429,
                content={"detail": f"Rate limit exceeded. Max {RATE_LIMIT} requests per minute."},
            )
        request_counts[client_ip].append(now)
        # Enforce a hard wall-clock timeout on the whole request
        try:
            return await asyncio.wait_for(call_next(request), timeout=REQUEST_TIMEOUT)
        except asyncio.TimeoutError:
            logger.warning(f"Request from {client_ip} timed out after {REQUEST_TIMEOUT}s")
            return JSONResponse(status_code=504, content={"detail": "Request timed out"})
```
Register it in your app:
```python
# Add to app/main.py after creating the FastAPI instance
from middleware import RateLimitMiddleware

app.add_middleware(RateLimitMiddleware)
```
For multi-instance deployments, swap the in-memory dict for Redis or use an API gateway like Kong or Nginx with limit_req_zone for rate limiting at the infrastructure level.
Testing the API
Once the container is running, hit it with curl:
````shell
# Health check
curl http://localhost:8000/health
# {"status":"ok","model":"/models/mistral-7b-instruct-v0.3.Q4_K_M.gguf","max_ctx":4096}

# Single completion
curl -X POST http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Write a Python function that reverses a linked list:\n```python\n",
    "max_tokens": 256,
    "temperature": 0.3
  }'

# Streaming response -- tokens arrive as SSE events
curl -N http://localhost:8000/v1/completions/stream \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Explain CUDA memory management in three sentences:", "max_tokens": 128}'
````
The -N flag on curl disables buffering so you see tokens as they arrive. Without it, curl waits for the entire response before printing anything.
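Consuming the stream from code rather than curl means splitting the SSE frames yourself. Here is a minimal parser for the framing this endpoint emits (`data: <token>` events terminated by a `[DONE]` sentinel); a real client would feed it lines incrementally from the HTTP response, and it assumes single-line tokens, which is what the endpoint produces:

```python
def parse_sse(raw: str) -> list[str]:
    """Collect data: payloads from an SSE stream until the [DONE] sentinel."""
    tokens = []
    for line in raw.splitlines():
        if not line.startswith("data: "):
            continue  # skip blank separator lines and comments
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break
        tokens.append(payload)
    return tokens

# Two token events followed by the done sentinel, as sse-starlette frames them
sample = "data: Hel\n\ndata: lo\n\ndata: [DONE]\n\n"
print(parse_sse(sample))  # ['Hel', 'lo']
```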
Handling Concurrent Requests
The semaphore in the application code is your first line of defense. But you also need to think about what happens when every slot is taken. Here's how to add backpressure so waiting clients get a fast, explicit answer instead of hanging:
```python
# Replace the simple semaphore pattern in main.py with this
@app.post("/v1/completions", response_model=CompletionResponse)
async def create_completion(req: CompletionRequest):
    # Wait at most 5s for a free inference slot. asyncio.wait_for works on
    # Python 3.10 (Ubuntu 22.04's default); asyncio.timeout needs 3.11+.
    try:
        await asyncio.wait_for(inference_semaphore.acquire(), timeout=5)
    except asyncio.TimeoutError:
        raise HTTPException(
            status_code=503,
            detail="Server busy. All inference slots occupied. Retry in a few seconds.",
            headers={"Retry-After": "5"},
        )
    try:
        start = time.time()
        output = await asyncio.to_thread(
            llm,
            req.prompt,
            max_tokens=req.max_tokens,
            temperature=req.temperature,
            stop=req.stop,
        )
        elapsed = (time.time() - start) * 1000
    finally:
        inference_semaphore.release()
    text = output["choices"][0]["text"]
    tokens = output["usage"]["completion_tokens"]
    return CompletionResponse(text=text, tokens_used=tokens, generation_time_ms=round(elapsed, 1))
```
This gives clients a clear 503 with a retry hint instead of hanging indefinitely. Set MAX_CONCURRENT to match your GPU's capacity: for a single GPU running a 7B Q4 model, 2-4 concurrent requests is typically the sweet spot before latency degrades.
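That 503 only helps if clients actually back off instead of hammering the endpoint. A hedged client-side sketch, where `BusyError` stands in for however your HTTP library surfaces a 503:

```python
import time

class BusyError(Exception):
    """Stand-in for the HTTP 503 the endpoint returns when all slots are busy."""

def call_with_backoff(fn, retries=4, base_delay=0.01):
    """Retry fn with exponential backoff whenever it raises BusyError."""
    for attempt in range(retries):
        try:
            return fn()
        except BusyError:
            if attempt == retries - 1:
                raise  # out of retries; surface the error to the caller
            time.sleep(base_delay * (2 ** attempt))  # 10 ms, 20 ms, 40 ms, ...

# Simulate two busy responses before a slot frees up
attempts = {"n": 0}
def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise BusyError()
    return "completion text"

result = call_with_backoff(flaky)
print(result)  # completion text, on the third attempt
```

In production you'd also honor the `Retry-After` header if the server sends one, and add jitter so a burst of clients doesn't retry in lockstep.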
Common Errors and Fixes
CUDA out of memory on startup.
The model doesn’t fit in your GPU VRAM. A 7B model at Q4 quantization needs roughly 4-5 GB of VRAM. If you’re on a 6 GB card, reduce N_GPU_LAYERS to offload some layers to CPU instead of using -1 (all layers). Start with N_GPU_LAYERS=20 and increase until you hit OOM, then back off by 2.
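The increase-until-OOM loop is a binary search in disguise, so if you script it, bisect instead of stepping by 2. A sketch with a placeholder predicate (in reality `loads_ok(n)` would attempt `Llama(..., n_gpu_layers=n)` and catch the load failure):

```python
def find_max_layers(loads_ok, n_layers: int = 32) -> int:
    """Binary-search the largest n_gpu_layers that still loads without OOM."""
    lo, hi = 0, n_layers
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if loads_ok(mid):
            lo = mid  # mid layers fit; try offloading more
        else:
            hi = mid - 1  # mid layers OOM; try fewer
    return lo

# Fake predicate: pretend anything above 26 layers OOMs on this card
print(find_max_layers(lambda n: n <= 26))  # 26
```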
```shell
# docker compose up has no -e flag; set the value in docker-compose.yml instead:
#   environment:
#     - N_GPU_LAYERS=25   # 25 layers on GPU, rest on CPU
# then recreate the container
docker compose up -d
```
RuntimeError: CUDA driver version is insufficient for CUDA runtime version.
Your host NVIDIA driver is too old for the CUDA version in the Docker image. Check your driver version with nvidia-smi and match it to the CUDA compatibility matrix. CUDA 12.4 needs driver 550+ on Linux. Upgrade the driver or use an older CUDA base image.
```shell
# Check your driver version
nvidia-smi
# Look for "Driver Version: 550.xx" or higher
```
Container keeps restarting in a loop.
Usually the health check is failing before the model finishes loading. Increase start_period in your Docker Compose health check to give the model enough time. For 13B models, 120 seconds is safer. Check logs with docker compose logs llm-api to confirm.
CUDA out of memory during inference (not startup).
This hits when concurrent requests exceed GPU memory (with llama.cpp the logs show a ggml/CUDA allocation failure, not a PyTorch error, since nothing in this stack uses torch). Lower MAX_CONCURRENT from 4 to 2, or reduce N_CTX from 4096 to 2048. Longer context windows consume more KV cache memory per request.
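The KV-cache cost is straightforward to estimate. For a GQA model with Mistral-7B-like dimensions (32 layers, 8 KV heads, head dim 128; values from the published architecture, assuming llama.cpp's default fp16 cache), a back-of-envelope calculation:

```python
def kv_cache_bytes(n_layers: int, n_ctx: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elt: int = 2) -> int:
    # K and V tensors: one entry per layer, per position, per KV head
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elt

# Mistral-7B-style GQA dimensions, fp16 cache
full = kv_cache_bytes(32, 4096, 8, 128)
print(f"{full / 2**20:.0f} MiB per request at n_ctx=4096")  # 512 MiB
print(f"{kv_cache_bytes(32, 2048, 8, 128) / 2**20:.0f} MiB at n_ctx=2048")  # 256 MiB
```

The cache scales linearly with context length, which is why halving N_CTX buys real headroom under concurrent load.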
Connection refused when calling the API.
The Uvicorn server binds to 0.0.0.0 inside the container, so first check that the container is actually running with docker compose ps; a container that crashed during model load leaves nothing listening. On macOS with Docker Desktop, also confirm the port mapping in docker-compose.yml. And verify the NVIDIA Container Toolkit is installed:
```shell
# Verify GPU access inside Docker
docker run --rm --gpus all nvidia/cuda:12.4.1-runtime-ubuntu22.04 nvidia-smi
```
If that command fails, install the NVIDIA Container Toolkit:
```shell
# Ubuntu/Debian
distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker
```
Slow first request after startup.
This is normal. The first inference call triggers CUDA kernel compilation and memory allocation. Subsequent requests are much faster. Add a warm-up call in the lifespan handler to eat this cost at startup rather than on the first real request:
```python
# Add at the end of the lifespan startup, before yield
_ = llm("warmup", max_tokens=1)
print("Warm-up complete")
```