Why SGLang Over Other Serving Frameworks
SGLang has quietly become the default inference engine at shops running serious LLM workloads. It powers trillions of tokens per day across companies like xAI, Cursor, and LinkedIn. The reason is RadixAttention – a prefix caching system that reuses KV cache across requests sharing common prefixes. If you’re sending the same system prompt to every request (and you probably are), SGLang caches that work instead of recomputing it.
Compared to vLLM, SGLang consistently wins on time-to-first-token (TTFT) for workloads with shared prefixes. It also has a zero-overhead CPU scheduler, prefill-decode disaggregation, and native support for structured output generation. The current stable release is v0.5.8.
Install SGLang
The fastest path is uv (the Rust-based pip replacement):
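A minimal setup looks something like this; the [all] extra is one reasonable choice and pulls in the full serving stack, so trim it if you only need the client:

```bash
# Install uv, create a virtual environment, then install SGLang
# with its serving dependencies.
pip install uv
uv venv && source .venv/bin/activate
uv pip install "sglang[all]"
```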
SGLang needs a GPU with compute capability sm75 or higher (T4, A10, A100, L4, L40S, H100, or newer). FlashInfer is the default attention backend and it won’t work on older cards.
Start the Server
Launch a model with a single command:
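A typical launch looks like this; the model path is just an example, so swap in whatever you serve:

```bash
# Serve an 8B instruct model on SGLang's default port (30000).
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 \
  --port 30000
```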
The --mem-fraction-static flag controls the fraction of GPU memory reserved for static allocations, meaning the model weights plus the KV cache pool. The default is 0.88. Drop it to 0.8 or 0.7 if you’re getting OOM errors during decoding.
Once the server prints “The server is fired up and ready to roll”, you’re live.
Send Requests with the OpenAI SDK
SGLang exposes an OpenAI-compatible API at /v1. Point the standard OpenAI Python client at it:
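A minimal sketch, assuming the server from the previous step is running on port 30000 and the model name mirrors whatever you passed to --model-path:

```python
from openai import OpenAI

# Any placeholder key works unless the server was started with --api-key.
client = OpenAI(base_url="http://localhost:30000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # same path you launched with
    messages=[
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "What does RadixAttention cache?"},
    ],
    temperature=0,
)
print(response.choices[0].message.content)
```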
Every framework that integrates with OpenAI’s API – LangChain, LlamaIndex, AutoGen, your custom code – works here with just a base_url change. No SDK swap needed.
Deploy with Docker
For production, use the official Docker images. The runtime variant strips out build tools and dev dependencies, cutting image size by roughly 40%.
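Something along these lines, assuming the standard lmsysorg/sglang image; swap in the runtime tag you actually deploy:

```bash
# HF_TOKEN is only needed for gated models like Llama.
docker run --gpus all \
  --shm-size 32g \
  --ipc=host \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e HF_TOKEN=$HF_TOKEN \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path meta-llama/Llama-3.1-8B-Instruct \
    --host 0.0.0.0 \
    --port 30000
```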
The --shm-size 32g is not optional. SGLang uses shared memory for inter-process communication and will fail silently or hang without it. The -v mount caches Hugging Face model weights on the host so restarts don’t re-download multi-gigabyte files.
Use the Native Generate API
Besides the OpenAI-compatible endpoints, SGLang has its own /generate endpoint that exposes more control:
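A quick sketch using requests; the prompt and sampling parameters are placeholders:

```python
import requests

resp = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "List three uses for a spare GPU:",
        "sampling_params": {
            "temperature": 0.7,
            "max_new_tokens": 128,
        },
    },
)
data = resp.json()
print(data["text"])       # the generated completion
print(data["meta_info"])  # token counts and timing details
```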
The meta_info field gives you prompt token count, completion token count, and latency breakdown – useful for building dashboards without bolting on a separate metrics layer.
Multi-GPU with Tensor Parallelism
For models that don’t fit on a single GPU, split them across cards:
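For example, a 70B model across two GPUs (the model path is again just illustrative):

```bash
# --tp 2 shards each layer's weights across two GPUs.
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-70B-Instruct \
  --tp 2 \
  --port 30000
```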
The --tp 2 flag shards the model across 2 GPUs using tensor parallelism. For a 70B model, you need at least 2x 80GB GPUs (A100 or H100). With 4x A100s, use --tp 4 for better throughput.
SGLang also supports expert parallelism (--ep) for MoE models like DeepSeek V3 and data parallelism (--dp) for scaling throughput across GPUs serving the same model.
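As a rough sketch, data parallelism looks like this: two full replicas of a small model behind one endpoint, with the scheduler spreading requests across them:

```bash
# Assumes two GPUs, each large enough to hold the full model.
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --dp 2 \
  --port 30000
```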
Key Server Flags
| Flag | What It Does | Default |
|---|---|---|
| --mem-fraction-static | GPU memory fraction for model weights + KV cache pool | 0.88 |
| --tp | Tensor parallel GPUs | 1 |
| --dp | Data parallel replicas | 1 |
| --chunked-prefill-size | Chunk size for long prompt prefill | 8192 |
| --max-running-requests | Max concurrent decoding requests | auto |
| --quantization | Quantization (fp8, int4, awq, gptq) | None |
| --disable-radix-cache | Turn off prefix caching | false |
| --context-length | Override model’s default context length | Model default |
Health Checks
SGLang provides two health endpoints. Use the basic one for load balancer probes and the generate one for deeper validation:
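Assuming the default port:

```bash
# Liveness: cheap check that the HTTP server is responding.
curl http://localhost:30000/health

# Readiness: runs a tiny generation through the model itself.
curl http://localhost:30000/health_generate
```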
Troubleshooting Real Errors
CUDA Out of Memory During Prefill
You send a long prompt and the server crashes with an OOM error. This happens when the prefill phase tries to process the entire prompt at once.
Fix it by reducing the chunked prefill size:
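For example, relaunching with half the default chunk size:

```bash
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --chunked-prefill-size 4096
```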
Dropping --chunked-prefill-size from the default 8192 to 4096 (or 2048 for really tight memory) makes SGLang process the prompt in smaller chunks. It’s slightly slower on long inputs but doesn’t blow up.
CUDA Error: Illegal Memory Access
This is a frustrating one. The error message says “illegal memory access” but the actual cause is often OOM masquerading as a kernel error. Try the OOM fixes first – lower --mem-fraction-static to 0.7 and reduce --max-running-requests. If it persists with plenty of free memory, it’s a genuine kernel bug and you should file an issue on the SGLang GitHub.
Server Hangs on Startup
If the server freezes during initialization, check your GPU memory. Run nvidia-smi in another terminal. If available memory is near zero, something else is using the GPU. Kill stale processes with:
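One way to do it, using nvidia-smi’s query mode:

```bash
# Show which PIDs are holding GPU memory, then kill the stale ones.
nvidia-smi --query-compute-apps=pid,used_memory --format=csv
kill -9 <PID>   # replace <PID> with the offending process id
```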
For multi-GPU setups, hangs often come from NCCL communication failures. Make sure --shm-size is set to at least 32g in Docker and --ipc=host is present.
Non-Deterministic Outputs at Temperature 0
Even with temperature=0, you might get slightly different outputs across identical requests. This is expected: SGLang’s dynamic batching (responsible for about 95% of the variance) and prefix caching (about 5%) change how requests are grouped into kernels, and the resulting minor floating-point differences can occasionally flip a token.
If you need bit-exact determinism for testing, disable both:
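One way to approximate that launch: --disable-radix-cache turns off prefix caching, and capping --max-running-requests at 1 keeps batch composition fixed by serving a single request at a time:

```bash
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --disable-radix-cache \
  --max-running-requests 1
```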
This kills throughput, so only use it for evaluation and debugging – never in production.
SGLang vs vLLM: When to Pick Which
Use SGLang when your workload has shared prefixes (same system prompt, few-shot examples, or RAG context across requests). RadixAttention gives you a real advantage here. SGLang also handles structured output generation natively, which matters if you’re enforcing JSON schemas.
Use vLLM if your prompts are all unique with no shared prefixes, or if you need a specific feature that vLLM has and SGLang doesn’t (like particular quantization formats or model architectures).
Both expose OpenAI-compatible APIs, so switching between them is a one-line config change. Start with SGLang for most production chat and agent workloads – the prefix caching alone usually makes it worth it.