Why SGLang Over Other Serving Frameworks
SGLang has quietly become the default inference engine at shops running serious LLM workloads. It powers trillions of tokens per day across companies like xAI, Cursor, and LinkedIn. The reason is RadixAttention – a prefix caching system that reuses KV cache across requests sharing common prefixes. If you’re sending the same system prompt to every request (and you probably are), SGLang caches that work instead of recomputing it.
Compared to vLLM, SGLang consistently wins on time-to-first-token (TTFT) for workloads with shared prefixes. It also has a zero-overhead CPU scheduler, prefill-decode disaggregation, and native support for structured output generation. The current stable release is v0.5.8.
Install SGLang
The fastest path is uv (the Rust-based pip replacement):
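A minimal setup looks something like this; the [all] extra is one reasonable choice and pulls in the full serving stack, so trim it if you only need the client:

```bash
# Install uv, create a virtual environment, then install SGLang
# with its serving dependencies.
pip install uv
uv venv && source .venv/bin/activate
uv pip install "sglang[all]"
```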
SGLang needs a GPU with compute capability sm75 or higher (T4, A10, A100, L4, L40S, H100, or newer). FlashInfer is the default attention backend and it won’t work on older cards.
Start the Server
Launch a model with a single command:
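A typical launch looks like this; the model path is just an example, so swap in whatever you serve:

```bash
# Serve an 8B instruct model on SGLang's default port (30000).
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 \
  --port 30000
```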
The --mem-fraction-static flag controls the fraction of GPU memory reserved for static allocations, meaning the model weights plus the KV cache pool. The default is 0.88. Drop it to 0.8 or 0.7 if you’re getting OOM errors during decoding.
Once the server prints “The server is fired up and ready to roll”, you’re live.
Send Requests with the OpenAI SDK
SGLang exposes an OpenAI-compatible API at /v1. Point the standard OpenAI Python client at it:
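A minimal sketch, assuming the server from the previous step is running on port 30000 and the model name mirrors whatever you passed to --model-path:

```python
from openai import OpenAI

# Any placeholder key works unless the server was started with --api-key.
client = OpenAI(base_url="http://localhost:30000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # same path you launched with
    messages=[
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "What does RadixAttention cache?"},
    ],
    temperature=0,
)
print(response.choices[0].message.content)
```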
Every framework that integrates with OpenAI’s API – LangChain, LlamaIndex, AutoGen, your custom code – works here with just a base_url change. No SDK swap needed.
Deploy with Docker
For production, use the official Docker images. The runtime variant strips out build tools and dev dependencies, cutting image size by roughly 40%.
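Something along these lines, assuming the standard lmsysorg/sglang image; swap in the runtime tag you actually deploy:

```bash
# HF_TOKEN is only needed for gated models like Llama.
docker run --gpus all \
  --shm-size 32g \
  --ipc=host \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e HF_TOKEN=$HF_TOKEN \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path meta-llama/Llama-3.1-8B-Instruct \
    --host 0.0.0.0 \
    --port 30000
```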
The --shm-size 32g is not optional. SGLang uses shared memory for inter-process communication and will fail silently or hang without it. The -v mount caches Hugging Face model weights on the host so restarts don’t re-download multi-gigabyte files.
Use the Native Generate API
Besides the OpenAI-compatible endpoints, SGLang has its own /generate endpoint that exposes more control:
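A quick sketch using requests; the prompt and sampling parameters are placeholders:

```python
import requests

resp = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "List three uses for a spare GPU:",
        "sampling_params": {
            "temperature": 0.7,
            "max_new_tokens": 128,
        },
    },
)
data = resp.json()
print(data["text"])       # the generated completion
print(data["meta_info"])  # token counts and timing details
```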
The meta_info field gives you prompt token count, completion token count, and latency breakdown – useful for building dashboards without bolting on a separate metrics layer.
Multi-GPU with Tensor Parallelism
For models that don’t fit on a single GPU, split them across cards:
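For example, a 70B model across two GPUs (the model path is again just illustrative):

```bash
# --tp 2 shards each layer's weights across two GPUs.
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-70B-Instruct \
  --tp 2 \
  --port 30000
```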
The --tp 2 flag shards the model across 2 GPUs using tensor parallelism. For a 70B model, you need at least 2x 80GB GPUs (A100 or H100). With 4x A100s, use --tp 4 for better throughput.
SGLang also supports expert parallelism (--ep) for MoE models like DeepSeek V3 and data parallelism (--dp) for scaling throughput across GPUs serving the same model.
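As a rough sketch, data parallelism looks like this: two full replicas of a small model behind one endpoint, with the scheduler spreading requests across them:

```bash
# Assumes two GPUs, each large enough to hold the full model.
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --dp 2 \
  --port 30000
```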
Key Server Flags
| Flag | What It Does | Default |
|---|---|---|
| --mem-fraction-static | GPU memory fraction for model weights + KV cache pool | 0.88 |
| --tp | Tensor parallel GPUs | 1 |
| --dp | Data parallel replicas | 1 |
| --chunked-prefill-size | Chunk size for long prompt prefill | 8192 |
| --max-running-requests | Max concurrent decoding requests | auto |
| --quantization | Quantization (fp8, int4, awq, gptq) | None |
| --disable-radix-cache | Turn off prefix caching | false |
| --context-length | Override model’s default context length | Model default |
Health Checks
SGLang provides two health endpoints. Use the basic one for load balancer probes and the generate one for deeper validation:
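Assuming the default port:

```bash
# Liveness: cheap check that the HTTP server is responding.
curl http://localhost:30000/health

# Readiness: runs a tiny generation through the model itself.
curl http://localhost:30000/health_generate
```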
Troubleshooting Real Errors
CUDA Out of Memory During Prefill
You send a long prompt and the server crashes with an OOM error. This happens when the prefill phase tries to process the entire prompt at once.
Fix it by reducing the chunked prefill size:
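For example, relaunching with half the default chunk size:

```bash
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --chunked-prefill-size 4096
```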
Dropping --chunked-prefill-size from the default 8192 to 4096 (or 2048 for really tight memory) makes SGLang process the prompt in smaller chunks. It’s slightly slower on long inputs but doesn’t blow up.
CUDA Error: Illegal Memory Access
This is a frustrating one. The error message says “illegal memory access” but the actual cause is often OOM masquerading as a kernel error. Try the OOM fixes first – lower --mem-fraction-static to 0.7 and reduce --max-running-requests. If it persists with plenty of free memory, it’s a genuine kernel bug and you should file an issue on the SGLang GitHub.
Server Hangs on Startup
If the server freezes during initialization, check your GPU memory. Run nvidia-smi in another terminal. If available memory is near zero, something else is using the GPU. Kill stale processes with:
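One way to do it, using nvidia-smi’s query mode:

```bash
# Show which PIDs are holding GPU memory, then kill the stale ones.
nvidia-smi --query-compute-apps=pid,used_memory --format=csv
kill -9 <PID>   # replace <PID> with the offending process id
```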
For multi-GPU setups, hangs often come from NCCL communication failures. Make sure --shm-size is set to at least 32g in Docker and --ipc=host is present.
Non-Deterministic Outputs at Temperature 0
Even with temperature=0, you might get slightly different outputs across identical requests. This is expected: SGLang’s dynamic batching (responsible for about 95% of the variance) and prefix caching (about 5%) change how requests are grouped into kernels, and the resulting minor floating-point differences can occasionally flip a token.
If you need bit-exact determinism for testing, disable both:
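One way to approximate that launch: --disable-radix-cache turns off prefix caching, and capping --max-running-requests at 1 keeps batch composition fixed by serving a single request at a time:

```bash
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --disable-radix-cache \
  --max-running-requests 1
```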
This kills throughput, so only use it for evaluation and debugging – never in production.
SGLang vs vLLM: When to Pick Which
Use SGLang when your workload has shared prefixes (same system prompt, few-shot examples, or RAG context across requests). RadixAttention gives you a real advantage here. SGLang also handles structured output generation natively, which matters if you’re enforcing JSON schemas.
Use vLLM if your prompts are all unique with no shared prefixes, or if you need a specific feature that vLLM has and SGLang doesn’t (like particular quantization formats or model architectures).
Both expose OpenAI-compatible APIs, so switching between them is a one-line config change. Start with SGLang for most production chat and agent workloads – the prefix caching alone usually makes it worth it.