LitServe is Lightning AI’s framework for turning ML models into production APIs. It wraps FastAPI with features you actually need for inference: automatic batching, GPU device management, and multi-worker scaling. You write a Python class, define four methods, and LitServe handles the rest.

The framework claims 2x higher throughput than plain FastAPI for AI workloads, and in my testing that holds up once batching is enabled. Here’s how to build a complete serving pipeline from scratch.

Install Dependencies

You need three packages: litserve for the server, transformers for the model, and torch as the backend.

pip install litserve transformers torch

LitServe requires Python 3.10 or higher. As of this writing, the latest version is 0.2.17.

Define the LitAPI Class

LitServe’s core abstraction is the LitAPI class. You subclass it and implement four methods:

  • setup(device) – loads your model once at server startup
  • decode_request(request) – extracts input data from the incoming JSON
  • predict(x) – runs inference on the decoded input
  • encode_response(output) – formats the model output as a JSON response

Here’s a complete server that serves a HuggingFace text classification model:

# server.py
import litserve as ls
from transformers import pipeline


class TextClassificationAPI(ls.LitAPI):
    def setup(self, device):
        self.classifier = pipeline(
            "text-classification",
            model="distilbert-base-uncased-finetuned-sst-2-english",
            device=device,
        )

    def decode_request(self, request):
        return request["text"]

    def predict(self, text):
        result = self.classifier(text)
        return result[0]

    def encode_response(self, output):
        return {
            "label": output["label"],
            "score": round(output["score"], 4),
        }


if __name__ == "__main__":
    api = TextClassificationAPI()
    server = ls.LitServer(api, accelerator="auto")
    server.run(port=8000)

A few things to note:

  • The device parameter in setup is managed by LitServe. When you set accelerator="auto", it picks cuda if a GPU is available, otherwise falls back to cpu.
  • The HuggingFace pipeline accepts a device argument directly, so you pass it straight through.
  • predict receives whatever decode_request returns, not the raw HTTP request.

Start the server:

python server.py

You’ll see FastAPI/Uvicorn output with the server running on http://localhost:8000.

Test with a Client

Send a POST request to the /predict endpoint:

# client.py
import requests

response = requests.post(
    "http://localhost:8000/predict",
    json={"text": "This movie was absolutely fantastic and exceeded all my expectations."},
)
print(response.json())
# {"label": "POSITIVE", "score": 0.9999}

Or use curl:

curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{"text": "This movie was absolutely fantastic."}'

Enable Batching for Higher Throughput

Batching groups multiple incoming requests and processes them together in a single forward pass. This is where GPU utilization actually improves – a single request barely scratches the surface of what a GPU can handle.

Pass max_batch_size to the LitAPI constructor:

if __name__ == "__main__":
    api = TextClassificationAPI(max_batch_size=8)
    server = ls.LitServer(api, accelerator="auto")
    server.run(port=8000)

When batching is enabled, LitServe collects up to 8 requests and sends them through predict as a batch. Because the HuggingFace pipeline already accepts list inputs, the only change needed is in predict, which now receives and returns a list:

class BatchedTextClassificationAPI(ls.LitAPI):
    def setup(self, device):
        self.classifier = pipeline(
            "text-classification",
            model="distilbert-base-uncased-finetuned-sst-2-english",
            device=device,
        )

    def decode_request(self, request):
        return request["text"]

    def predict(self, batch):
        # batch is a list of decoded inputs when batching is enabled
        results = self.classifier(batch)
        return results

    def encode_response(self, output):
        return {
            "label": output["label"],
            "score": round(output["score"], 4),
        }


if __name__ == "__main__":
    api = BatchedTextClassificationAPI(max_batch_size=8)
    server = ls.LitServer(api, accelerator="auto")
    server.run(port=8000)

When max_batch_size is set, predict receives a list of decoded inputs instead of a single value. LitServe handles splitting the batch response back to individual clients automatically – each encode_response call receives one element from the list returned by predict.
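The split-and-merge behavior can be illustrated with a plain-Python sketch (no LitServe involved); serve_batch and its arguments here are stand-ins for LitServe's internals and the API methods above, not real library code:

```python
def serve_batch(requests, decode, predict, encode):
    """Mimic LitServe's batching flow: decode each request, run predict
    once on the whole batch, then encode each element of the result
    for the client that sent the corresponding request."""
    decoded = [decode(r) for r in requests]   # per-request decode_request
    outputs = predict(decoded)                # one batched predict call
    return [encode(o) for o in outputs]       # per-request encode_response


# Toy stand-ins for the API methods above
responses = serve_batch(
    [{"text": "great"}, {"text": "awful"}],
    decode=lambda r: r["text"],
    predict=lambda batch: [{"label": t.upper(), "score": 1.0} for t in batch],
    encode=lambda o: {"label": o["label"]},
)
# responses -> [{"label": "GREAT"}, {"label": "AWFUL"}]
```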

You can also set a batch timeout to avoid waiting forever for a full batch:

api = BatchedTextClassificationAPI(max_batch_size=8, batch_timeout=0.05)

This waits at most 50 milliseconds before processing whatever requests have accumulated, even if fewer than 8.
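As a back-of-envelope way to pick these two values: the average batch you can expect is roughly the arrival rate times the timeout, capped at max_batch_size. This small helper is my own sketch, not part of LitServe:

```python
def expected_batch_size(arrival_rate_rps, batch_timeout_s, max_batch_size):
    # Requests arriving within one timeout window, capped by the batch
    # limit and floored at 1 (a lone request is never held past the timeout).
    return min(max_batch_size, max(1, round(arrival_rate_rps * batch_timeout_s)))


print(expected_batch_size(100, 0.05, 8))  # ~5 requests per batch at 100 req/s
print(expected_batch_size(400, 0.05, 8))  # capped at max_batch_size -> 8
```

If your traffic rarely fills a batch, a larger timeout trades a little latency for better GPU utilization; if batches fill instantly, the timeout barely matters.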

Add GPU Acceleration

For GPU serving, set the accelerator parameter on LitServer:

server = ls.LitServer(api, accelerator="gpu")

To serve on multiple GPUs with separate workers:

server = ls.LitServer(
    api,
    accelerator="gpu",
    devices=[0, 1],
    workers_per_device=1,
)

This spawns one worker per GPU, each with its own copy of the model. LitServe load-balances across workers. For models that fit comfortably on a single GPU, running 2 workers per device can improve throughput by overlapping data loading with inference:

server = ls.LitServer(
    api,
    accelerator="gpu",
    devices=1,
    workers_per_device=2,
)

Deploy with Docker

Create a Dockerfile for production deployment:

FROM python:3.11-slim

WORKDIR /app

RUN pip install --no-cache-dir litserve transformers torch --index-url https://download.pytorch.org/whl/cpu

COPY server.py .

EXPOSE 8000

CMD ["python", "server.py"]

For GPU deployments, use the NVIDIA PyTorch base image instead:

FROM pytorch/pytorch:2.2.0-cuda12.1-cudnn8-runtime

WORKDIR /app

RUN pip install --no-cache-dir litserve transformers

COPY server.py .

EXPOSE 8000

CMD ["python", "server.py"]

Build and run:

docker build -t litserve-classifier .
docker run --rm -p 8000:8000 litserve-classifier

For GPU access, add the --gpus flag:

docker run --rm --gpus all -p 8000:8000 litserve-classifier
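If you prefer Compose, the same GPU access can be requested declaratively. This is a sketch assuming the litserve-classifier image built above and a host with the NVIDIA container toolkit installed:

```yaml
# docker-compose.yml (sketch)
services:
  classifier:
    image: litserve-classifier
    ports:
      - "8000:8000"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```

Start it with docker compose up.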

Load Testing the Endpoint

Use Python’s concurrent.futures to hammer the endpoint with parallel requests:

# load_test.py
import time
import requests
from concurrent.futures import ThreadPoolExecutor, as_completed

URL = "http://localhost:8000/predict"
PAYLOAD = {"text": "LitServe handles batching and GPU scaling automatically."}
NUM_REQUESTS = 200
MAX_WORKERS = 20


def send_request(_):
    start = time.time()
    resp = requests.post(URL, json=PAYLOAD)
    elapsed = time.time() - start
    return resp.status_code, elapsed


start_time = time.time()
results = []

with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
    futures = [pool.submit(send_request, i) for i in range(NUM_REQUESTS)]
    for future in as_completed(futures):
        results.append(future.result())

total_time = time.time() - start_time
success = sum(1 for status, _ in results if status == 200)
latencies = [elapsed for _, elapsed in results]

print(f"Total requests: {NUM_REQUESTS}")
print(f"Successful: {success}")
print(f"Total time: {total_time:.2f}s")
print(f"Throughput: {NUM_REQUESTS / total_time:.1f} req/s")
print(f"Avg latency: {sum(latencies) / len(latencies) * 1000:.1f}ms")
print(f"P99 latency: {sorted(latencies)[int(len(latencies) * 0.99)] * 1000:.1f}ms")

Run it against your server to compare throughput with and without batching:

python load_test.py

You should see a meaningful throughput bump when batching is enabled – typically 3-5x on a GPU depending on model size and batch configuration.

Common Errors and Fixes

RuntimeError: CUDA out of memory

Lower max_batch_size or workers_per_device. Each worker loads a full copy of the model, so 2 workers on a 16GB GPU with a 6GB model leaves ~4GB for inference buffers. Start with max_batch_size=4 and increase until you hit memory limits.
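The same arithmetic, as a quick helper you can adapt when sizing workers (the numbers mirror the 16GB/6GB example; nothing here is LitServe API):

```python
def gpu_headroom_gb(total_gb, model_gb, workers_per_device):
    """GPU memory left for activations and batch buffers after each
    worker loads its own full copy of the model."""
    return total_gb - model_gb * workers_per_device


print(gpu_headroom_gb(16, 6, 2))  # 4 GB left for inference buffers
```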

Connection refused on Docker

LitServe binds to 127.0.0.1 by default inside the container. You need to bind to 0.0.0.0 so Docker’s port mapping works:

server.run(port=8000, host="0.0.0.0")

ModuleNotFoundError: No module named 'litserve'

LitServe requires Python 3.10+. Check your Python version with python --version. If you’re on 3.9 or older, upgrade or use a Docker image with a newer Python.

TypeError: predict() got an unexpected keyword argument

This usually happens when switching between batched and non-batched modes. When max_batch_size > 1, predict receives a list. When it’s 1 or unset, it receives a single value. Make sure your predict method signature matches your batching config.
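One way to make predict robust to either mode is to normalize its input up front. This helper is my own suggestion, not a LitServe API:

```python
def as_batch(x):
    """Return (list_of_inputs, was_single) so predict can always run
    the batched code path, then unwrap single results at the end."""
    if isinstance(x, list):
        return x, False
    return [x], True


# Inside predict:
#     batch, single = as_batch(inputs)
#     results = self.classifier(batch)
#     return results[0] if single else results
```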

Slow first request

The first request triggers model loading and compilation. For production, add a warmup request in your setup method:

def setup(self, device):
    self.classifier = pipeline(
        "text-classification",
        model="distilbert-base-uncased-finetuned-sst-2-english",
        device=device,
    )
    # Warmup inference to trigger JIT compilation
    self.classifier("warmup text")