LitServe is Lightning AI’s framework for turning ML models into production APIs. It wraps FastAPI with features you actually need for inference: automatic batching, GPU device management, and multi-worker scaling. You write a Python class, define four methods, and LitServe handles the rest.
The framework claims 2x faster throughput than plain FastAPI for AI workloads, and from testing, that holds up once you enable batching. Here’s how to build a complete serving pipeline from scratch.
Install Dependencies
You need three packages: litserve for the server, transformers for the model, and torch as the backend.
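Assuming a standard Python environment, the install is a single command:

```bash
pip install litserve transformers torch
```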
LitServe requires Python 3.10 or higher. As of this writing, the latest version is 0.2.17.
Define the LitAPI Class
LitServe’s core abstraction is the LitAPI class. You subclass it and implement four methods:
- `setup(device)` – loads your model once at server startup
- `decode_request(request)` – extracts input data from the incoming JSON
- `predict(x)` – runs inference on the decoded input
- `encode_response(output)` – formats the model output as a JSON response
Here’s a complete server that serves a HuggingFace text classification model:
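A sketch of such a server, saved as server.py. The model checkpoint (distilbert-base-uncased-finetuned-sst-2-english) and the class name are illustrative choices, not something LitServe prescribes:

```python
# server.py — minimal LitServe text-classification server (sketch).
import litserve as ls
from transformers import pipeline


class TextClassifierAPI(ls.LitAPI):
    def setup(self, device):
        # Runs once per worker at startup; load the model here.
        self.model = pipeline(
            "text-classification",
            model="distilbert-base-uncased-finetuned-sst-2-english",
            device=device,
        )

    def decode_request(self, request):
        # Pull the input text out of the incoming JSON payload.
        return request["text"]

    def predict(self, x):
        # Run inference on the decoded input.
        return self.model(x)

    def encode_response(self, output):
        # pipeline() returns a list of dicts like {"label": ..., "score": ...}
        return {"label": output[0]["label"], "score": output[0]["score"]}


if __name__ == "__main__":
    server = ls.LitServer(TextClassifierAPI(), accelerator="auto")
    server.run(port=8000)
```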
A few things to note:

- The `device` parameter in `setup` is managed by LitServe. When you set `accelerator="auto"`, it picks `cuda` if a GPU is available, otherwise falls back to `cpu`.
- The HuggingFace `pipeline` accepts a `device` argument directly, so you pass it straight through.
- `predict` receives whatever `decode_request` returns, not the raw HTTP request.
Start the server:
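Assuming the code above lives in a file named server.py:

```bash
python server.py
```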
You’ll see FastAPI/Uvicorn output with the server running on http://localhost:8000.
Test with a Client
Send a POST request to the /predict endpoint:
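One way to do this from Python with requests (the "text" payload key matches the decode_request sketch earlier; adjust it to whatever your API expects):

```python
import requests

# Assumes the server is listening on localhost:8000 and expects
# a JSON body with a "text" field.
response = requests.post(
    "http://localhost:8000/predict",
    json={"text": "LitServe makes model serving painless."},
)
print(response.status_code)  # 200 on success
print(response.json())       # e.g. {"label": "...", "score": ...}
```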
Or use curl:
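The equivalent request with curl:

```bash
curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{"text": "LitServe makes model serving painless."}'
```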
Enable Batching for Higher Throughput
Batching groups multiple incoming requests and processes them together in a single forward pass. This is where GPU utilization actually improves – a single request barely scratches the surface of what a GPU can handle.
Pass max_batch_size to the LitAPI constructor:
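A sketch of that configuration, assuming a LitAPI subclass named TextClassifierAPI (note that some older LitServe releases took max_batch_size on LitServer instead):

```python
import litserve as ls

# TextClassifierAPI is a hypothetical LitAPI subclass defined elsewhere.
api = TextClassifierAPI(max_batch_size=8)
server = ls.LitServer(api, accelerator="auto")
server.run(port=8000)
```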
When batching is enabled, LitServe collects up to 8 requests and sends them through predict as a batch. The HuggingFace pipeline already handles list inputs, so adapting your API class to process batches takes only a small change:
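A sketch of the batched version, reusing the same illustrative checkpoint (class and model names are my choices):

```python
import litserve as ls
from transformers import pipeline


class BatchedTextClassifierAPI(ls.LitAPI):
    def setup(self, device):
        self.model = pipeline(
            "text-classification",
            model="distilbert-base-uncased-finetuned-sst-2-english",
            device=device,
        )

    def decode_request(self, request):
        return request["text"]

    def predict(self, batch):
        # With max_batch_size set, batch is a list of decoded inputs.
        # pipeline() accepts a list and returns one result dict per input.
        return self.model(batch)

    def encode_response(self, output):
        # Called once per request, with that request's element of the
        # list returned by predict.
        return {"label": output["label"], "score": output["score"]}
```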
When max_batch_size is set, predict receives a list of decoded inputs instead of a single value. LitServe handles splitting the batch response back to individual clients automatically – each encode_response call receives one element from the list returned by predict.
You can also set a batch timeout to avoid waiting forever for a full batch:
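One way this might look, reusing the hypothetical batched class name from before:

```python
import litserve as ls

# batch_timeout is in seconds: 0.05 = 50 ms.
api = BatchedTextClassifierAPI(max_batch_size=8, batch_timeout=0.05)
server = ls.LitServer(api, accelerator="auto")
server.run(port=8000)
```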
This waits at most 50 milliseconds before processing whatever requests have accumulated, even if fewer than 8.
Add GPU Acceleration
For GPU serving, set the accelerator parameter on LitServer:
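A minimal sketch (TextClassifierAPI is a stand-in for your LitAPI subclass):

```python
import litserve as ls

server = ls.LitServer(TextClassifierAPI(), accelerator="gpu")
server.run(port=8000)
```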
To serve on multiple GPUs with separate workers:
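Something along these lines, with devices selecting how many GPUs to use:

```python
import litserve as ls

# TextClassifierAPI is a stand-in for your LitAPI subclass.
server = ls.LitServer(TextClassifierAPI(), accelerator="gpu", devices=2)
server.run(port=8000)
```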
This spawns one worker per GPU, each with its own copy of the model. LitServe load-balances across workers. For models that fit comfortably on a single GPU, running 2 workers per device can improve throughput by overlapping data loading with inference:
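A sketch of that configuration:

```python
import litserve as ls

server = ls.LitServer(
    TextClassifierAPI(),   # stand-in for your LitAPI subclass
    accelerator="gpu",
    devices=2,
    workers_per_device=2,  # two workers share each GPU
)
server.run(port=8000)
```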
Deploy with Docker
Create a Dockerfile for production deployment:
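A possible Dockerfile, assuming the server code lives in server.py and listens on port 8000:

```dockerfile
FROM python:3.11-slim

WORKDIR /app
RUN pip install --no-cache-dir litserve transformers torch
COPY server.py .

EXPOSE 8000
CMD ["python", "server.py"]
```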
For GPU deployments, use the NVIDIA PyTorch base image instead:
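One option (the image tag here is illustrative; pick one that matches your driver and CUDA version):

```dockerfile
FROM nvcr.io/nvidia/pytorch:24.05-py3

WORKDIR /app
RUN pip install --no-cache-dir litserve transformers
COPY server.py .

EXPOSE 8000
CMD ["python", "server.py"]
```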
Build and run:
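Assuming you tag the image litserve-classifier:

```bash
docker build -t litserve-classifier .
docker run -p 8000:8000 litserve-classifier
```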
For GPU access, add the --gpus flag:
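Using the same illustrative image tag:

```bash
docker run --gpus all -p 8000:8000 litserve-classifier
```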
Load Testing the Endpoint
Use Python’s concurrent.futures to hammer the endpoint with parallel requests:
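A sketch of such a script (the URL, payload, and request counts are assumptions to adjust for your setup):

```python
# loadtest.py — fire N parallel requests and report throughput.
import time
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "http://localhost:8000/predict"
PAYLOAD = {"text": "LitServe makes model serving painless."}


def send_request(_):
    return requests.post(URL, json=PAYLOAD, timeout=30).status_code


def run(total=200, concurrency=32):
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        codes = list(pool.map(send_request, range(total)))
    elapsed = time.perf_counter() - start
    ok = codes.count(200)
    print(f"{ok}/{total} succeeded in {elapsed:.2f}s "
          f"({total / elapsed:.1f} req/s)")


if __name__ == "__main__":
    run()
```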
Run it against your server to compare throughput with and without batching:
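Assuming the script is saved as loadtest.py, run it once with batching disabled and once with it enabled:

```bash
python loadtest.py
```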
You should see a meaningful throughput bump when batching is enabled – typically 3-5x on a GPU depending on model size and batch configuration.
Common Errors and Fixes
RuntimeError: CUDA out of memory
Lower max_batch_size or workers_per_device. Each worker loads a full copy of the model, so 2 workers on a 16GB GPU with a 6GB model leaves ~4GB for inference buffers. Start with max_batch_size=4 and increase until you hit memory limits.
Connection refused on Docker
LitServe binds to 127.0.0.1 by default inside the container. You need to bind to 0.0.0.0 so Docker’s port mapping works:
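One way to do that, assuming run() forwards a host argument to the underlying Uvicorn server (check your LitServe version's run() signature):

```python
# Bind to all interfaces so Docker's port mapping can reach the server.
server.run(port=8000, host="0.0.0.0")
```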
ModuleNotFoundError: No module named 'litserve'
LitServe requires Python 3.10+. Check your Python version with python --version. If you’re on 3.9 or older, upgrade or use a Docker image with a newer Python.
TypeError: predict() got an unexpected keyword argument
This usually happens when switching between batched and non-batched modes. When max_batch_size > 1, predict receives a list. When it’s 1 or unset, it receives a single value. Make sure your predict method signature matches your batching config.
Slow first request
The first request triggers model loading and compilation. For production, add a warmup request in your setup method:
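A sketch of a warmup inside setup, using the same illustrative pipeline as earlier (this method goes inside your LitAPI subclass):

```python
def setup(self, device):
    self.model = pipeline(
        "text-classification",
        model="distilbert-base-uncased-finetuned-sst-2-english",
        device=device,
    )
    # Warmup: one dummy inference so kernel compilation and cache
    # allocation happen at startup, not on the first real request.
    self.model("warmup")
```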
Related Guides
- How to Build a Model Warm Pool with Preloaded Containers on ECS
- How to Build a Model Warm-Up and Health Check Pipeline with FastAPI
- How to Build a Model Endpoint Load Balancer with NGINX and FastAPI
- How to Build Feature Flags for ML Model Rollouts
- How to Build a Model Deployment Pipeline with Terraform and AWS
- How to Build a Model A/B Testing Framework with FastAPI
- How to Build a Model Monitoring Dashboard with Prometheus and Grafana
- How to Build a Model Explainability Dashboard with SHAP and Streamlit
- How to Build a Model Load Testing Pipeline with Locust and FastAPI
- How to Build a Model Dependency Scanner and Vulnerability Checker