The Quick Version
BentoML turns your trained model into a production API with request batching, GPU support, and Docker packaging. You save a model, write a service class, and BentoML handles the rest — serialization, HTTP server, OpenAPI docs, and container builds.
```bash
pip install bentoml torch torchvision
```
```python
import bentoml
import torch
from torchvision.models import resnet50, ResNet50_Weights

# Step 1: Save your trained model to BentoML's model store
model = resnet50(weights=ResNet50_Weights.IMAGENET1K_V2)
model.eval()

saved_model = bentoml.pytorch.save_model(
    "resnet50_classifier",
    model,
    signatures={"__call__": {"batchable": True, "batch_dim": 0}},
)
print(f"Saved: {saved_model.tag}")
# Saved: resnet50_classifier:abc123
```
```python
# Step 2: Create a service (service.py)
import bentoml
import torch
from PIL import Image
from torchvision import transforms

runner = bentoml.pytorch.get("resnet50_classifier:latest").to_runner()
svc = bentoml.Service("image_classifier", runners=[runner])

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@svc.api(input=bentoml.io.Image(), output=bentoml.io.JSON())
async def classify(image: Image.Image) -> dict:
    tensor = preprocess(image).unsqueeze(0)
    output = await runner.async_run(tensor)
    probs = torch.softmax(output, dim=1)  # runner output is already a tensor
    top5 = torch.topk(probs, 5)
    return {
        "predictions": [
            {"class_id": idx.item(), "confidence": prob.item()}
            for prob, idx in zip(top5.values[0], top5.indices[0])
        ]
    }
```
```bash
# Step 3: Run the service
bentoml serve service:svc --reload

# Test it
curl -X POST http://localhost:3000/classify -F "image=@path/to/image.jpg"
```
That gives you a production-grade API server with automatic OpenAPI docs at /docs, health checks, and Prometheus metrics.
Request Batching for GPU Efficiency
GPUs are most efficient when processing batches, not individual requests. BentoML automatically batches incoming requests.
```python
import bentoml

# Configure batching when creating the runner
runner = bentoml.pytorch.get("resnet50_classifier:latest").to_runner(
    max_batch_size=32,   # combine up to 32 requests into one batch
    max_latency_ms=100,  # wait up to 100 ms to fill the batch
)
svc = bentoml.Service("image_classifier", runners=[runner])
```
With 100 concurrent requests, instead of running inference 100 times (one image each), BentoML groups them into ~4 batches of ~25 images. This can be 10-20x faster on a GPU where the bottleneck is kernel launch overhead, not compute.
The max_latency_ms setting is the key tradeoff: higher values collect larger batches (better throughput) but add latency for individual requests. For real-time APIs, keep it at 50-100ms. For batch processing, set it to 500ms+.
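As a back-of-envelope sketch (assuming requests arrive roughly uniformly; real batch sizes depend on traffic bursts and queue depth), the expected batch size is the arrival rate times the batching window, capped at max_batch_size:

```python
def expected_batch_size(requests_per_s: float, max_latency_ms: float, max_batch_size: int) -> int:
    # Requests arriving within one batching window get grouped, up to the cap.
    window_s = max_latency_ms / 1000.0
    return min(max_batch_size, max(1, round(requests_per_s * window_s)))

print(expected_batch_size(1000, 100, 32))  # heavy traffic: the window fills the cap -> 32
print(expected_batch_size(20, 100, 32))    # light traffic: ~2 requests per window -> 2
```

At low traffic the window expires mostly empty, which is why batching only pays off under concurrent load.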
Multi-Model Services
Serve multiple models from a single service — common for pipelines like “detect objects, then classify each one”:
```python
import bentoml
from PIL import Image

detector_runner = bentoml.pytorch.get("yolo_detector:latest").to_runner()
classifier_runner = bentoml.pytorch.get("product_classifier:latest").to_runner()

svc = bentoml.Service(
    "detection_pipeline",
    runners=[detector_runner, classifier_runner],
)

@svc.api(input=bentoml.io.Image(), output=bentoml.io.JSON())
async def detect_and_classify(image: Image.Image) -> dict:
    # Step 1: Detect objects (preprocess_detection / preprocess_classify are
    # model-specific helpers, not shown here)
    detections = await detector_runner.async_run(preprocess_detection(image))

    # Step 2: Classify each detected region
    results = []
    for bbox in detections:
        crop = image.crop(bbox)
        classification = await classifier_runner.async_run(preprocess_classify(crop))
        results.append({
            "bbox": bbox,
            "class": classification["label"],
            "confidence": classification["score"],
        })
    return {"detections": results}
```
Each runner can run on a different device — put the detector on GPU 0 and the classifier on GPU 1, or run lightweight models on CPU while the heavy model uses the GPU.
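One way to sketch that placement is through per-runner resources in the server configuration. The field names below follow BentoML 1.x's runner configuration schema; verify the exact keys against your BentoML version:

```yaml
# bentoml_configuration.yaml — sketch of per-runner device placement
runners:
  yolo_detector:
    resources:
      nvidia.com/gpu: [0]   # pin the detector to GPU 0
  product_classifier:
    resources:
      nvidia.com/gpu: [1]   # pin the classifier to GPU 1
```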
Building and Deploying Containers
BentoML builds optimized Docker containers with a single command:
```bash
# Build a Bento (packaged artifact)
bentoml build

# Containerize it
bentoml containerize image_classifier:latest

# Run the container
docker run --gpus all -p 3000:3000 image_classifier:latest
```
The generated Dockerfile handles Python dependencies, model weights, and CUDA setup automatically. For custom requirements, add a bentofile.yaml:
```yaml
# bentofile.yaml
service: "service:svc"
include:
  - "*.py"
python:
  packages:
    - torch>=2.0
    - torchvision
    - pillow
docker:
  base_image: "pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime"
  env:
    BENTOML_PORT: "3000"
```
```bash
bentoml build -f bentofile.yaml
bentoml containerize image_classifier:latest --platform linux/amd64
```
Serving Hugging Face Models
BentoML integrates directly with Hugging Face transformers:
```python
import bentoml
from transformers import pipeline

# Save a Hugging Face pipeline
sentiment = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)
bentoml.transformers.save_model("sentiment_analyzer", sentiment)

# Create service
runner = bentoml.transformers.get("sentiment_analyzer:latest").to_runner()
svc = bentoml.Service("sentiment_api", runners=[runner])

@svc.api(input=bentoml.io.Text(), output=bentoml.io.JSON())
async def analyze(text: str) -> dict:
    result = await runner.async_run(text)
    return result[0]  # the pipeline returns a list with one {"label", "score"} dict
```
```bash
bentoml serve service:svc

# Test
curl -X POST http://localhost:3000/analyze \
  -H "Content-Type: text/plain" \
  -d "This product is amazing!"
```
Adding Authentication and Rate Limiting
For production deployments, add middleware for auth and rate limits:
```python
import bentoml
from starlette.middleware.base import BaseHTTPMiddleware
from starlette.middleware.cors import CORSMiddleware
from starlette.requests import Request
from starlette.responses import JSONResponse

runner = bentoml.pytorch.get("resnet50_classifier:latest").to_runner()
svc = bentoml.Service(
    "secure_api",
    runners=[runner],
)

# Add CORS
svc.add_asgi_middleware(
    CORSMiddleware,
    allow_origins=["https://yourdomain.com"],
    allow_methods=["POST"],
)

# Custom auth as standard ASGI middleware; rate limiting can be layered
# the same way with any ASGI rate-limit middleware
class APIKeyMiddleware(BaseHTTPMiddleware):
    async def dispatch(self, request: Request, call_next):
        api_key = request.headers.get("X-API-Key")
        valid_keys = {"key_abc123", "key_def456"}
        if request.url.path != "/healthz" and api_key not in valid_keys:
            return JSONResponse({"error": "Invalid API key"}, status_code=401)
        return await call_next(request)

svc.add_asgi_middleware(APIKeyMiddleware)
```
Common Errors and Fixes
`bentoml.exceptions.NotFound: Model not found`
You haven’t saved the model to BentoML’s store. Run `bentoml.pytorch.save_model()` first, or check available models with `bentoml models list`.
GPU not detected in container
Run with `docker run --gpus all`. The base image must include CUDA — use `pytorch/pytorch:*-cuda*` as the base. Check GPU availability inside the container: `python -c "import torch; print(torch.cuda.is_available())"`.
Batching doesn’t improve throughput
Batching pays off by amortizing per-request overhead on the GPU. Check utilization with `nvidia-smi`: if the GPU is already near 100%, it is compute-saturated and larger batches add little; if utilization stays low even under heavy load, the bottleneck is likely CPU-side preprocessing or I/O, which batching the GPU step won’t fix.
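As a rough sketch, that diagnosis can be written as a heuristic (the thresholds here are illustrative assumptions, not BentoML behavior):

```python
def diagnose_batching(gpu_util_percent: float) -> str:
    """Rough reading of nvidia-smi GPU utilization measured under load."""
    if gpu_util_percent >= 90:
        return "compute-bound: larger batches add little"
    if gpu_util_percent >= 50:
        return "batching is working: tune max_batch_size / max_latency_ms"
    return "likely CPU or I/O bound: profile preprocessing and data transfer"

print(diagnose_batching(95))  # compute-bound: larger batches add little
print(diagnose_batching(30))  # likely CPU or I/O bound: profile preprocessing and data transfer
```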
Service crashes on large file uploads
Set upload limits in the BentoML config:
```python
@svc.api(
    input=bentoml.io.Image(),
    output=bentoml.io.JSON(),
    route="/classify",
)
async def classify(image):
    # BentoML enforces size limits via server config, not in the endpoint
    ...
```
Configure in `bentoml_configuration.yaml`: `api_server.max_request_size: 10485760` (10 MB).
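As a sketch, the corresponding YAML nesting would be as follows (key path as given above; verify against your BentoML version’s configuration schema):

```yaml
# bentoml_configuration.yaml
api_server:
  max_request_size: 10485760  # 10 MB
```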
Cold start takes too long
Model loading happens when the first request arrives. Add a warmup in your service:
```python
@svc.on_startup
async def warmup():
    dummy = torch.randn(1, 3, 224, 224)
    await runner.async_run(dummy)
    print("Model warmed up")
```