The Quick Version

BentoML turns your trained model into a production API with request batching, GPU support, and Docker packaging. You save a model, write a service class, and BentoML handles the rest — serialization, HTTP server, OpenAPI docs, and container builds.

pip install bentoml torch torchvision
import bentoml
import torch
from torchvision.models import resnet50, ResNet50_Weights

# Step 1: Save your trained model to BentoML's model store
model = resnet50(weights=ResNet50_Weights.IMAGENET1K_V2)
model.eval()

saved_model = bentoml.pytorch.save_model(
    "resnet50_classifier",
    model,
    signatures={"__call__": {"batchable": True, "batch_dim": 0}},
)
print(f"Saved: {saved_model.tag}")
# Saved: resnet50_classifier:abc123
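The `batchable: True, batch_dim: 0` signature tells BentoML it may merge concurrent requests along dimension 0. A NumPy sketch of the idea — an illustration only, not BentoML internals:

```python
import numpy as np

# Three concurrent requests, each a single image of shape (1, 3, 224, 224)
requests = [np.random.rand(1, 3, 224, 224) for _ in range(3)]

# The batcher concatenates them along batch_dim=0 into one (3, 3, 224, 224) batch...
batch = np.concatenate(requests, axis=0)
print(batch.shape)  # (3, 3, 224, 224)

# ...runs the model once, then splits the output back into per-request responses
outputs = np.split(batch, len(requests), axis=0)
print(outputs[0].shape)  # (1, 3, 224, 224)
```

This is why `batch_dim` matters: it must be the axis your model treats as the batch axis, or merged requests would corrupt each other's inputs.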
# Step 2: Create a service (service.py)
import bentoml
import torch
import numpy as np
from PIL import Image
from torchvision import transforms

runner = bentoml.pytorch.get("resnet50_classifier:latest").to_runner()
svc = bentoml.Service("image_classifier", runners=[runner])

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@svc.api(input=bentoml.io.Image(), output=bentoml.io.JSON())
async def classify(image: Image.Image) -> dict:
    tensor = preprocess(image).unsqueeze(0)
    output = await runner.async_run(tensor)
    probs = torch.softmax(output, dim=1)
    top5 = torch.topk(probs, 5)

    return {
        "predictions": [
            {"class_id": idx.item(), "confidence": prob.item()}
            for prob, idx in zip(top5.values[0], top5.indices[0])
        ]
    }
# Step 3: Run the service
bentoml serve service:svc --reload

# Test it
curl -X POST http://localhost:3000/classify -F "[email protected]"

That gives you a production-grade API server with automatic OpenAPI docs at /docs, health checks, and Prometheus metrics.

Request Batching for GPU Efficiency

GPUs are most efficient when processing batches, not individual requests. BentoML automatically batches incoming requests.

import bentoml

# Configure batching when creating the runner
runner = bentoml.pytorch.get("resnet50_classifier:latest").to_runner(
    max_batch_size=32,      # combine up to 32 requests into one batch
    max_latency_ms=100,     # wait up to 100ms to fill the batch
)
svc = bentoml.Service("image_classifier", runners=[runner])

With 100 concurrent requests, instead of running inference 100 times (one image each), BentoML groups them into ~4 batches of ~25 images. This can be 10-20x faster on a GPU, where the bottleneck is usually per-request overhead — Python dispatch, host-to-device transfers, kernel launches — rather than raw compute.

The max_latency_ms setting is the key tradeoff: higher values collect larger batches (better throughput) but add latency for individual requests. For real-time APIs, keep it at 50-100ms. For batch processing, set it to 500ms+.
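A back-of-envelope model of the tradeoff (the millisecond figures are hypothetical, not benchmarks):

```python
# Hypothetical per-request costs on a GPU
overhead_ms = 5.0   # dispatch, transfer, kernel launch — paid once per call
compute_ms = 0.5    # per-image compute once the kernel is running

requests = 100

# One inference call per request: overhead is paid 100 times
unbatched_ms = requests * (overhead_ms + compute_ms)

# Batched into 4 calls of 25: overhead is paid only 4 times
batches = 4
batched_ms = batches * overhead_ms + requests * compute_ms

print(unbatched_ms)  # 550.0
print(batched_ms)    # 70.0 -> roughly 8x faster under these toy numbers
```

The ratio grows as per-call overhead dominates per-image compute, which is exactly the regime small models on fast GPUs live in.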

Multi-Model Services

Serve multiple models from a single service — common for pipelines like “detect objects, then classify each one”:

import bentoml
from PIL import Image

detector_runner = bentoml.pytorch.get("yolo_detector:latest").to_runner()
classifier_runner = bentoml.pytorch.get("product_classifier:latest").to_runner()

svc = bentoml.Service(
    "detection_pipeline",
    runners=[detector_runner, classifier_runner],
)

@svc.api(input=bentoml.io.Image(), output=bentoml.io.JSON())
async def detect_and_classify(image: Image.Image) -> dict:
    # Step 1: Detect objects. preprocess_detection() is your own helper
    # that turns the PIL image into the detector's input tensor.
    detections = await detector_runner.async_run(preprocess_detection(image))

    # Step 2: Classify each detected region. Each bbox must be a
    # (left, upper, right, lower) tuple for Image.crop().
    results = []
    for bbox in detections:
        crop = image.crop(bbox)
        classification = await classifier_runner.async_run(preprocess_classify(crop))
        results.append({
            "bbox": bbox,
            "class": classification["label"],
            "confidence": classification["score"],
        })

    return {"detections": results}

Each runner can run on a different device — put the detector on GPU 0 and the classifier on GPU 1, or run lightweight models on CPU while the heavy model uses the GPU.
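Device placement is configured per runner in `bentoml_configuration.yaml` rather than in code. A sketch using the runner names above — verify the resource keys against your BentoML version:

```yaml
runners:
  yolo_detector:
    resources:
      nvidia.com/gpu: 1      # give the heavy detector a GPU
  product_classifier:
    resources:
      cpu: 4                 # run the lighter classifier on CPU workers
```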

Building and Deploying Containers

BentoML builds optimized Docker containers with a single command:

# Build a Bento (packaged artifact)
bentoml build

# Containerize it
bentoml containerize image_classifier:latest

# Run the container
docker run --gpus all -p 3000:3000 image_classifier:latest

The generated Dockerfile handles Python dependencies, model weights, and CUDA setup automatically. For custom requirements, add a bentofile.yaml:

# bentofile.yaml
service: "service:svc"
include:
  - "*.py"
python:
  packages:
    - torch>=2.0
    - torchvision
    - pillow
docker:
  base_image: "pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime"
  env:
    BENTOML_PORT: "3000"
bentoml build -f bentofile.yaml
bentoml containerize image_classifier:latest --platform linux/amd64

Serving Hugging Face Models

BentoML integrates directly with Hugging Face transformers:

import bentoml
from transformers import pipeline

# Save a HuggingFace pipeline
sentiment = pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english")
bentoml.transformers.save_model("sentiment_analyzer", sentiment)

# Create service
runner = bentoml.transformers.get("sentiment_analyzer:latest").to_runner()
svc = bentoml.Service("sentiment_api", runners=[runner])

@svc.api(input=bentoml.io.Text(), output=bentoml.io.JSON())
async def analyze(text: str) -> dict:
    result = await runner.async_run(text)
    return result[0]
bentoml serve service:svc

# Test
curl -X POST http://localhost:3000/analyze \
  -H "Content-Type: text/plain" \
  -d "This product is amazing!"

Adding Authentication and Rate Limiting

For production deployments, add ASGI middleware for cross-cutting concerns — the example below covers CORS and API-key auth; a rate limiter plugs in the same way:

import bentoml
from starlette.middleware import Middleware
from starlette.middleware.cors import CORSMiddleware

svc = bentoml.Service(
    "secure_api",
    runners=[runner],
)

# Add CORS
svc.add_asgi_middleware(
    CORSMiddleware,
    allow_origins=["https://yourdomain.com"],
    allow_methods=["POST"],
)

# Custom auth middleware — a standard Starlette ASGI middleware,
# registered the same way as CORS above
from starlette.middleware.base import BaseHTTPMiddleware
from starlette.requests import Request
from starlette.responses import JSONResponse

class APIKeyMiddleware(BaseHTTPMiddleware):
    async def dispatch(self, request: Request, call_next):
        api_key = request.headers.get("X-API-Key")
        valid_keys = {"key_abc123", "key_def456"}

        if request.url.path != "/healthz" and api_key not in valid_keys:
            return JSONResponse({"error": "Invalid API key"}, status_code=401)

        return await call_next(request)

svc.add_asgi_middleware(APIKeyMiddleware)

Common Errors and Fixes

bentoml.exceptions.NotFound: Model not found

You haven’t saved the model to BentoML’s store. Run bentoml.pytorch.save_model() first, or check available models with bentoml models list.

GPU not detected in container

Run with docker run --gpus all. The base image must include CUDA — use pytorch/pytorch:*-cuda* as the base. Check GPU availability inside the container: python -c "import torch; print(torch.cuda.is_available())".

Batching doesn’t improve throughput

Your pipeline may be I/O bound rather than compute bound — batching helps most when GPU compute is the bottleneck. Check GPU utilization with nvidia-smi; if it's below 50%, the time is going to preprocessing, network, or disk, and batching won't help much.

Service crashes on large file uploads

Set upload limits in the BentoML config:

@svc.api(
    input=bentoml.io.Image(),
    output=bentoml.io.JSON(),
    route="/classify",
)
async def classify(image):
    # BentoML handles size limits via server config
    ...

Configure in bentoml_configuration.yaml: api_server.max_request_size: 10485760 (10MB).
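As a sketch, the relevant fragment of `bentoml_configuration.yaml` would look like this — verify the key against the configuration reference for your BentoML version:

```yaml
api_server:
  max_request_size: 10485760  # bytes (10 MB)
```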

Cold start takes too long

The first request pays one-time costs: CUDA context initialization, lazy weight loading, and kernel compilation. Add a warmup step in your service:

@svc.on_startup
async def warmup():
    dummy = torch.randn(1, 3, 224, 224)
    await runner.async_run(dummy)
    print("Model warmed up")