Ray Serve is the best way to serve ML models when you need autoscaling without wrestling with Kubernetes configs directly. Pair it with FastAPI and you get a production-grade inference endpoint that scales from one replica to dozens based on traffic, handles request batching out of the box, and gives you full control over the HTTP layer.

Here’s the minimal setup to get a Hugging Face sentiment analysis model running behind Ray Serve and FastAPI:

pip install "ray[serve]" fastapi uvicorn transformers torch
from ray import serve
from fastapi import FastAPI
from transformers import pipeline

app = FastAPI()

@serve.deployment(num_replicas=1)
@serve.ingress(app)
class SentimentService:
    def __init__(self):
        self.model = pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english")

    @app.post("/predict")
    async def predict(self, text: str):
        result = self.model(text)
        return {"label": result[0]["label"], "score": round(result[0]["score"], 4)}

    @app.get("/health")
    async def health(self):
        return {"status": "ok"}

serve.run(SentimentService.bind(), route_prefix="/")

Run this script and you have a live endpoint at http://localhost:8000/predict. One caveat: serve.run returns once the deployment is up rather than blocking, so keep the process alive (for example, end the script with input()) or launch the app through the serve run CLI instead. That’s it for the basics. Now let’s make it production-ready.

Configuring Autoscaling

Static replica counts waste resources. During off-peak hours you’re paying for idle GPUs; during traffic spikes your users hit timeouts. Ray Serve’s autoscaling_config fixes this by scaling replicas based on the number of in-flight requests per replica.

from ray import serve
from fastapi import FastAPI
from transformers import pipeline

app = FastAPI()

@serve.deployment(
    autoscaling_config={
        "min_replicas": 1,
        "max_replicas": 10,
        "target_ongoing_requests": 5,
        "upscale_delay_s": 10,
        "downscale_delay_s": 60,
    },
    ray_actor_options={"num_cpus": 1, "num_gpus": 0},
    max_ongoing_requests=10,
)
@serve.ingress(app)
class SentimentService:
    def __init__(self):
        self.model = pipeline(
            "sentiment-analysis",
            model="distilbert-base-uncased-finetuned-sst-2-english",
        )

    @app.post("/predict")
    async def predict(self, text: str):
        result = self.model(text)
        return {"label": result[0]["label"], "score": round(result[0]["score"], 4)}

    @app.get("/health")
    async def health(self):
        return {"status": "ok"}

serve.run(SentimentService.bind(), route_prefix="/")

Key parameters to understand:

  • target_ongoing_requests: The autoscaler tries to keep this many concurrent requests per replica. Set it lower for latency-sensitive workloads, higher for throughput-heavy ones. 5 is a solid starting point for transformer models on CPU.
  • upscale_delay_s: How long traffic must stay elevated before adding replicas. 10 seconds prevents flapping from short bursts.
  • downscale_delay_s: How long traffic must stay low before removing replicas. Set this higher (60s+) to avoid thrashing during intermittent traffic patterns.
  • max_ongoing_requests: The hard cap per replica. Requests beyond this get queued at the proxy level. Keep this at roughly 2x your target_ongoing_requests.
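To make the first bullet concrete: at steady state the autoscaler converges toward roughly the total number of in-flight requests divided by target_ongoing_requests, clamped to the configured bounds. Here is a back-of-the-envelope helper to reason about capacity (an illustration of the heuristic, not a Serve API; the real autoscaler also smooths over time windows and applies the upscale/downscale delays):

```python
import math

def desired_replicas(total_ongoing_requests: int,
                     target_ongoing_requests: int = 5,
                     min_replicas: int = 1,
                     max_replicas: int = 10) -> int:
    """Approximate the replica count the autoscaler converges toward."""
    raw = math.ceil(total_ongoing_requests / target_ongoing_requests)
    # Clamp to the configured bounds.
    return max(min_replicas, min(max_replicas, raw))

print(desired_replicas(3))    # 1  -> light traffic stays at min_replicas
print(desired_replicas(42))   # 9  -> ceil(42 / 5)
print(desired_replicas(500))  # 10 -> clamped at max_replicas
```

Plugging in your expected peak concurrency gives a sanity check on whether max_replicas is high enough before you find out in production.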

If you’re serving on GPU, set "num_gpus": 1 in ray_actor_options and Ray will schedule one replica per GPU automatically.
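You can also keep these knobs out of the code entirely and put them in a Serve config file deployed with the serve CLI. A sketch, assuming the script above is saved as app.py and exposes a bound app via sentiment_app = SentimentService.bind() instead of calling serve.run:

```yaml
# serve_config.yaml -- deploy with: serve deploy serve_config.yaml
# Assumes app.py defines: sentiment_app = SentimentService.bind()
applications:
  - name: sentiment
    route_prefix: /
    import_path: app:sentiment_app
    deployments:
      - name: SentimentService
        autoscaling_config:
          min_replicas: 1
          max_replicas: 10
          target_ongoing_requests: 5
          upscale_delay_s: 10
          downscale_delay_s: 60
        max_ongoing_requests: 10
        ray_actor_options:
          num_cpus: 1
          num_gpus: 0   # set to 1 for one replica per GPU
```

This keeps scaling policy in version-controlled config, so ops can tune replica bounds without touching model code.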

Request Batching for Throughput

Transformer models are much more efficient when you batch inputs together. A single forward pass on 16 inputs is faster than 16 individual passes. Ray Serve’s @serve.batch decorator collects incoming requests and groups them before calling your model.

from typing import List
from ray import serve
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline

app = FastAPI()

class PredictRequest(BaseModel):
    text: str

class PredictResponse(BaseModel):
    label: str
    score: float

@serve.deployment(
    autoscaling_config={
        "min_replicas": 1,
        "max_replicas": 10,
        "target_ongoing_requests": 5,
    },
    max_ongoing_requests=20,
)
@serve.ingress(app)
class SentimentService:
    def __init__(self):
        self.model = pipeline(
            "sentiment-analysis",
            model="distilbert-base-uncased-finetuned-sst-2-english",
        )

    @serve.batch(max_batch_size=16, batch_wait_timeout_s=0.1)
    async def _batched_predict(self, texts: List[str]) -> List[PredictResponse]:
        results = self.model(texts, batch_size=len(texts))
        return [
            PredictResponse(label=r["label"], score=round(r["score"], 4))
            for r in results
        ]

    @app.post("/predict", response_model=PredictResponse)
    async def predict(self, request: PredictRequest):
        return await self._batched_predict(request.text)

    @app.get("/health")
    async def health(self):
        return {"status": "ok"}

serve.run(SentimentService.bind(), route_prefix="/")

The @serve.batch decorator does the heavy lifting. Each individual call to _batched_predict passes a single str, but Ray collects up to 16 of them into a List[str] before invoking the method. The return list must match the input list length – Ray maps each result back to its original caller.

Two parameters matter here:

  • max_batch_size: Maximum inputs per batch. Match this to what your GPU memory allows. 16 is conservative for DistilBERT; you can go higher for smaller models.
  • batch_wait_timeout_s: How long to wait for a full batch before sending a partial one. 0.1 seconds keeps latency tight. Increase this if you care more about throughput than per-request latency.
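If you want intuition for what the decorator is doing, here is a toy batcher in plain asyncio. This is a simplified sketch of the idea, not Serve’s implementation; it ignores edge cases like cancelling a stale flush timer:

```python
import asyncio
from typing import Callable, List

class ToyBatcher:
    """Toy version of @serve.batch: collect single calls, run them as one batch."""

    def __init__(self, handler: Callable[[List[str]], List[str]],
                 max_batch_size: int = 16, batch_wait_timeout_s: float = 0.1):
        self.handler = handler
        self.max_batch_size = max_batch_size
        self.batch_wait_timeout_s = batch_wait_timeout_s
        self.pending = []   # list of (input, Future) pairs
        self.timer = None   # flush timer for a partially filled batch

    async def __call__(self, item: str) -> str:
        fut = asyncio.get_running_loop().create_future()
        self.pending.append((item, fut))
        if len(self.pending) >= self.max_batch_size:
            self._flush()                     # full batch: run immediately
        elif self.timer is None:
            self.timer = asyncio.create_task(self._flush_later())
        return await fut                      # resolves with this item's result

    async def _flush_later(self):
        await asyncio.sleep(self.batch_wait_timeout_s)
        self._flush()                         # timeout hit: run a partial batch

    def _flush(self):
        batch, self.pending, self.timer = self.pending, [], None
        if not batch:
            return
        results = self.handler([item for item, _ in batch])
        # One result per input, mapped back to the caller that submitted it.
        for (_, fut), result in zip(batch, results):
            fut.set_result(result)

async def demo():
    batcher = ToyBatcher(lambda texts: [t.upper() for t in texts],
                         max_batch_size=4, batch_wait_timeout_s=0.05)
    # Three concurrent "requests" end up in a single batch after the timeout.
    return await asyncio.gather(batcher("a"), batcher("b"), batcher("c"))

if __name__ == "__main__":
    print(asyncio.run(demo()))  # ['A', 'B', 'C']
```

The key mechanic is the same as in Serve: each caller awaits its own future, and the flush maps the i-th result back to the i-th queued caller.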

Health Checks and Graceful Shutdown

Production deployments need health checks for load balancers and orchestrators to route traffic correctly. You also need clean shutdown so in-flight requests finish before a replica goes away.

import asyncio
import logging
from contextlib import asynccontextmanager
from typing import List

from ray import serve
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline

logger = logging.getLogger("ray.serve")

@asynccontextmanager
async def lifespan(app: FastAPI):
    logger.info("SentimentService replica starting up")
    yield
    logger.info("SentimentService replica shutting down")

app = FastAPI(lifespan=lifespan)

class PredictRequest(BaseModel):
    text: str

class PredictResponse(BaseModel):
    label: str
    score: float

@serve.deployment(
    autoscaling_config={
        "min_replicas": 1,
        "max_replicas": 10,
        "target_ongoing_requests": 5,
    },
    max_ongoing_requests=20,
    graceful_shutdown_timeout_s=30,
    health_check_period_s=10,
    health_check_timeout_s=5,
)
@serve.ingress(app)
class SentimentService:
    def __init__(self):
        self.model = pipeline(
            "sentiment-analysis",
            model="distilbert-base-uncased-finetuned-sst-2-english",
        )
        self.ready = True

    def check_health(self):
        if not self.ready:
            raise RuntimeError("Model not ready")

    @serve.batch(max_batch_size=16, batch_wait_timeout_s=0.1)
    async def _batched_predict(self, texts: List[str]) -> List[PredictResponse]:
        results = self.model(texts, batch_size=len(texts))
        return [
            PredictResponse(label=r["label"], score=round(r["score"], 4))
            for r in results
        ]

    @app.post("/predict", response_model=PredictResponse)
    async def predict(self, request: PredictRequest):
        return await self._batched_predict(request.text)

    @app.get("/health")
    async def health(self):
        return {"status": "ok", "model_loaded": self.ready}

    @app.get("/ready")
    async def readiness(self):  # named readiness so it doesn't shadow self.ready
        if not self.ready:
            # A (dict, status_code) tuple would just be serialized as JSON;
            # JSONResponse is needed for the probe to see a real 503.
            from fastapi.responses import JSONResponse
            return JSONResponse(status_code=503, content={"status": "not_ready"})
        return {"status": "ready"}

serve.run(SentimentService.bind(), route_prefix="/")

Ray Serve has two layers of health checking:

  1. check_health() method: Ray calls this automatically at health_check_period_s intervals. If it raises an exception, Ray restarts the replica. Use this for internal checks like model state validation.
  2. HTTP health endpoints: Your load balancer or Kubernetes readiness probe hits /health or /ready. These are standard FastAPI routes you control entirely.

The graceful_shutdown_timeout_s parameter gives in-flight requests 30 seconds to complete before Ray forcefully kills the replica. Set this based on your worst-case inference time.
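If a Kubernetes load balancer or Service fronts the cluster, the two HTTP endpoints map directly onto probes. A sketch, assuming the Serve proxy listens on its default port 8000:

```yaml
# Pod spec fragment: wire /health and /ready into Kubernetes probes.
livenessProbe:
  httpGet:
    path: /health
    port: 8000
  periodSeconds: 15
  timeoutSeconds: 5
readinessProbe:
  httpGet:
    path: /ready
    port: 8000
  periodSeconds: 10
  failureThreshold: 3
```

With this wiring, a replica that reports not_ready is pulled out of rotation without being killed, while a failed liveness probe triggers a restart.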

Testing the Endpoint

Once your service is running, test it with curl:

# Single prediction
curl -X POST "http://localhost:8000/predict" \
  -H "Content-Type: application/json" \
  -d '{"text": "Ray Serve makes model deployment straightforward"}'

# Health check
curl http://localhost:8000/health

# Check Ray Serve status
serve status

Expected output from the predict endpoint:

{"label": "POSITIVE", "score": 0.9998}

You can also check autoscaling behavior by sending concurrent requests:

# Install hey for load testing
# go install github.com/rakyll/hey@latest

hey -n 200 -c 20 -m POST \
  -H "Content-Type: application/json" \
  -d '{"text": "testing autoscaling behavior"}' \
  http://localhost:8000/predict

Watch replicas scale up with serve status in another terminal. You should see the replica count climb as concurrent requests increase past the target_ongoing_requests threshold.

Common Errors and Fixes

RayServeException: Cannot call __init__ on a deployment handle

You called .bind() but passed arguments incorrectly. Make sure constructor arguments go inside .bind(arg1, arg2), not in the decorator.

TypeError: check_health() must be a sync function

The check_health method cannot be async. Ray calls it in a synchronous context. Remove async from the method definition.

ValueError: Batch size mismatch - expected N results, got M

Your @serve.batch method must return exactly one result per input. If your model returns a different number of outputs than inputs, you have a bug in your batching logic. Double-check that len(results) == len(texts) before returning.
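One way to catch this early is to check the invariant before returning from the batched method. validate_batch here is a hypothetical helper, not part of Serve:

```python
from typing import List, TypeVar

T = TypeVar("T")

def validate_batch(inputs: List[str], outputs: List[T]) -> List[T]:
    """Raise early if a batched handler broke the one-output-per-input contract."""
    if len(outputs) != len(inputs):
        raise ValueError(
            f"Batch size mismatch - expected {len(inputs)} results, got {len(outputs)}"
        )
    return outputs

# Example: a buggy handler that silently drops empty strings before inference.
texts = ["great", "", "terrible"]
results = [t.upper() for t in texts if t]   # 2 outputs for 3 inputs
try:
    validate_batch(texts, results)
except ValueError as e:
    print(e)  # Batch size mismatch - expected 3 results, got 2
```

Wrapping the return statement of _batched_predict this way turns a confusing mid-request failure into an immediate, pointed error.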

RuntimeError: No available node to schedule this deployment

You requested more resources (GPUs/CPUs) than your Ray cluster has. Either reduce max_replicas, reduce num_gpus in ray_actor_options, or add more nodes to the cluster with ray start --address=<head-node>:6379.

Replicas not scaling down after traffic drops

This is usually downscale_delay_s doing its job. The default is 600 seconds (10 minutes). If you want faster downscaling, reduce this value. Setting it below 30 seconds risks thrashing in bursty traffic patterns.

ImportError: cannot import name 'serve' from 'ray'

You need the serve extra: pip install "ray[serve]". The base ray package does not include Ray Serve.