Running inference on a single machine works until it doesn’t. Once you need multiple models, autoscaling, and container isolation, Ray Serve gives you a cluster-native serving layer that handles all of it without a custom orchestrator. You get replica management, traffic routing, and health checks built in – then wrap the whole thing in Docker for reproducible deployments.

Install the dependencies first:

pip install "ray[serve]" transformers torch requests

Here’s a single-model deployment to start from:

from ray import serve
from transformers import pipeline


@serve.deployment(
    num_replicas=2,
    ray_actor_options={"num_cpus": 1},
)
class TextClassifier:
    def __init__(self):
        self.model = pipeline(
            "sentiment-analysis",
            model="distilbert-base-uncased-finetuned-sst-2-english",
        )

    async def __call__(self, request):
        data = await request.json()
        result = self.model(data["text"])
        return {"label": result[0]["label"], "score": round(result[0]["score"], 4)}


app = TextClassifier.bind()

if __name__ == "__main__":
    serve.run(app, route_prefix="/classify", blocking=True)

Save that as serve_app.py and run python serve_app.py. The blocking=True keeps the process alive after deployment, and the __main__ guard lets other tools import the module without triggering a second deployment. You now have a sentiment classifier at http://localhost:8000/classify with two replicas load-balanced automatically; POST JSON like {"text": "Ray Serve is great"} to try it.

Configuring Autoscaling

Static replica counts are wasteful. You want replicas that scale with traffic. Ray Serve’s autoscaling_config scales based on the number of queued and in-flight requests per replica.

from ray import serve
from transformers import pipeline


@serve.deployment(
    autoscaling_config={
        "min_replicas": 1,
        "max_replicas": 10,
        "target_ongoing_requests": 5,
        "upscale_delay_s": 10,
        "downscale_delay_s": 60,
    },
    ray_actor_options={"num_cpus": 1},
)
class TextClassifier:
    def __init__(self):
        self.model = pipeline(
            "sentiment-analysis",
            model="distilbert-base-uncased-finetuned-sst-2-english",
        )

    async def __call__(self, request):
        data = await request.json()
        result = self.model(data["text"])
        return {"label": result[0]["label"], "score": round(result[0]["score"], 4)}


app = TextClassifier.bind()

if __name__ == "__main__":
    serve.run(app, route_prefix="/classify", blocking=True)

The key parameters: target_ongoing_requests is the threshold per replica. When average in-flight requests exceed 5, Ray spins up new replicas until hitting max_replicas. The delay parameters prevent flapping – upscale_delay_s waits 10 seconds of sustained load before adding replicas, and downscale_delay_s waits a full minute of reduced load before removing them. This keeps your cluster stable during bursty traffic patterns.
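The scaling decision itself is essentially proportional control: enough replicas that each carries roughly the target load. A simplified sketch of the rule (the real controller also applies the delay windows and metric smoothing; the function name here is mine, not a Ray API):

```python
import math

def desired_replicas(total_ongoing, target_per_replica, min_replicas, max_replicas):
    """Size a deployment so each replica carries about target_per_replica
    ongoing requests, clamped to the configured bounds."""
    raw = math.ceil(total_ongoing / target_per_replica)
    return max(min_replicas, min(max_replicas, raw))

print(desired_replicas(42, 5, 1, 10))   # 42 in-flight requests -> 9 replicas
print(desired_replicas(2, 5, 1, 10))    # light load -> floor at min_replicas (1)
print(desired_replicas(500, 5, 1, 10))  # burst -> capped at max_replicas (10)
```

The delay parameters then gate when this target is acted on: the replica count only moves after the computed value has disagreed with the current count for the configured window.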

Serving Multiple Models Behind One Endpoint

The real power of Ray Serve is model composition: you can route requests to different models from a single ingress deployment, which means one endpoint URL and one load balancer.

from ray import serve
from transformers import pipeline
from starlette.requests import Request


@serve.deployment(
    autoscaling_config={"min_replicas": 1, "max_replicas": 5, "target_ongoing_requests": 3},
    ray_actor_options={"num_cpus": 1},
)
class SentimentModel:
    def __init__(self):
        self.model = pipeline(
            "sentiment-analysis",
            model="distilbert-base-uncased-finetuned-sst-2-english",
        )

    async def __call__(self, text: str):
        result = self.model(text)
        return {"model": "sentiment", "label": result[0]["label"], "score": round(result[0]["score"], 4)}


@serve.deployment(
    autoscaling_config={"min_replicas": 1, "max_replicas": 5, "target_ongoing_requests": 3},
    ray_actor_options={"num_cpus": 1},
)
class SummarizationModel:
    def __init__(self):
        self.model = pipeline(
            "summarization",
            model="sshleifer/distilbart-cnn-12-6",
        )

    async def __call__(self, text: str):
        result = self.model(text, max_length=80, min_length=20)
        return {"model": "summarization", "summary": result[0]["summary_text"]}


@serve.deployment(num_replicas=1)
class Router:
    def __init__(self, sentiment_handle, summarization_handle):
        self.models = {
            "sentiment": sentiment_handle,
            "summarization": summarization_handle,
        }

    async def __call__(self, request: Request):
        data = await request.json()
        model_name = data.get("model", "sentiment")
        text = data.get("text", "")

        if model_name not in self.models:
            return {"error": f"Unknown model: {model_name}. Available: {list(self.models.keys())}"}

        handle = self.models[model_name]
        result = await handle.remote(text)
        return result


sentiment = SentimentModel.bind()
summarization = SummarizationModel.bind()
app = Router.bind(sentiment, summarization)

if __name__ == "__main__":
    serve.run(app, route_prefix="/predict", blocking=True)

Now a single POST to /predict with {"model": "sentiment", "text": "Ray Serve is great"} routes to the sentiment model, and {"model": "summarization", "text": "..."} routes to the summarizer. Each model scales independently. The Router deployment acts as the ingress and forwards calls through Ray Serve’s internal handle system, which handles serialization, load balancing, and retry logic transparently.
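Stripped of Ray, the Router boils down to dictionary dispatch over async callables. Here is the same pattern in plain asyncio (the handler functions are stand-ins, not the real models; a real handle call would additionally pick a replica and serialize arguments):

```python
import asyncio

async def sentiment(text: str):
    # Stand-in for a SentimentModel replica call.
    return {"model": "sentiment", "text": text}

async def summarization(text: str):
    # Stand-in for a SummarizationModel replica call.
    return {"model": "summarization", "text": text}

MODELS = {"sentiment": sentiment, "summarization": summarization}

async def route(payload: dict):
    name = payload.get("model", "sentiment")
    if name not in MODELS:
        return {"error": f"Unknown model: {name}. Available: {list(MODELS)}"}
    return await MODELS[name](payload.get("text", ""))

print(asyncio.run(route({"model": "summarization", "text": "Ray Serve is great"})))
# {'model': 'summarization', 'text': 'Ray Serve is great'}
```

In the real deployment, `await handle.remote(text)` plays the role of `await MODELS[name](...)`, with load balancing across that model's replicas underneath.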

Containerizing with Docker

For production, wrap the whole cluster in a Docker container. This Dockerfile installs Ray, downloads models at build time (so cold starts don’t hit model registries), and exposes the serve endpoint.

FROM python:3.11-slim

WORKDIR /app

# requests is used by the HEALTHCHECK below
RUN pip install --no-cache-dir "ray[serve]" transformers torch requests --extra-index-url https://download.pytorch.org/whl/cpu

# Pre-download models during build to avoid runtime downloads
RUN python -c "from transformers import pipeline; \
    pipeline('sentiment-analysis', model='distilbert-base-uncased-finetuned-sst-2-english'); \
    pipeline('summarization', model='sshleifer/distilbart-cnn-12-6')"

COPY serve_app.py .

EXPOSE 8000

HEALTHCHECK --interval=30s --timeout=10s --start-period=60s --retries=3 \
    CMD python -c "import requests; r = requests.get('http://localhost:8000/-/healthz'); r.raise_for_status()"

CMD ["serve", "run", "serve_app:app", "--host", "0.0.0.0", "--port", "8000"]

Build and run it:

docker build -t ray-serve-cluster .
docker run -d --name ray-serve -p 8000:8000 ray-serve-cluster

The HEALTHCHECK directive uses Ray Serve’s built-in health endpoint at /-/healthz. Docker will restart the container if the health check fails three times consecutively. The --start-period=60s gives the models time to load before health checks start counting failures.
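If you run the image under Docker Compose, the same policy can live in the compose file, with a restart policy so failed containers are replaced. A sketch, assuming the ray-serve-cluster image built above:

```yaml
services:
  ray-serve:
    image: ray-serve-cluster
    ports:
      - "8000:8000"
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "python", "-c", "import requests; requests.get('http://localhost:8000/-/healthz').raise_for_status()"]
      interval: 30s
      timeout: 10s
      start_period: 60s
      retries: 3
```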

Adding Health Checks and Monitoring

Ray Serve exposes a health check mechanism per deployment through the check_health method. If a replica’s health check fails, Ray automatically restarts it.

from ray import serve
from transformers import pipeline


@serve.deployment(
    autoscaling_config={"min_replicas": 1, "max_replicas": 5, "target_ongoing_requests": 5},
    health_check_period_s=15,
    health_check_timeout_s=10,
    ray_actor_options={"num_cpus": 1},
)
class MonitoredClassifier:
    def __init__(self):
        self.model = pipeline(
            "sentiment-analysis",
            model="distilbert-base-uncased-finetuned-sst-2-english",
        )
        self.request_count = 0
        self.error_count = 0

    def check_health(self):
        """Ray Serve calls this periodically. Raise an exception to trigger replica restart."""
        if self.error_count > 100:
            raise RuntimeError(f"Too many errors: {self.error_count}. Restarting replica.")
        # Verify model is still loaded and functional
        test_result = self.model("health check")
        if not test_result:
            raise RuntimeError("Model returned empty result during health check.")

    async def __call__(self, request):
        self.request_count += 1
        try:
            data = await request.json()
            result = self.model(data["text"])
            return {
                "label": result[0]["label"],
                "score": round(result[0]["score"], 4),
                "replica_stats": {
                    "total_requests": self.request_count,
                    "errors": self.error_count,
                },
            }
        except Exception as e:
            self.error_count += 1
            return {"error": str(e)}


app = MonitoredClassifier.bind()

if __name__ == "__main__":
    serve.run(app, route_prefix="/classify", blocking=True)

The health_check_period_s=15 runs the check every 15 seconds per replica. If check_health raises any exception, Ray marks that replica as unhealthy and restarts it. The timeout prevents a hung health check from blocking the system. This gives you self-healing behavior without an external watchdog – replicas that accumulate too many errors or lose their model state get recycled automatically.

To query the built-in metrics endpoint for Prometheus scraping:

# Ray Serve exports Prometheus metrics on port 8080 by default
curl http://localhost:8080/metrics | grep ray_serve

You’ll get counters for request latency, queue depth, replica count, and error rates per deployment – everything you need to wire into Grafana or your existing monitoring stack.
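To actually collect them, point Prometheus at that port. A minimal scrape-config sketch (job name and interval are arbitrary choices):

```yaml
scrape_configs:
  - job_name: ray-serve
    scrape_interval: 15s
    static_configs:
      - targets: ["localhost:8080"]
```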

Common Errors and Fixes

RayServeException: Cannot call .remote() on a DeploymentHandle that is not running – This happens when you try to call a handle before serve.run() completes. Make sure all .bind() calls happen before serve.run(), and don’t call handles from module-level code. Wrap handle calls inside deployment methods.

RuntimeError: No available replicas for deployment – Your autoscaling min is set to 0 and traffic arrived before a replica spun up, or all replicas are unhealthy. Set min_replicas: 1 for any deployment that needs to handle requests without cold start delays. If replicas are crashing, check check_health logs with ray logs serve/.

Container health check keeps failing – The --start-period in the Dockerfile HEALTHCHECK might be too short for model loading. Large models like distilbart-cnn-12-6 can take 30-60 seconds to download and load. Increase --start-period to 120s for larger models, or pre-download them during the Docker build step (as shown in the Dockerfile above).

Address already in use when starting Ray Serve – Another Ray or Serve process is already bound to port 8000 or 6379. Kill it with ray stop --force before restarting. Inside Docker, make sure you aren’t running multiple serve run processes.

Memory issues with multiple models – Each model replica loads its own copy of the model weights. Two replicas of a 500MB model use 1GB of RAM. Set max_replicas conservatively and use ray_actor_options={"memory": 1e9} to give Ray accurate resource information so it doesn’t over-schedule replicas on a single node.
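Before setting those options, it helps to do the arithmetic explicitly. A back-of-envelope helper (the 500MB per-process overhead figure is a rough assumption for the Python runtime, tokenizer, and inference buffers, not a measured number):

```python
def replica_memory_bytes(model_mb: float, overhead_mb: float = 500) -> int:
    """Per-replica reservation in bytes, suitable for
    ray_actor_options={"memory": ...}: weights plus process overhead."""
    return int((model_mb + overhead_mb) * 1_000_000)

per_replica = replica_memory_bytes(500)  # 500MB model
print(per_replica)                        # 1000000000 -> pass as "memory"
print(10 * per_replica / 1e9)             # worst case at max_replicas=10: 10.0 GB
```

Multiplying the per-replica figure by max_replicas tells you the worst-case footprint autoscaling can reach, which is the number to check against your node's RAM.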