Running inference on a single machine works until it doesn’t. Once you need multiple models, autoscaling, and container isolation, Ray Serve gives you a cluster-native serving layer that handles all of it without a custom orchestrator. You get replica management, traffic routing, and health checks built in – then wrap the whole thing in Docker for reproducible deployments.
Install the dependencies first:
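Something like the following works (version pins omitted; transformers and torch are only needed for the example models used below):

```shell
pip install "ray[serve]" transformers torch
```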
Here’s a single-model deployment to start from:
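A minimal sketch of what that file might look like, using Ray Serve's FastAPI ingress. The pipeline choice and payload shape are assumptions – any transformers pipeline slots in the same way:

```python
# serve_app.py -- single-model deployment with two replicas
from fastapi import FastAPI
from ray import serve

fastapi_app = FastAPI()

@serve.deployment(num_replicas=2)
@serve.ingress(fastapi_app)
class SentimentClassifier:
    def __init__(self):
        # Load the model once per replica, not per request
        from transformers import pipeline
        self.model = pipeline("sentiment-analysis")

    @fastapi_app.post("/classify")
    def classify(self, payload: dict) -> dict:
        # {"text": "..."} -> {"label": "POSITIVE", "score": 0.99...}
        return self.model(payload["text"])[0]

app = SentimentClassifier.bind()
```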
Save that as serve_app.py and start it with serve run serve_app:app. You now have a sentiment classifier at http://localhost:8000/classify, load-balanced across two replicas automatically.
Configuring Autoscaling
Static replica counts are wasteful. You want replicas that scale with traffic. Ray Serve’s autoscaling_config scales based on the number of queued and in-flight requests per replica.
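One way to express that configuration on the classifier deployment (the replica bounds here are assumptions – size them to your cluster):

```python
from ray import serve

# num_replicas and autoscaling_config are mutually exclusive:
# the autoscaler owns the replica count
@serve.deployment(
    autoscaling_config={
        "min_replicas": 1,             # never scale to zero
        "max_replicas": 8,             # hard ceiling
        "target_ongoing_requests": 5,  # avg in-flight requests per replica
        "upscale_delay_s": 10,         # sustained load before adding replicas
        "downscale_delay_s": 60,       # sustained quiet before removing them
    },
)
class SentimentClassifier:
    ...
```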
The key parameters: target_ongoing_requests is the threshold per replica. When average in-flight requests exceed 5, Ray spins up new replicas until hitting max_replicas. The delay parameters prevent flapping – upscale_delay_s waits 10 seconds of sustained load before adding replicas, and downscale_delay_s waits a full minute of reduced load before removing them. This keeps your cluster stable during bursty traffic patterns.
Serving Multiple Models Behind One Endpoint
The real power of Ray Serve is model multiplexing. You can route requests to different models from a single ingress deployment, which means one endpoint URL and one load balancer.
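A sketch of that pattern with two models behind one router. The model choices and the predict method name are assumptions; the key pieces are the DeploymentHandle arguments to the ingress and the .bind() composition at the bottom:

```python
from fastapi import FastAPI
from ray import serve
from ray.serve.handle import DeploymentHandle

fastapi_app = FastAPI()

@serve.deployment
class SentimentModel:
    def __init__(self):
        from transformers import pipeline
        self.model = pipeline("sentiment-analysis")

    def predict(self, text: str) -> dict:
        return self.model(text)[0]

@serve.deployment
class SummarizationModel:
    def __init__(self):
        from transformers import pipeline
        self.model = pipeline("summarization",
                              model="sshleifer/distilbart-cnn-12-6")

    def predict(self, text: str) -> dict:
        return self.model(text)[0]

@serve.deployment
@serve.ingress(fastapi_app)
class Router:
    def __init__(self, sentiment: DeploymentHandle,
                 summarization: DeploymentHandle):
        self.handles = {"sentiment": sentiment,
                        "summarization": summarization}

    @fastapi_app.post("/predict")
    async def predict(self, payload: dict):
        # Route on the "model" field; the handle call is load-balanced
        # across that model's replicas by Ray Serve
        handle = self.handles[payload["model"]]
        return await handle.predict.remote(payload["text"])

app = Router.bind(SentimentModel.bind(), SummarizationModel.bind())
```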
Now a single POST to /predict with {"model": "sentiment", "text": "Ray Serve is great"} routes to the sentiment model, and {"model": "summarization", "text": "..."} routes to the summarizer. Each model scales independently. The Router deployment acts as the ingress and forwards calls through Ray Serve’s internal handle system, which handles serialization, load balancing, and retry logic transparently.
Containerizing with Docker
For production, wrap the whole cluster in a Docker container. This Dockerfile installs Ray, downloads models at build time (so cold starts don’t hit model registries), and exposes the serve endpoint.
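A sketch of such a Dockerfile. The base-image tag, model names, and serve run flags are assumptions to adapt to your Ray version:

```dockerfile
# Base image tag is an assumption; match it to your Ray version
FROM rayproject/ray:2.9.0-py310

RUN pip install --no-cache-dir "ray[serve]" transformers torch

# Download model weights at build time so cold starts skip the registry
RUN python -c "from transformers import pipeline; \
    pipeline('sentiment-analysis'); \
    pipeline('summarization', model='sshleifer/distilbart-cnn-12-6')"

WORKDIR /app
COPY serve_app.py .

EXPOSE 8000

# Probe Ray Serve's proxy health endpoint; python avoids a curl dependency
HEALTHCHECK --interval=30s --timeout=5s --start-period=60s --retries=3 \
    CMD python -c "import urllib.request; \
    urllib.request.urlopen('http://localhost:8000/-/healthz')"

CMD ["serve", "run", "--host", "0.0.0.0", "serve_app:app"]
```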
Build and run it:
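Assuming the Dockerfile sits next to serve_app.py (the image tag is arbitrary):

```shell
docker build -t ray-serve-app .
docker run --rm -p 8000:8000 ray-serve-app
```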
The HEALTHCHECK directive uses Ray Serve’s built-in health endpoint at /-/healthz. Docker will restart the container if the health check fails three times consecutively. The --start-period=60s gives the models time to load before health checks start counting failures.
Adding Health Checks and Monitoring
Ray Serve exposes a health check mechanism per deployment through the check_health method. If a replica’s health check fails, Ray automatically restarts it.
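A sketch of a deployment-level health check. The error counter and its threshold are assumptions – check_health can apply whatever liveness criteria make sense for your model:

```python
from ray import serve

@serve.deployment(
    health_check_period_s=15,   # run check_health every 15s per replica
    health_check_timeout_s=10,  # a hung check counts as a failure after 10s
)
class SentimentClassifier:
    def __init__(self):
        from transformers import pipeline
        self.model = pipeline("sentiment-analysis")
        self.consecutive_errors = 0

    async def __call__(self, request):
        try:
            payload = await request.json()
            result = self.model(payload["text"])[0]
            self.consecutive_errors = 0
            return result
        except Exception:
            self.consecutive_errors += 1
            raise

    def check_health(self):
        # Raising here marks the replica unhealthy; Ray restarts it
        if self.model is None:
            raise RuntimeError("model not loaded")
        if self.consecutive_errors >= 10:
            raise RuntimeError("too many consecutive inference errors")

app = SentimentClassifier.bind()
```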
The health_check_period_s=15 runs the check every 15 seconds per replica. If check_health raises any exception, Ray marks that replica as unhealthy and restarts it. The timeout prevents a hung health check from blocking the system. This gives you self-healing behavior without an external watchdog – replicas that accumulate too many errors or lose their model state get recycled automatically.
To query the built-in metrics endpoint for Prometheus scraping:
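Ray serves Prometheus-format metrics from each node's metrics agent; the port below assumes the cluster was started with an explicit --metrics-export-port:

```shell
ray start --head --metrics-export-port=8080
curl http://localhost:8080/metrics
```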
You’ll get counters for request latency, queue depth, replica count, and error rates per deployment – everything you need to wire into Grafana or your existing monitoring stack.
Common Errors and Fixes
RayServeException: Cannot call .remote() on a DeploymentHandle that is not running – This happens when you try to call a handle before serve.run() completes. Make sure all .bind() calls happen before serve.run(), and don’t call handles from module-level code. Wrap handle calls inside deployment methods.
RuntimeError: No available replicas for deployment – Your autoscaling min is set to 0 and traffic arrived before a replica spun up, or all replicas are unhealthy. Set min_replicas: 1 for any deployment that needs to handle requests without cold start delays. If replicas are crashing, check check_health logs with ray logs serve/.
Container health check keeps failing – The --start-period in the Dockerfile HEALTHCHECK might be too short for model loading. Large models like distilbart-cnn-12-6 can take 30-60 seconds to download and load. Increase --start-period to 120s for larger models, or pre-download them during the Docker build step (as shown in the Dockerfile above).
Address already in use when starting Ray Serve – Another Ray or Serve process is already bound to port 8000 or 6379. Kill it with ray stop --force before restarting. Inside Docker, make sure you aren’t running multiple serve run processes.
Memory issues with multiple models – Each model replica loads its own copy of the model weights, so two replicas of a 500MB model use 1GB of RAM. Set max_replicas conservatively and use ray_actor_options={"memory": 1e9} to give Ray accurate resource information so it doesn’t over-schedule replicas on a single node.
Related Guides
- How to Build a Model Serving Autoscaler with Custom Metrics and Kubernetes
- How to Build a Model Serving Cost Dashboard with Prometheus and Grafana
- How to Optimize Docker Images for ML Model Serving
- How to Build a Model Registry with S3 and DynamoDB
- How to Build a Model Artifact CDN with CloudFront and S3
- How to Build a Model Inference Queue with Celery and Redis
- How to Build a Model Serving Gateway with Envoy and gRPC
- How to Build a Model Inference Cost Tracking Pipeline with OpenTelemetry
- How to Build a Model Artifact Garbage Collection Pipeline with S3 Lifecycle Rules
- How to Scale ML Training and Inference with Ray