Ray Serve is the best way to serve ML models when you need autoscaling without wrestling with Kubernetes configs directly. Pair it with FastAPI and you get a production-grade inference endpoint that scales from one replica to dozens based on traffic, handles request batching out of the box, and gives you full control over the HTTP layer.
Here’s the minimal setup to get a Hugging Face sentiment analysis model running behind Ray Serve and FastAPI:
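A minimal sketch of that setup follows. The model name, route, and `PredictRequest` schema are illustrative choices for this guide, not anything Ray Serve prescribes:

```python
from fastapi import FastAPI
from pydantic import BaseModel
from ray import serve
from transformers import pipeline

app = FastAPI()

class PredictRequest(BaseModel):
    text: str

@serve.deployment
@serve.ingress(app)
class SentimentService:
    def __init__(self):
        # Load the model once per replica, not once per request
        self.model = pipeline(
            "sentiment-analysis",
            model="distilbert-base-uncased-finetuned-sst-2-english",
        )

    @app.post("/predict")
    def predict(self, request: PredictRequest):
        result = self.model(request.text)[0]
        return {"label": result["label"], "score": result["score"]}

sentiment_app = SentimentService.bind()

if __name__ == "__main__":
    serve.run(sentiment_app)  # deploys the app on http://localhost:8000
    import time
    while True:
        time.sleep(10)        # keep the driver process alive
```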
Run this script and you have a live endpoint at http://localhost:8000/predict. That’s it for the basics. Now let’s make it production-ready.
Configuring Autoscaling
Static replica counts waste resources. During off-peak hours you’re paying for idle GPUs; during traffic spikes your users hit timeouts. Ray Serve’s autoscaling_config fixes this by scaling replicas based on the number of in-flight requests per replica.
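A configuration sketch, applied to the deployment class from the basic setup. The replica bounds and delay values here are starting points matching the guidance below, not prescriptions:

```python
@serve.deployment(
    autoscaling_config={
        "min_replicas": 1,
        "max_replicas": 20,
        "target_ongoing_requests": 5,  # concurrent requests per replica
        "upscale_delay_s": 10,         # sustained load before scaling up
        "downscale_delay_s": 60,       # sustained quiet before scaling down
    },
    max_ongoing_requests=10,  # hard cap, roughly 2x the target
    ray_actor_options={"num_cpus": 1},
)
@serve.ingress(app)
class SentimentService:
    ...
```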
Key parameters to understand:
- target_ongoing_requests: The autoscaler tries to keep this many concurrent requests per replica. Set it lower for latency-sensitive workloads, higher for throughput-heavy ones. 5 is a solid starting point for transformer models on CPU.
- upscale_delay_s: How long traffic must stay elevated before adding replicas. 10 seconds prevents flapping from short bursts.
- downscale_delay_s: How long traffic must stay low before removing replicas. Set this higher (60s+) to avoid thrashing during intermittent traffic patterns.
- max_ongoing_requests: The hard cap per replica. Requests beyond this get queued at the proxy level. Keep this at roughly 2x your target_ongoing_requests.
If you’re serving on GPU, set "num_gpus": 1 in ray_actor_options and Ray will schedule one replica per GPU automatically.
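As a fragment sketch (the replica bounds are illustrative):

```python
# With "num_gpus": 1, Ray schedules exactly one replica per available GPU
@serve.deployment(
    autoscaling_config={"min_replicas": 1, "max_replicas": 4},
    ray_actor_options={"num_gpus": 1},
)
@serve.ingress(app)
class SentimentService:
    ...
```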
Request Batching for Throughput
Transformer models are much more efficient when you batch inputs together. A single forward pass on 16 inputs is faster than 16 individual passes. Ray Serve’s @serve.batch decorator collects incoming requests and groups them before calling your model.
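A batching sketch follows, self-contained so the moving parts are visible. The model name and the `_batched_predict` helper name are choices made for this guide:

```python
from typing import List

from fastapi import FastAPI
from pydantic import BaseModel
from ray import serve
from transformers import pipeline

app = FastAPI()

class PredictRequest(BaseModel):
    text: str

@serve.deployment
@serve.ingress(app)
class SentimentService:
    def __init__(self):
        self.model = pipeline(
            "sentiment-analysis",
            model="distilbert-base-uncased-finetuned-sst-2-english",
        )

    @serve.batch(max_batch_size=16, batch_wait_timeout_s=0.1)
    async def _batched_predict(self, texts: List[str]) -> List[dict]:
        # Ray has collected up to 16 single-text calls into one list
        results = self.model(texts)  # one forward pass for the whole batch
        return [{"label": r["label"], "score": r["score"]} for r in results]

    @app.post("/predict")
    async def predict(self, request: PredictRequest):
        # Each caller passes one string; Ray batches across callers
        return await self._batched_predict(request.text)
```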
The @serve.batch decorator does the heavy lifting. Each individual call to _batched_predict passes a single str, but Ray collects up to 16 of them into a List[str] before invoking the method. The return list must match the input list length – Ray maps each result back to its original caller.
Two parameters matter here:
- max_batch_size: Maximum inputs per batch. Match this to what your GPU memory allows. 16 is conservative for DistilBERT; you can go higher for smaller models.
- batch_wait_timeout_s: How long to wait for a full batch before sending a partial one. 0.1 seconds keeps latency tight. Increase this if you care more about throughput than per-request latency.
Health Checks and Graceful Shutdown
Production deployments need health checks for load balancers and orchestrators to route traffic correctly. You also need clean shutdown so in-flight requests finish before a replica goes away.
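A sketch of both pieces, applied to the deployment from earlier. The timeout values and the `/health` and `/ready` route names are conventions assumed here, not requirements:

```python
@serve.deployment(
    health_check_period_s=10,        # how often Ray calls check_health()
    health_check_timeout_s=30,       # fail the check if it hangs this long
    graceful_shutdown_timeout_s=30,  # let in-flight requests drain
)
@serve.ingress(app)
class SentimentService:
    def __init__(self):
        self.model = pipeline(
            "sentiment-analysis",
            model="distilbert-base-uncased-finetuned-sst-2-english",
        )

    def check_health(self):
        # Ray restarts this replica if the check raises
        if self.model is None:
            raise RuntimeError("model is not loaded")

    @app.get("/health")
    def health(self):
        # Liveness probe for load balancers / Kubernetes
        return {"status": "ok"}

    @app.get("/ready")
    def ready(self):
        # Readiness probe: replica is constructed, so the model is loaded
        return {"status": "ready"}
```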
Ray Serve has two layers of health checking:
- check_health() method: Ray calls this automatically at health_check_period_s intervals. If it raises an exception, Ray restarts the replica. Use this for internal checks like model state validation.
- HTTP health endpoints: Your load balancer or Kubernetes readiness probe hits /health or /ready. These are standard FastAPI routes you control entirely.
The graceful_shutdown_timeout_s parameter gives in-flight requests 30 seconds to complete before Ray forcefully kills the replica. Set this based on your worst-case inference time.
Testing the Endpoint
Once your service is running, test it with curl:
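For example (the `text` field assumes the request schema from the basic setup):

```shell
curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{"text": "Ray Serve makes scaling painless"}'
```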
Expected output from the predict endpoint:
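A representative response for a positive input looks like this; the exact score depends on the model and input text:

```json
{"label": "POSITIVE", "score": 0.9997}
```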
You can also check autoscaling behavior by sending concurrent requests:
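One simple way, sketched with a shell loop (50 is an arbitrary count, just enough to exceed the per-replica target):

```shell
# Fire 50 concurrent requests to push past target_ongoing_requests
for i in $(seq 1 50); do
  curl -s -o /dev/null -X POST http://localhost:8000/predict \
    -H "Content-Type: application/json" \
    -d '{"text": "load test"}' &
done
wait
```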
Watch replicas scale up with serve status in another terminal. You should see the replica count climb as concurrent requests increase past the target_ongoing_requests threshold.
Common Errors and Fixes
RayServeException: Cannot call __init__ on a deployment handle
You called .bind() but passed arguments incorrectly. Make sure constructor arguments go inside .bind(arg1, arg2), not in the decorator.
TypeError: check_health() must be a sync function
The check_health method cannot be async. Ray calls it in a synchronous context. Remove async from the method definition.
ValueError: Batch size mismatch - expected N results, got M
Your @serve.batch method must return exactly one result per input. If your model returns a different number of outputs than inputs, you have a bug in your batching logic. Double-check that len(results) == len(texts) before returning.
RuntimeError: No available node to schedule this deployment
You requested more resources (GPUs/CPUs) than your Ray cluster has. Either reduce max_replicas, reduce num_gpus in ray_actor_options, or add more nodes to the cluster with ray start --address=<head-node>:6379.
Replicas not scaling down after traffic drops
This is usually downscale_delay_s doing its job. The default is 600 seconds (10 minutes). If you want faster downscaling, reduce this value. Setting it below 30 seconds risks thrashing in bursty traffic patterns.
ImportError: cannot import name 'serve' from 'ray'
You need the serve extra: pip install "ray[serve]". The base ray package does not include Ray Serve.
Related Guides
- How to Build a Model Input Validation Pipeline with Pydantic and FastAPI
- How to Serve ML Models with BentoML and Build Prediction APIs
- How to Build a Shadow Deployment Pipeline for ML Models
- How to Build a Model Feature Store Pipeline with Redis and FastAPI
- How to Build a Model Rollback Pipeline with Health Checks
- How to Build a Model Metadata Store with SQLite and FastAPI
- How to Build Blue-Green Deployments for ML Models
- How to Build a Model Dependency Scanner and Vulnerability Checker
- How to Build a Model Health Dashboard with FastAPI and SQLite
- How to Build a Model Batch Inference Pipeline with Ray and Parquet