Batch inference is the workhorse of production ML. You have millions of rows sitting in Parquet files, a trained model, and you need predictions on all of them. Running that sequentially on a single core takes hours. Ray Data makes it take minutes by streaming the data through your model across as many workers as you have available.
Here’s the fastest path from Parquet files to prediction columns:
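A sketch of that pipeline. The bucket paths, the "text" column, and the use of a HuggingFace sentiment-analysis pipeline are illustrative assumptions, not fixed requirements:

```python
class SentimentPredictor:
    def __init__(self):
        # Runs once per worker replica, not once per batch.
        from transformers import pipeline  # assumes transformers is installed
        self.pipe = pipeline("sentiment-analysis")

    def __call__(self, batch):
        # batch arrives as a pandas DataFrame; assumes a "text" column.
        preds = self.pipe(batch["text"].tolist())
        batch["label"] = [p["label"] for p in preds]
        batch["score"] = [p["score"] for p in preds]
        return batch

if __name__ == "__main__":
    import ray

    ds = ray.data.read_parquet("s3://my-bucket/reviews/")  # illustrative path
    predictions = ds.map_batches(
        SentimentPredictor,
        concurrency=4,          # 4 worker replicas
        batch_size=32,
        batch_format="pandas",
    )
    predictions.write_parquet("s3://my-bucket/predictions/")  # illustrative path
```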
That reads Parquet, fans the data out to 4 workers each running a sentiment classifier, and writes the results back to Parquet with two new columns: label and score. The entire pipeline streams – Ray never loads the full dataset into memory at once.
Reading Large Datasets from Parquet
Ray Data reads Parquet files lazily. It scans the file metadata first, figures out the row groups, and streams blocks through your pipeline on demand. This is the key difference from loading everything into a Pandas DataFrame – you can process terabytes without OOM-killing your machine.
For a local directory of Parquet files:
For partitioned datasets on S3, just point at the parent directory. Ray picks up every .parquet file recursively:
The ray_remote_args parameter controls how many resources each read task uses. Set num_cpus lower than 1 if you want more read parallelism without being bottlenecked by CPU reservations.
If you only need specific columns, push the selection down to the reader. This skips reading unnecessary column data from disk entirely:
Running Batch Predictions with map_batches
The map_batches method is where the real work happens. You define a class that loads the model once in __init__ and processes pandas-style batches in __call__. Ray instantiates your class across workers and feeds batches to them.
Here’s a full example with a zero-shot classifier that handles GPU assignment:
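A sketch of such a predictor. The facebook/bart-large-mnli model, the label set, and the "text" column are illustrative assumptions:

```python
class ZeroShotPredictor:
    LABELS = ["positive", "negative", "neutral"]  # illustrative label set

    def __init__(self):
        from transformers import pipeline
        # Ray sets CUDA_VISIBLE_DEVICES per actor, so the assigned GPU
        # is always visible as device 0 here.
        self.pipe = pipeline(
            "zero-shot-classification",
            model="facebook/bart-large-mnli",
            device=0,
        )

    def __call__(self, batch):
        results = self.pipe(
            batch["text"].tolist(),
            candidate_labels=self.LABELS,
            batch_size=16,  # the model's internal batch size
        )
        batch["label"] = [r["labels"][0] for r in results]
        batch["score"] = [r["scores"][0] for r in results]
        return batch

if __name__ == "__main__":
    import ray

    ds = ray.data.read_parquet("s3://my-bucket/reviews/")  # illustrative path
    predictions = ds.map_batches(
        ZeroShotPredictor,
        concurrency=2,          # 2 actor replicas -> 2 model copies in GPU memory
        num_gpus=1,             # one GPU per actor
        batch_size=32,          # rows per pandas batch handed to __call__
        batch_format="pandas",
    )
    predictions.write_parquet("s3://my-bucket/predictions/")
```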
A few things to notice:
- concurrency=2 spins up 2 actor replicas. Each one loads the model independently, so you need enough GPU memory for 2 copies.
- num_gpus=1 tells Ray each actor needs one GPU. Ray assigns GPUs automatically – you don't pick device IDs yourself.
- batch_size=32 controls how many rows are in each pandas batch passed to __call__. This is separate from the model's internal batch_size parameter. Tune both for throughput.
- device=0 in the pipeline constructor works because Ray ensures each actor sees only its assigned GPU as device 0 via CUDA_VISIBLE_DEVICES.
For CPU-only inference, drop num_gpus and set device=-1 in the pipeline constructor. You can scale to dozens of CPU workers cheaply this way.
Monitoring Progress and Handling Failures
Ray Data prints progress bars by default showing blocks processed, throughput, and memory usage. For more control, use ray.data.DataContext:
For long-running jobs, you want fault tolerance. If one worker crashes mid-batch, you don’t want to restart the entire pipeline. Enable retries on the map_batches call:
Ray will retry failed batches up to 3 times on a different worker. This handles transient GPU errors, OOM on a single batch, or network blips when reading from S3.
For checkpointing, write intermediate results in partitions. If you’re processing 100 million rows, break the job into chunks:
This way, if the job fails on partition 47, you resume from partition 47 instead of starting over.
Common Errors and Fixes
OutOfMemoryError during inference
Your batch size is too large for the GPU. Reduce batch_size in map_batches and the model’s internal batch size. Start with batch_size=8 and increase until you hit 80-90% GPU utilization.
ray.exceptions.ActorDiedError: The actor died because of an error raised in its creation task
The model failed to load. Common causes: the model doesn’t fit on the GPU, the HuggingFace model name is wrong, or a dependency is missing. Test your predictor class locally first:
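For example, assuming your predictor looks something like this HuggingFace sentiment class (model name and column are illustrative):

```python
import pandas as pd

class SentimentPredictor:
    def __init__(self):
        from transformers import pipeline  # assumes transformers is installed
        self.pipe = pipeline("sentiment-analysis")

    def __call__(self, batch):
        preds = self.pipe(batch["text"].tolist())
        batch["label"] = [p["label"] for p in preds]
        batch["score"] = [p["score"] for p in preds]
        return batch

if __name__ == "__main__":
    # Run __init__ and __call__ directly, with no Ray involved, so a bad model
    # name or missing dependency fails with a readable traceback.
    predictor = SentimentPredictor()
    print(predictor(pd.DataFrame({"text": ["this works great"]})))
```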
ValueError: Batch type <class 'dict'> is not supported
In recent Ray versions, the default batch format is a dict of NumPy arrays, not a pandas DataFrame. If your __call__ expects DataFrames, request them explicitly:
Slow throughput despite multiple GPUs
Check if your pipeline is bottlenecked on reading. Add more read parallelism:
Also check that concurrency matches your GPU count. If you have 4 GPUs but concurrency=1, three GPUs sit idle.
FileNotFoundError when writing back to S3
Make sure s3fs is installed and that your AWS credentials are configured:
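For example (the credential values are placeholders for your own):

```shell
pip install s3fs

# Credentials via environment variables...
export AWS_ACCESS_KEY_ID=<your-key-id>
export AWS_SECRET_ACCESS_KEY=<your-secret-key>

# ...or via a configured profile:
aws configure
```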
Related Guides
- How to Build a Model Dependency Scanner and Vulnerability Checker
- How to Build a Model Input Validation Pipeline with Pydantic and FastAPI
- How to Build a Shadow Deployment Pipeline for ML Models
- How to Build a Model Configuration Management Pipeline with Hydra
- How to Build a Model Metadata Store with SQLite and FastAPI
- How to Build a Model Serving Pipeline with Ray Serve and FastAPI
- How to Build a Model Health Dashboard with FastAPI and SQLite
- How to Build a Model Feature Store Pipeline with Redis and FastAPI
- How to Build a Model Performance Alerting Pipeline with Webhooks
- How to Build a Model Rollback Pipeline with Health Checks