Deploy Your First Model in 5 Minutes
Triton Inference Server is one of the fastest ways to serve production ML models at scale. It supports PyTorch, TensorFlow, ONNX, TensorRT, and custom backends, all with built-in dynamic batching and GPU optimization.
Start the Triton server with Docker:
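A typical invocation looks like this; the image tag is an example (check NGC for a current release), and the volume mount assumes your model repository lives in the current directory:

```shell
docker run --rm --gpus all \
  -p 8000:8000 -p 8001:8001 -p 8002:8002 \
  -v "$(pwd)/model_repository:/models" \
  nvcr.io/nvidia/tritonserver:24.05-py3 \
  tritonserver --model-repository=/models
```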
This exposes three ports: HTTP (8000), gRPC (8001), and metrics (8002). The --gpus all flag gives Triton access to your GPUs. If you don’t have a GPU, remove that flag and Triton will run on CPU (slower, but works for testing).
Setting Up the Model Repository
Triton expects models in a specific directory structure. Each model gets its own folder with version subdirectories:
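The layout looks like this (the model and file names here are examples):

```
model_repository/
└── text_classifier/
    ├── config.pbtxt
    ├── 1/
    │   └── model.pt
    └── 2/
        └── model.pt
```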
The version number (1, 2, 3…) lets you deploy multiple versions simultaneously. Triton serves the highest version by default, but clients can request specific versions.
Create this structure for a PyTorch text classification model:
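A sketch of the setup commands, assuming the TorchScript export will be named model.pt:

```shell
mkdir -p model_repository/text_classifier/1
# After exporting, place the TorchScript file at:
#   model_repository/text_classifier/1/model.pt
```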
Now create model_repository/text_classifier/config.pbtxt:
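A minimal config might look like the following. The tensor names follow the PyTorch backend's input__N/output__N convention; the shapes (a variable-length token sequence in, two class scores out) are assumptions about this particular model:

```
name: "text_classifier"
backend: "pytorch"
max_batch_size: 16
input [
  {
    name: "input__0"
    data_type: TYPE_INT64
    dims: [ -1 ]
  }
]
output [
  {
    name: "output__0"
    data_type: TYPE_FP32
    dims: [ 2 ]
  }
]
dynamic_batching {
  preferred_batch_size: [ 4, 8, 16 ]
  max_queue_delay_microseconds: 5000
}
```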
The -1 in dims marks a variable-length dimension (here, the token sequence length); the batch dimension is implicit and governed by max_batch_size. The dynamic_batching section tells Triton to wait up to 5ms to collect requests into batches of 4, 8, or 16 before running inference. This dramatically improves throughput.
Choosing the Right Backend
PyTorch: Use this if you’re starting out or need maximum flexibility. It’s the easiest to set up: trace or script your model with torch.jit, save it, and it just works. Performance is good but not optimal.
ONNX: The middle ground. Convert your PyTorch or TensorFlow model to ONNX for better cross-platform compatibility and slightly faster inference. Great for models you’re shipping to clients or running on CPU.
TensorRT: The performance king for NVIDIA GPUs. Converting to TensorRT takes work (you need to specify input shapes, precision modes, etc.), but you’ll get 2-5x speedup on GPUs. Use this for production deployments where latency matters.
Here’s how to export a PyTorch model to TorchScript (Triton’s PyTorch backend):
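A self-contained sketch; the TextClassifier below (embedding plus linear head, with an assumed vocabulary size of 30522) is a stand-in so the example runs end to end — swap in your trained module:

```python
import os
import torch
import torch.nn as nn

# Stand-in model so the example runs end to end -- swap in your trained module.
class TextClassifier(nn.Module):
    def __init__(self, vocab_size: int = 30522, num_classes: int = 2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, 64)
        self.fc = nn.Linear(64, num_classes)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # (batch, seq_len) token IDs -> (batch, num_classes) scores
        return self.fc(self.embed(token_ids).mean(dim=1))

model = TextClassifier().eval()

# Trace with a representative input; use torch.jit.script instead if the
# forward pass has data-dependent control flow.
example = torch.randint(0, 30522, (1, 128))
traced = torch.jit.trace(model, example)

os.makedirs("model_repository/text_classifier/1", exist_ok=True)
traced.save("model_repository/text_classifier/1/model.pt")
```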
For ONNX export:
Update your config.pbtxt backend to backend: "onnxruntime" and change the model filename to model.onnx.
Dynamic Batching for Throughput
Dynamic batching is Triton’s killer feature. Instead of processing one request at a time, Triton waits for a short, configurable window (set in microseconds) to accumulate requests, then batches them together. This maxes out GPU utilization.
Add this to any config.pbtxt:
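For example, with a 5ms window and the preferred sizes discussed below:

```
dynamic_batching {
  preferred_batch_size: [ 4, 8, 16 ]
  max_queue_delay_microseconds: 5000
  preserve_ordering: false
}
```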
- preferred_batch_size: Triton will try to form batches of these sizes
- max_queue_delay_microseconds: Maximum time to wait before sending a partial batch
- preserve_ordering: false: Responses may arrive out of order (faster, fine for stateless inference)
Start with 5ms delay. If you’re getting low throughput, increase it to 10-20ms. If latency is too high, reduce it to 1-2ms. The sweet spot depends on your request rate.
Sending Inference Requests
Install the Triton client:
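The tritonclient package ships HTTP and gRPC clients; the [all] extra pulls in both:

```shell
pip install "tritonclient[all]"
```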
Send requests via HTTP:
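A sketch using the tritonclient HTTP API, assuming a running server with a model named text_classifier that takes an INT64 input input__0 and returns an FP32 output output__0; the random token IDs stand in for real tokenizer output:

```python
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Stand-in for real tokenizer output: one sequence of 128 token IDs
token_ids = np.random.randint(0, 30522, size=(1, 128), dtype=np.int64)

inputs = [httpclient.InferInput("input__0", token_ids.shape, "INT64")]
inputs[0].set_data_from_numpy(token_ids)
outputs = [httpclient.InferRequestedOutput("output__0")]

result = client.infer("text_classifier", inputs, outputs=outputs)
print(result.as_numpy("output__0"))  # class scores for the batch
```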
For production, use gRPC instead of HTTP; it’s typically 20-30% faster:
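The gRPC client mirrors the HTTP API almost exactly; this sketch assumes the same hypothetical text_classifier model and a server listening on port 8001:

```python
import numpy as np
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="localhost:8001")

token_ids = np.random.randint(0, 30522, size=(1, 128), dtype=np.int64)
inputs = [grpcclient.InferInput("input__0", token_ids.shape, "INT64")]
inputs[0].set_data_from_numpy(token_ids)

result = client.infer("text_classifier", inputs)
print(result.as_numpy("output__0"))
```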
Model Ensembles for Pipelines
Ensembles let you chain models together on the server side. For example, a text pipeline might run: tokenizer → embedding model → classifier. Triton handles the data flow.
Create model_repository/text_pipeline/config.pbtxt:
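One possible two-step pipeline, a tokenizer model feeding a classifier; the model names, tensor names, and shapes here are illustrative:

```
name: "text_pipeline"
platform: "ensemble"
max_batch_size: 16
input [
  {
    name: "RAW_TEXT"
    data_type: TYPE_STRING
    dims: [ 1 ]
  }
]
output [
  {
    name: "SCORES"
    data_type: TYPE_FP32
    dims: [ 2 ]
  }
]
ensemble_scheduling {
  step [
    {
      model_name: "tokenizer"
      model_version: -1
      input_map { key: "TEXT" value: "RAW_TEXT" }
      output_map { key: "TOKEN_IDS" value: "tokens" }
    },
    {
      model_name: "text_classifier"
      model_version: -1
      input_map { key: "input__0" value: "tokens" }
      output_map { key: "output__0" value: "SCORES" }
    }
  ]
}
```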
The input_map and output_map wire the models together. You send one request to text_pipeline, Triton runs each step in sequence and returns the final output. This reduces network overhead and simplifies client code.
Monitoring with Prometheus
Triton exposes metrics on port 8002. Scrape them with Prometheus:
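A minimal scrape job (the job name and interval are up to you):

```yaml
scrape_configs:
  - job_name: "triton"
    scrape_interval: 5s
    static_configs:
      - targets: ["localhost:8002"]
```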
Key metrics to watch:
- nv_inference_request_success / nv_inference_request_failure: Request counts per model
- nv_inference_request_duration_us: Latency percentiles (p50, p95, p99)
- nv_inference_queue_duration_us: Time spent waiting in the dynamic batching queue
- nv_gpu_utilization: GPU usage per device
- nv_gpu_memory_used_bytes: GPU memory consumption
If nv_inference_queue_duration_us is high, your max_queue_delay_microseconds is too long or your GPU is saturated. If nv_gpu_utilization is low, increase batch sizes or reduce queue delay.
Query live metrics:
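Metrics are served in the standard Prometheus text format, so a plain curl works:

```shell
curl -s localhost:8002/metrics | grep nv_inference
```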
You’ll see output like:
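Something along these lines (the metric names are Triton's; the label values and counts here are illustrative):

```
nv_inference_request_success{model="text_classifier",version="1"} 1042
nv_inference_request_duration_us{model="text_classifier",version="1"} 5231840
nv_inference_queue_duration_us{model="text_classifier",version="1"} 187220
nv_gpu_utilization{gpu_uuid="GPU-..."} 0.45
```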
Common Errors and Fixes
Error: “Model not ready”
Triton is still loading the model. Large models can take 10-30 seconds to load. Check docker logs <container-id> for progress. If it fails to load, you’ll see errors like “Failed to load model” with details about what’s wrong (usually shape mismatches or missing files).
Error: “Invalid input shape”
Your input dimensions don’t match config.pbtxt. Double-check the dims field and ensure you’re sending data with the right shape. Remember, Triton adds the batch dimension automatically, so a dims: [128] input expects shape (batch_size, 128).
Error: “All model instances are unavailable”
All GPU instances are busy. Increase instance_group count in config.pbtxt:
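For example:

```
instance_group [
  {
    count: 2
    kind: KIND_GPU
  }
]
```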
This creates 2 instances of the model on GPU. Each instance can handle one request at a time (or one batch with dynamic batching).
Low throughput even with dynamic batching
Check your request rate. Dynamic batching only helps if you have multiple concurrent requests. If you send requests sequentially (wait for a response, then send the next), you won’t see any batching. Use multiple threads or async clients to send concurrent requests.
GPU out of memory
Reduce max_batch_size in config.pbtxt or decrease the number of instances. You can also enable model unloading for models that aren’t frequently used.
Performance Tuning Tips
Use TensorRT for GPU inference. The conversion process is annoying, but the speedup is worth it for production workloads. Start with PyTorch or ONNX for development, then convert to TensorRT once you’ve validated correctness.
Enable FP16 precision and CUDA graphs if your model and GPU support them. Add to config.pbtxt:
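One plausible shape, assuming the onnxruntime backend with Triton's TensorRT execution accelerator; the precision_mode parameter requests FP16, and cuda { graphs: true } turns on CUDA graphs:

```
optimization {
  execution_accelerators {
    gpu_execution_accelerator: [
      {
        name: "tensorrt"
        parameters { key: "precision_mode" value: "FP16" }
      }
    ]
  }
  cuda { graphs: true }
}
```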
CUDA graphs reduce kernel launch overhead, and FP16 typically cuts inference cost substantially on GPUs with tensor cores. You can also enable FP16 when building standalone TensorRT engines.
Set max_queue_delay_microseconds based on your latency requirements. For interactive applications (chatbots), keep it under 5ms. For batch processing (transcription jobs), you can go up to 50-100ms to maximize throughput.
Monitor nv_inference_queue_duration_us and nv_inference_compute_duration_us separately. If queue duration is high, your batching settings are wrong. If compute duration is high, your model is slow or your GPU is underpowered.
Related Guides
- How to Optimize Docker Images for ML Model Serving
- How to Build a Model Serving Cluster with Ray Serve and Docker
- How to Build a Model Inference Cost Tracking Pipeline with OpenTelemetry
- How to Compile and Optimize PyTorch Models with torch.compile
- How to Optimize Model Inference with ONNX Runtime
- How to Deploy Models to Edge Devices with ONNX and TensorRT
- How to Set Up a GPU Cluster with Slurm for ML Training
- How to Run LLMs Locally with Ollama and llama.cpp
- How to Build a Model Training Pipeline with Lightning Fabric
- How to Build a Model Training Dashboard with TensorBoard and Prometheus