Deploy Your First Model in 5 Minutes

Triton Inference Server is NVIDIA's open-source server for running production ML models at scale. It supports PyTorch, TensorFlow, ONNX Runtime, TensorRT, and custom backends, all with built-in dynamic batching and GPU optimization.

Start the Triton server with Docker:

docker run --gpus all --rm \
  -p 8000:8000 -p 8001:8001 -p 8002:8002 \
  -v $(pwd)/model_repository:/models \
  nvcr.io/nvidia/tritonserver:25.01-py3 \
  tritonserver --model-repository=/models

This exposes three ports: HTTP (8000), gRPC (8001), and metrics (8002). The --gpus all flag gives Triton access to your GPUs. If you don’t have a GPU, remove that flag and Triton will run on CPU (slower, but works for testing).
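Once the container is up, you can confirm the server and a specific model are ready over the HTTP port using Triton's health endpoints (these paths are part of its KServe-compatible API; they return HTTP 200 when ready):

```shell
# Server-level readiness: 200 once Triton can accept inference requests
curl -v localhost:8000/v2/health/ready

# Per-model readiness (substitute your model name)
curl -v localhost:8000/v2/models/text_classifier/ready
```

Both commands require the server from the docker run above to be running.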

Setting Up the Model Repository

Triton expects models in a specific directory structure. Each model gets its own folder with version subdirectories:

model_repository/
├── text_classifier/
│   ├── config.pbtxt
│   └── 1/
│       └── model.pt
├── image_detector/
│   ├── config.pbtxt
│   └── 1/
│       └── model.onnx
└── text_embedder/
    ├── config.pbtxt
    └── 1/
        └── model.plan

The version number (1, 2, 3…) lets you deploy multiple versions simultaneously. Triton serves the highest version by default, but clients can request specific versions.
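By default Triton loads only the latest version. To keep older versions live side by side, add a version_policy to config.pbtxt (a sketch; adjust the version numbers to the directories you actually have):

```
version_policy: { specific: { versions: [1, 2] } }
```

With this in place, clients can pin requests to version 1 while version 2 serves by default.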

Create this structure for a PyTorch text classification model:

mkdir -p model_repository/text_classifier/1
# Copy your PyTorch model
cp my_model.pt model_repository/text_classifier/1/model.pt

Now create model_repository/text_classifier/config.pbtxt:

name: "text_classifier"
backend: "pytorch"
max_batch_size: 32

input [
  {
    name: "INPUT_IDS"
    data_type: TYPE_INT64
    dims: [128]
  },
  {
    name: "ATTENTION_MASK"
    data_type: TYPE_INT64
    dims: [128]
  }
]

output [
  {
    name: "LOGITS"
    data_type: TYPE_FP32
    dims: [2]
  }
]

dynamic_batching {
  preferred_batch_size: [4, 8, 16]
  max_queue_delay_microseconds: 5000
}

Because max_batch_size is set, Triton adds the batch dimension automatically, so dims describes the shape of a single request: [128] means each request is a 128-token sequence. (A -1 in dims marks a non-batch dimension that can vary, such as a variable sequence length.) The dynamic_batching section tells Triton to wait up to 5ms to collect requests into batches of 4, 8, or 16 before running inference. This dramatically improves throughput.

Choosing the Right Backend

PyTorch: Use this if you’re starting out or need maximum flexibility. It’s the easiest to set up because you can export models with torch.jit.save() and they just work. Performance is good but not optimal.

ONNX: The middle ground. Convert your PyTorch or TensorFlow model to ONNX for better cross-platform compatibility and slightly faster inference. Great for models you’re shipping to clients or running on CPU.

TensorRT: The performance king for NVIDIA GPUs. Converting to TensorRT takes work (you need to specify input shapes, precision modes, etc.), but you’ll get 2-5x speedup on GPUs. Use this for production deployments where latency matters.
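One common route to a TensorRT engine is the trtexec tool that ships with TensorRT. The shape flags below assume the 128-token classifier from this guide; adjust the batch ranges and tensor names to your model:

```shell
trtexec --onnx=model.onnx --saveEngine=model.plan --fp16 \
  --minShapes=INPUT_IDS:1x128,ATTENTION_MASK:1x128 \
  --optShapes=INPUT_IDS:8x128,ATTENTION_MASK:8x128 \
  --maxShapes=INPUT_IDS:32x128,ATTENTION_MASK:32x128
```

The resulting model.plan goes into the model's version directory, with platform: "tensorrt_plan" in its config.pbtxt.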

Here’s how to export a PyTorch model to TorchScript (Triton’s PyTorch backend):

import torch
from transformers import AutoModelForSequenceClassification

# Load a pretrained model (or your custom model).
# torchscript=True makes the model return tuples instead of dicts,
# which torch.jit.trace requires.
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english",
    torchscript=True
)
model.eval()

# Create example inputs matching the model's expected shape
example_input_ids = torch.randint(0, 30000, (1, 128))
example_attention_mask = torch.ones((1, 128), dtype=torch.long)

# Trace and save as TorchScript
traced_model = torch.jit.trace(model, (example_input_ids, example_attention_mask))
traced_model.save('model_repository/text_classifier/1/model.pt')

For ONNX export:

import torch

# Reuses `model` and the example inputs from the TorchScript section above
model.eval()
torch.onnx.export(
    model,
    (example_input_ids, example_attention_mask),
    'model_repository/text_classifier/1/model.onnx',
    input_names=['INPUT_IDS', 'ATTENTION_MASK'],
    output_names=['LOGITS'],
    dynamic_axes={
        'INPUT_IDS': {0: 'batch_size'},
        'ATTENTION_MASK': {0: 'batch_size'},
        'LOGITS': {0: 'batch_size'}
    },
    opset_version=17
)

Update your config.pbtxt backend to backend: "onnxruntime" and change the model filename to model.onnx.
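For the ONNX variant, the top of config.pbtxt becomes the following (the input and output sections stay the same):

```
name: "text_classifier"
backend: "onnxruntime"
max_batch_size: 32
```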

Dynamic Batching for Throughput

Dynamic batching is Triton’s killer feature. Instead of processing one request at a time, Triton waits for a short period (microseconds) to accumulate requests, then batches them together. This maxes out GPU utilization.

Add this to any config.pbtxt:

dynamic_batching {
  preferred_batch_size: [8, 16, 32]
  max_queue_delay_microseconds: 5000
  preserve_ordering: false
}
  • preferred_batch_size: Triton will try to form batches of these sizes
  • max_queue_delay_microseconds: Maximum time to wait before sending a partial batch
  • preserve_ordering: false: Responses may arrive out of order (faster, fine for stateless inference)

Start with 5ms delay. If you’re getting low throughput, increase it to 10-20ms. If latency is too high, reduce it to 1-2ms. The sweet spot depends on your request rate.
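A back-of-envelope model shows why batching raises throughput. The numbers below are illustrative assumptions, not benchmarks: a fixed per-batch overhead plus a small per-item cost, which is roughly how GPU inference behaves.

```python
# Illustrative cost model: fixed per-batch overhead + small per-item cost.
def batch_latency_ms(batch_size, overhead_ms=4.0, per_item_ms=0.25):
    return overhead_ms + per_item_ms * batch_size

def throughput_rps(batch_size):
    # Requests completed per second when running back-to-back batches
    return batch_size / (batch_latency_ms(batch_size) / 1000.0)

for bs in (1, 8, 32):
    print(f"batch={bs:2d}  latency={batch_latency_ms(bs):.2f} ms  "
          f"throughput={throughput_rps(bs):.0f} req/s")
```

Under these assumptions, batch 32 costs roughly 3x the latency of batch 1 but delivers more than 10x the throughput, which is the trade dynamic batching is making.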

Sending Inference Requests

Install the Triton client:

pip install 'tritonclient[all]'

Send requests via HTTP:

import tritonclient.http as httpclient
import numpy as np

client = httpclient.InferenceServerClient(url="localhost:8000")

# Prepare inputs
input_ids = np.random.randint(0, 30000, (1, 128)).astype(np.int64)
attention_mask = np.ones((1, 128), dtype=np.int64)

inputs = [
    httpclient.InferInput("INPUT_IDS", input_ids.shape, "INT64"),
    httpclient.InferInput("ATTENTION_MASK", attention_mask.shape, "INT64")
]
inputs[0].set_data_from_numpy(input_ids)
inputs[1].set_data_from_numpy(attention_mask)

outputs = [httpclient.InferRequestedOutput("LOGITS")]

# Send request
response = client.infer("text_classifier", inputs=inputs, outputs=outputs)
logits = response.as_numpy("LOGITS")
print(f"Logits shape: {logits.shape}")
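The server returns raw logits. To turn them into class probabilities, apply a softmax on the client side (plain numpy below; which index maps to which label depends on how the model was trained):

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

logits = np.array([[1.2, -0.3]])  # example values in place of response.as_numpy("LOGITS")
probs = softmax(logits)
print(probs, probs.argmax(axis=-1))
```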

For production, prefer gRPC over HTTP: its binary protocol has lower per-request serialization overhead, which adds up at high request rates:

import tritonclient.grpc as grpcclient
import numpy as np

client = grpcclient.InferenceServerClient(url="localhost:8001")

input_ids = np.random.randint(0, 30000, (1, 128)).astype(np.int64)
attention_mask = np.ones((1, 128), dtype=np.int64)

inputs = [
    grpcclient.InferInput("INPUT_IDS", input_ids.shape, "INT64"),
    grpcclient.InferInput("ATTENTION_MASK", attention_mask.shape, "INT64")
]
inputs[0].set_data_from_numpy(input_ids)
inputs[1].set_data_from_numpy(attention_mask)

outputs = [grpcclient.InferRequestedOutput("LOGITS")]

response = client.infer("text_classifier", inputs=inputs, outputs=outputs)
logits = response.as_numpy("LOGITS")

Model Ensembles for Pipelines

Ensembles let you chain models together on the server side. For example, a text pipeline might run: tokenizer → embedding model → classifier. Triton handles the data flow.

Create model_repository/text_pipeline/config.pbtxt:

name: "text_pipeline"
platform: "ensemble"
max_batch_size: 32

input [
  {
    name: "TEXT"
    data_type: TYPE_STRING
    dims: [1]
  }
]

output [
  {
    name: "CLASSIFICATION"
    data_type: TYPE_FP32
    dims: [2]
  }
]

ensemble_scheduling {
  step [
    {
      model_name: "tokenizer"
      model_version: -1
      input_map {
        key: "TEXT"
        value: "TEXT"
      }
      output_map {
        key: "INPUT_IDS"
        value: "input_ids"
      }
      output_map {
        key: "ATTENTION_MASK"
        value: "attention_mask"
      }
    },
    {
      model_name: "text_classifier"
      model_version: -1
      input_map {
        key: "INPUT_IDS"
        value: "input_ids"
      }
      input_map {
        key: "ATTENTION_MASK"
        value: "attention_mask"
      }
      output_map {
        key: "LOGITS"
        value: "CLASSIFICATION"
      }
    }
  ]
}

The input_map and output_map wire the models together. You send one request to text_pipeline, Triton runs both models, and returns the final output. This reduces network overhead and simplifies client code.
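The tokenizer step is typically implemented with Triton's Python backend: a model.py in the tokenizer model's version directory exposing a TritonPythonModel class. A minimal sketch follows; note that triton_python_backend_utils is only importable inside the Triton container, and the tokenizer choice here is an assumption matching the classifier used earlier:

```python
import numpy as np
import triton_python_backend_utils as pb_utils
from transformers import AutoTokenizer

class TritonPythonModel:
    def initialize(self, args):
        self.tokenizer = AutoTokenizer.from_pretrained(
            "distilbert-base-uncased-finetuned-sst-2-english"
        )

    def execute(self, requests):
        responses = []
        for request in requests:
            # TEXT arrives as a numpy array of bytes objects
            texts = [
                t.decode("utf-8")
                for t in pb_utils.get_input_tensor_by_name(request, "TEXT")
                                 .as_numpy().flatten()
            ]
            enc = self.tokenizer(
                texts, padding="max_length", truncation=True,
                max_length=128, return_tensors="np"
            )
            responses.append(pb_utils.InferenceResponse(output_tensors=[
                pb_utils.Tensor("INPUT_IDS", enc["input_ids"].astype(np.int64)),
                pb_utils.Tensor("ATTENTION_MASK", enc["attention_mask"].astype(np.int64)),
            ]))
        return responses
```

The tokenizer model needs its own config.pbtxt with backend: "python" and input/output names matching the ensemble's maps.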

Monitoring with Prometheus

Triton exposes metrics on port 8002. Scrape them with Prometheus:

# prometheus.yml
scrape_configs:
  - job_name: 'triton'
    static_configs:
      - targets: ['localhost:8002']

Key metrics to watch:

  • nv_inference_request_success / nv_inference_request_failure: Request counts per model
  • nv_inference_request_duration_us: Cumulative request latency (divide by the success count for average latency)
  • nv_inference_queue_duration_us: Time spent waiting in the dynamic batching queue
  • nv_gpu_utilization: GPU usage per GPU
  • nv_gpu_memory_used_bytes: GPU memory consumption

If nv_inference_queue_duration_us is high, your max_queue_delay_microseconds is too long or your GPU is saturated. If nv_gpu_utilization is low, increase batch sizes or reduce queue delay.
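Since the duration metrics are cumulative counters, a PromQL ratio of rates gives average latency over a window. A sketch (metric names are from Triton's metrics endpoint; the window is an arbitrary choice):

```promql
# Average request latency in ms per model over the last minute
rate(nv_inference_request_duration_us[1m])
  / rate(nv_inference_request_success[1m]) / 1000
```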

Query live metrics:

curl localhost:8002/metrics | grep nv_inference_request_success

You’ll see output like:

nv_inference_request_success{model="text_classifier",version="1"} 1523

Common Errors and Fixes

Error: “Model not ready”. Triton is still loading the model. Large models can take 10-30 seconds to load. Check docker logs <container-id> for progress. If it fails to load, you’ll see errors like “Failed to load model” with details about what’s wrong (usually shape mismatches or missing files).

Error: “Invalid input shape”. Your input dimensions don’t match config.pbtxt. Double-check the dims field and ensure you’re sending data with the right shape. Remember, Triton adds the batch dimension automatically, so a dims: [128] input expects shape (batch_size, 128).

Error: “All model instances are unavailable”. All GPU instances are busy. Increase instance_group count in config.pbtxt:

instance_group [
  {
    count: 2
    kind: KIND_GPU
  }
]

This creates 2 instances of the model on GPU. Each instance can handle one request at a time (or one batch with dynamic batching).

Low throughput even with dynamic batching Check your request rate. Dynamic batching only helps if you have multiple concurrent requests. If you’re sending requests sequentially (wait for response, then send next), you won’t see any batching. Use multiple threads or async clients to send concurrent requests.
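The concurrent fan-out pattern looks like the sketch below. send_request is a hypothetical stub standing in for a real client.infer call; the point is the thread-pool shape, which keeps multiple requests in flight so Triton has something to batch:

```python
from concurrent.futures import ThreadPoolExecutor

def send_request(i):
    # Stub in place of a real client.infer(...) call; returns a fake result
    return i * 2

# Up to 8 requests in flight at once instead of one at a time
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(send_request, range(32)))

print(len(results))
```

With the real client, each worker would build its InferInput objects and call client.infer as in the earlier examples.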

GPU out of memory Reduce max_batch_size in config.pbtxt or decrease the number of instances. You can also enable model unloading for models that aren’t frequently used.

Performance Tuning Tips

Use TensorRT for GPU inference. The conversion process is annoying, but the speedup is worth it for production workloads. Start with PyTorch or ONNX for development, then convert to TensorRT once you’ve validated correctness.

Enable CUDA graphs to cut kernel launch overhead. Add to config.pbtxt:

1
2
3
4
5
optimization {
  cuda {
    graphs: true
  }
}

CUDA graphs capture and replay kernel launch sequences, which helps most for small models where launch overhead dominates. FP16 precision, by contrast, is typically enabled when building the TensorRT engine (for example, trtexec --fp16).

Set max_queue_delay_microseconds based on your latency requirements. For interactive applications (chatbots), keep it under 5ms. For batch processing (transcription jobs), you can go up to 50-100ms to maximize throughput.

Monitor nv_inference_queue_duration_us and nv_inference_compute_duration_us separately. If queue duration is high, your batching settings are wrong. If compute duration is high, your model is slow or your GPU is underpowered.