Most models never leave the cloud. That is a problem when your use case demands low latency at the edge – think robotics, drones, security cameras, or factory floors. The fastest path from a trained PyTorch model to an edge device like an NVIDIA Jetson is the ONNX-to-TensorRT pipeline. Here is how to do it end to end.

Export Your PyTorch Model to ONNX

Start with a trained PyTorch model. ONNX (Open Neural Network Exchange) is the interchange format that bridges PyTorch and TensorRT. The export is straightforward, but you need to get the details right.

import torch
import torchvision.models as models

# Load your trained model
model = models.resnet50(weights=None)
model.load_state_dict(torch.load("resnet50_trained.pth", map_location="cpu"))
model.eval()

# Create a dummy input matching your expected input shape
dummy_input = torch.randn(1, 3, 224, 224)

# Export to ONNX
torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",
    export_params=True,
    opset_version=17,          # Use 17+ for best TensorRT compatibility
    do_constant_folding=True,  # Folds constant ops for smaller graph
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={
        "input": {0: "batch_size"},
        "output": {0: "batch_size"},
    },
)

print("Exported model.onnx")

Use opset_version=17 or higher. Older opsets can cause compatibility issues with TensorRT that you may not notice until engine build time. The dynamic_axes argument lets you vary the batch size at runtime, which matters when you want to benchmark different batch sizes later.

Validate the export immediately:

import onnx
import onnxruntime as ort
import numpy as np

# Check the model structure is valid
onnx_model = onnx.load("model.onnx")
onnx.checker.check_model(onnx_model)

# Run inference with ONNX Runtime to verify outputs
session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
input_data = np.random.randn(1, 3, 224, 224).astype(np.float32)
outputs = session.run(None, {"input": input_data})
print(f"Output shape: {outputs[0].shape}")  # Should be (1, 1000) for ResNet50

If check_model passes and ONNX Runtime produces the right output shape, you are good. Do not skip this step – catching shape mismatches here saves hours of debugging on the device.
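Beyond the shape check, it is worth comparing the ONNX Runtime output against the original PyTorch output numerically. A minimal numpy-only helper for that comparison (the tolerances here are assumptions; tune them to your model):

```python
import numpy as np

def outputs_close(reference, candidate, rtol=1e-3, atol=1e-5):
    """Check two output arrays agree within tolerance; report the worst gap."""
    reference = np.asarray(reference, dtype=np.float32)
    candidate = np.asarray(candidate, dtype=np.float32)
    max_abs_diff = float(np.max(np.abs(reference - candidate)))
    close = bool(np.allclose(reference, candidate, rtol=rtol, atol=atol))
    return close, max_abs_diff

# Example: compare a PyTorch output (converted to numpy) with the ORT output
a = np.random.randn(1, 1000).astype(np.float32)
b = a + 1e-6  # simulated tiny numerical drift between runtimes
ok, diff = outputs_close(a, b)
print(f"match={ok}, max abs diff={diff:.2e}")
```

Small differences (around 1e-5) are expected from floating-point reordering; large ones mean the export went wrong.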

Convert ONNX to TensorRT

TensorRT is NVIDIA’s inference optimizer. It fuses layers, selects optimal kernels for your specific GPU, and handles precision conversion. Use trtexec, the command-line tool that ships with TensorRT.

# FP16 conversion -- best balance of speed and accuracy
trtexec \
  --onnx=model.onnx \
  --saveEngine=model_fp16.engine \
  --fp16 \
  --workspace=4096 \
  --minShapes=input:1x3x224x224 \
  --optShapes=input:4x3x224x224 \
  --maxShapes=input:8x3x224x224 \
  --verbose

# INT8 conversion -- fastest, needs calibration data
trtexec \
  --onnx=model.onnx \
  --saveEngine=model_int8.engine \
  --int8 \
  --fp16 \
  --calib=calibration_cache.bin \
  --workspace=4096 \
  --minShapes=input:1x3x224x224 \
  --optShapes=input:4x3x224x224 \
  --maxShapes=input:8x3x224x224

The --fp16 flag is the one you should reach for first. On Jetson Orin, FP16 typically gives you 2-3x speedup over FP32 with less than 0.5% accuracy loss. INT8 pushes that to 4-5x but requires a calibration dataset – a few hundred representative images fed through the model to determine quantization ranges.

My recommendation: Always start with FP16. Only go to INT8 if you have validated accuracy on your specific task and need the extra throughput. INT8 without proper calibration will silently degrade your model’s output quality.
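A quick way to validate INT8 is top-1 agreement: run the same images through the FP32 and INT8 engines and measure how often the predicted class matches. A numpy sketch (the logits below are stand-ins for your two engines' outputs):

```python
import numpy as np

def top1_agreement(logits_ref, logits_quant):
    """Fraction of samples where the two models predict the same class."""
    pred_ref = np.argmax(logits_ref, axis=1)
    pred_quant = np.argmax(logits_quant, axis=1)
    return float(np.mean(pred_ref == pred_quant))

# Example with fake logits for 4 samples, 10 classes
ref = np.random.randn(4, 10)
quant = ref + np.random.randn(4, 10) * 0.01  # small quantization noise
print(f"top-1 agreement: {top1_agreement(ref, quant):.2%}")
```

If agreement drops more than a point or two on a few hundred held-out images, revisit your calibration data before shipping the INT8 engine.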

Write a Calibration Script for INT8

If you do go the INT8 route, you need a calibration cache. Here is a Python calibrator using TensorRT’s API:

import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit
import numpy as np
from PIL import Image
from pathlib import Path

class ImageCalibrator(trt.IInt8EntropyCalibrator2):
    def __init__(self, image_dir, batch_size=8, input_shape=(3, 224, 224)):
        super().__init__()
        self.batch_size = batch_size
        self.input_shape = input_shape
        self.images = list(Path(image_dir).glob("*.jpg"))[:500]
        self.current_index = 0
        self.device_input = cuda.mem_alloc(
            batch_size * int(np.prod(input_shape)) * 4
        )

    def get_batch_size(self):
        return self.batch_size

    def get_batch(self, names):
        # TensorRT expects full batches; stop once a complete batch is unavailable
        if self.current_index + self.batch_size > len(self.images):
            return None
        batch = []
        for i in range(self.batch_size):
            img = Image.open(self.images[self.current_index + i]).convert("RGB")
            img = img.resize((224, 224))
            arr = np.array(img).astype(np.float32).transpose(2, 0, 1) / 255.0
            # Match your training preprocessing -- ImageNet mean/std here
            arr = (arr - np.array([0.485, 0.456, 0.406]).reshape(3, 1, 1)) \
                / np.array([0.229, 0.224, 0.225]).reshape(3, 1, 1)
            batch.append(arr)
        self.current_index += self.batch_size
        batch = np.ascontiguousarray(np.array(batch, dtype=np.float32))
        cuda.memcpy_htod(self.device_input, batch)
        return [int(self.device_input)]

    def read_calibration_cache(self):
        cache_file = Path("calibration_cache.bin")
        if cache_file.exists():
            return cache_file.read_bytes()
        return None

    def write_calibration_cache(self, cache):
        Path("calibration_cache.bin").write_bytes(cache)

Feed 300-500 representative images through the calibrator. More is not necessarily better – after about 500 images the quantization ranges stabilize.
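The class above still needs to be wired into a build. With the TensorRT Python builder API (TensorRT 8.x; some names differ slightly across versions, and the calibration_images/ directory is an assumed path of representative JPEGs), plugging it in looks roughly like this:

```python
import tensorrt as trt

def build_int8_engine(onnx_path, engine_path, calibrator):
    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
    )
    parser = trt.OnnxParser(network, logger)
    with open(onnx_path, "rb") as f:
        if not parser.parse(f.read()):
            raise RuntimeError("\n".join(
                str(parser.get_error(i)) for i in range(parser.num_errors)
            ))

    config = builder.create_builder_config()
    config.set_flag(trt.BuilderFlag.INT8)
    config.set_flag(trt.BuilderFlag.FP16)  # fall back to FP16 where INT8 is unsupported
    config.int8_calibrator = calibrator

    # The ONNX export used a dynamic batch axis, so a shape profile is required
    profile = builder.create_optimization_profile()
    profile.set_shape("input", (1, 3, 224, 224), (4, 3, 224, 224), (8, 3, 224, 224))
    config.add_optimization_profile(profile)
    config.set_calibration_profile(profile)

    engine_bytes = builder.build_serialized_network(network, config)
    with open(engine_path, "wb") as f:
        f.write(engine_bytes)

build_int8_engine("model.onnx", "model_int8.engine",
                  ImageCalibrator("calibration_images/"))
```

The build writes calibration_cache.bin as a side effect (via write_calibration_cache), which you can then reuse with trtexec --calib on subsequent builds.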

Deploy and Run on Jetson

On your Jetson device (Orin Nano, Orin NX, or AGX Orin), copy the .engine file over and run inference with the TensorRT Python bindings:

import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit
import numpy as np
import time

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def load_engine(engine_path):
    with open(engine_path, "rb") as f:
        runtime = trt.Runtime(TRT_LOGGER)
        return runtime.deserialize_cuda_engine(f.read())

def infer(engine, input_data):
    # For production, create the context and buffers once and reuse them;
    # doing it per call adds allocation overhead to every inference.
    context = engine.create_execution_context()
    # The engine was built with a dynamic batch axis, so set the actual shape
    context.set_binding_shape(0, input_data.shape)
    # Allocate device memory
    d_input = cuda.mem_alloc(input_data.nbytes)
    output_shape = (input_data.shape[0], 1000)
    output = np.empty(output_shape, dtype=np.float32)
    d_output = cuda.mem_alloc(output.nbytes)

    stream = cuda.Stream()
    cuda.memcpy_htod_async(d_input, input_data, stream)
    context.execute_async_v2(
        bindings=[int(d_input), int(d_output)], stream_handle=stream.handle
    )
    cuda.memcpy_dtoh_async(output, d_output, stream)
    stream.synchronize()
    return output

# Load and run
engine = load_engine("model_fp16.engine")
test_input = np.random.randn(1, 3, 224, 224).astype(np.float32)

# Warmup -- first few runs are always slower
for _ in range(10):
    infer(engine, test_input)

# Benchmark
times = []
for _ in range(100):
    start = time.perf_counter()
    result = infer(engine, test_input)
    times.append(time.perf_counter() - start)

avg_ms = np.mean(times) * 1000
p99_ms = np.percentile(times, 99) * 1000
print(f"Average latency: {avg_ms:.2f} ms")
print(f"P99 latency:     {p99_ms:.2f} ms")
print(f"Throughput:      {1000 / avg_ms:.1f} FPS")

Always do a warmup phase. TensorRT’s first few inferences include JIT compilation and memory allocation that skew your numbers. Ten warmup iterations is usually enough.

Benchmark Across Precision Modes

Here are realistic numbers for ResNet50 on a Jetson Orin NX (batch size 1):

Precision   Latency (ms)   Throughput (FPS)   Accuracy Drop
FP32        8.2            122                baseline
FP16        3.1            323                < 0.3%
INT8        1.8            556                0.5-1.5%

FP16 is the sweet spot for most production workloads. The accuracy-speed tradeoff at INT8 only makes sense for high-throughput scenarios like multi-camera video analytics where you are processing 8+ streams simultaneously.

Common Errors

“Unsupported ONNX opset version” – You exported with an opset that is too new for your TensorRT version. Check your TensorRT version with dpkg -l | grep tensorrt on Jetson, then match the opset. TensorRT 8.6+ supports opset 17.

“Tensor shapes mismatch during engine deserialization” – The .engine file is hardware-specific. An engine built on your workstation GPU will not run on Jetson. Always build the engine on the target device, or on an identical device running the same TensorRT version.

“pycuda._driver.LogicError: cuMemAlloc failed” – Out of GPU memory. Reduce --workspace in trtexec, lower the max batch size, or close other GPU processes. On Jetson the GPU shares RAM with the CPU, so freeing system memory helps too.

“INVALID_CONFIG” during INT8 calibration – Your calibration dataset is empty or the images failed to load. Verify the image directory path and that images are valid JPEGs. Also confirm pycuda is properly installed with python3 -c "import pycuda.autoinit".

Engine build takes forever (30+ minutes) – Normal for the first build, especially with INT8. TensorRT tries thousands of kernel configurations. The resulting engine is cached, so subsequent loads are instant. Use --timingCacheFile=timing.cache to speed up rebuilds.

Tips for Production

Serialize your engine on the target device during a one-time setup step, not at every application startup. Engine deserialization takes 1-2 seconds versus 5-30 minutes for a full build.
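A sketch of that one-time setup step, assuming trtexec is on the PATH (paths and flags here are illustrative):

```python
import subprocess
from pathlib import Path

def ensure_engine(onnx_path, engine_path):
    """Build the engine once; later startups just reuse the cached file."""
    engine = Path(engine_path)
    if engine.exists():
        return engine  # deserialization at startup takes seconds, not minutes
    subprocess.run(
        ["trtexec", f"--onnx={onnx_path}",
         f"--saveEngine={engine_path}", "--fp16"],
        check=True,
    )
    return engine
```

Call ensure_engine() from your provisioning script, then have the application load the returned path at startup.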

Pin your TensorRT and JetPack versions. A JetPack upgrade will invalidate all your engine files because the CUDA compute kernels change. Track your JetPack version in your deployment manifest.

Use DLA (Deep Learning Accelerator) cores on Jetson Orin when available. They free up the GPU for other tasks:

trtexec \
  --onnx=model.onnx \
  --saveEngine=model_dla.engine \
  --fp16 \
  --useDLACore=0 \
  --allowGPUFallback

Not all layers are DLA-compatible, so --allowGPUFallback is mandatory. But for models like ResNet or MobileNet, DLA handles 80%+ of the layers and cuts GPU utilization significantly.