Most models never leave the cloud. That is a problem when your use case demands low latency at the edge – think robotics, drones, security cameras, or factory floors. The fastest path from a trained PyTorch model to an edge device like an NVIDIA Jetson is the ONNX-to-TensorRT pipeline. Here is how to do it end to end.
Export Your PyTorch Model to ONNX
Start with a trained PyTorch model. ONNX (Open Neural Network Exchange) is the interchange format that bridges PyTorch and TensorRT. The export is straightforward, but you need to get the details right.
Use `opset_version=17` or higher. Older opsets cause silent compatibility issues with TensorRT. The `dynamic_axes` argument lets you vary batch size at runtime, which matters when you want to benchmark different batch sizes later.
Validate the export immediately:
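A validation sketch along those lines, assuming the file and tensor names from the export step:

```python
import numpy as np
import onnx
import onnxruntime as ort

# Structural check: raises if the exported graph is malformed
onnx.checker.check_model(onnx.load("resnet50.onnx"))

# Numerical smoke test: run one batch through ONNX Runtime on CPU
session = ort.InferenceSession("resnet50.onnx", providers=["CPUExecutionProvider"])
dummy = np.random.randn(1, 3, 224, 224).astype(np.float32)
(output,) = session.run(None, {"input": dummy})
print(output.shape)  # expect (1, 1000) for an ImageNet classifier
```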
If `check_model` passes and ONNX Runtime produces the right output shape, you are good. Do not skip this step – catching shape mismatches here saves hours of debugging on the device.
Convert ONNX to TensorRT
TensorRT is NVIDIA’s inference optimizer. It fuses layers, selects optimal kernels for your specific GPU, and handles precision conversion. Use `trtexec`, the command-line tool that ships with TensorRT.
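A typical FP16 build looks like this (file names are examples; on Jetson the binary usually lives under /usr/src/tensorrt/bin):

```shell
# Build an FP16 engine; --shapes pins the optimization profile for the dynamic batch dim
trtexec \
  --onnx=resnet50.onnx \
  --saveEngine=resnet50_fp16.engine \
  --fp16 \
  --shapes=input:1x3x224x224
```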
The `--fp16` flag is the one you should reach for first. On Jetson Orin, FP16 typically gives you 2-3x speedup over FP32 with less than 0.5% accuracy loss. INT8 pushes that to 4-5x but requires a calibration dataset – a few hundred representative images fed through the model to determine quantization ranges.
My recommendation: Always start with FP16. Only go to INT8 if you have validated accuracy on your specific task and need the extra throughput. INT8 without proper calibration will silently degrade your model’s output quality.
Write a Calibration Script for INT8
If you do go the INT8 route, you need a calibration cache. Here is a Python calibrator using TensorRT’s API:
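A sketch of an entropy calibrator, assuming 224x224 RGB inputs scaled to [0, 1] (substitute your model's actual preprocessing, including mean/std normalization, in get_batch):

```python
import os
import numpy as np
import pycuda.autoinit  # noqa: F401 (initializes a CUDA context)
import pycuda.driver as cuda
import tensorrt as trt
from PIL import Image

class ImageCalibrator(trt.IInt8EntropyCalibrator2):
    def __init__(self, image_dir, cache_file="calibration.cache", batch_size=8):
        super().__init__()
        self.cache_file = cache_file
        self.batch_size = batch_size
        self.files = [os.path.join(image_dir, f) for f in sorted(os.listdir(image_dir))]
        self.index = 0
        # One device buffer, reused for every batch (FP32 NCHW, 224x224)
        self.device_input = cuda.mem_alloc(
            batch_size * 3 * 224 * 224 * np.dtype(np.float32).itemsize
        )

    def get_batch_size(self):
        return self.batch_size

    def get_batch(self, names):
        if self.index + self.batch_size > len(self.files):
            return None  # returning None tells TensorRT calibration is done
        batch = np.stack([
            np.asarray(
                Image.open(f).convert("RGB").resize((224, 224)), dtype=np.float32
            ).transpose(2, 0, 1) / 255.0  # simplified preprocessing: scale to [0, 1]
            for f in self.files[self.index:self.index + self.batch_size]
        ])
        self.index += self.batch_size
        cuda.memcpy_htod(self.device_input, np.ascontiguousarray(batch))
        return [int(self.device_input)]

    def read_calibration_cache(self):
        if os.path.exists(self.cache_file):
            with open(self.cache_file, "rb") as f:
                return f.read()

    def write_calibration_cache(self, cache):
        with open(self.cache_file, "wb") as f:
            f.write(cache)
```

TensorRT calls get_batch repeatedly until it returns None, then writes the measured quantization ranges to the cache file so later builds can skip calibration entirely.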
Feed 300-500 representative images through the calibrator. More is not necessarily better – after about 500 images the quantization ranges stabilize.
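Once the cache exists, the INT8 engine build can reuse it directly (file names are examples):

```shell
# Build an INT8 engine from a previously written calibration cache
trtexec \
  --onnx=resnet50.onnx \
  --saveEngine=resnet50_int8.engine \
  --int8 \
  --calib=calibration.cache
```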
Deploy and Run on Jetson
On your Jetson device (Orin Nano, Orin NX, or AGX Orin), copy the .engine file over and run inference with the TensorRT Python bindings:
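A minimal inference loop with the TensorRT 8.x Python bindings and pycuda (engine file name and shapes are examples; note that TensorRT 10 replaces execute_async_v2 with a tensor-address API):

```python
import numpy as np
import pycuda.autoinit  # noqa: F401 (initializes a CUDA context)
import pycuda.driver as cuda
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
with open("resnet50_fp16.engine", "rb") as f:
    engine = trt.Runtime(logger).deserialize_cuda_engine(f.read())
context = engine.create_execution_context()

# Host and device buffers for one batch
input_host = np.random.randn(1, 3, 224, 224).astype(np.float32)
output_host = np.empty((1, 1000), dtype=np.float32)
d_input = cuda.mem_alloc(input_host.nbytes)
d_output = cuda.mem_alloc(output_host.nbytes)
stream = cuda.Stream()

def infer(batch):
    # Async copy in, execute, copy out, then sync on the stream
    cuda.memcpy_htod_async(d_input, batch, stream)
    context.execute_async_v2([int(d_input), int(d_output)], stream.handle)
    cuda.memcpy_dtoh_async(output_host, d_output, stream)
    stream.synchronize()
    return output_host

# Warmup: the first runs include lazy allocation, so discard them
for _ in range(10):
    infer(input_host)
```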
Always do a warmup phase. TensorRT’s first few inferences include JIT compilation and memory allocation that skew your numbers. Ten warmup iterations is usually enough.
Benchmark Across Precision Modes
Here are realistic numbers for ResNet50 on a Jetson Orin NX (batch size 1):
| Precision | Latency (ms) | Throughput (FPS) | Accuracy Drop |
|---|---|---|---|
| FP32 | 8.2 | 122 | baseline |
| FP16 | 3.1 | 323 | < 0.3% |
| INT8 | 1.8 | 556 | 0.5-1.5% |
FP16 is the sweet spot for most production workloads. The accuracy-speed tradeoff at INT8 only makes sense for high-throughput scenarios like multi-camera video analytics where you are processing 8+ streams simultaneously.
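Numbers like these are easiest to reproduce with trtexec's built-in benchmark mode, which prints mean and median latency plus throughput after a timed run (engine name is an example):

```shell
# Warm up for 500 ms, then measure for 10 s, averaging over 100-run windows
trtexec \
  --loadEngine=resnet50_fp16.engine \
  --warmUp=500 \
  --duration=10 \
  --avgRuns=100
```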
Common Errors
“Unsupported ONNX opset version” – You exported with an opset that is too new for your TensorRT version. Check your TensorRT version with `dpkg -l | grep tensorrt` on Jetson, then match the opset. TensorRT 8.6+ supports opset 17.
“Tensor shapes mismatch during engine deserialization” – The `.engine` file is hardware-specific. An engine built on your workstation GPU will not run on Jetson. Always build the engine on the target device, or cross-compile with the correct device flag.
“pycuda._driver.LogicError: cuMemAlloc failed” – Out of GPU memory. Reduce `--workspace` in `trtexec`, lower the max batch size, or close other GPU processes. On Jetson, run `sudo jetson_clocks` to maximize available resources.
“INVALID_CONFIG” during INT8 calibration – Your calibration dataset is empty or the images failed to load. Verify the image directory path and that images are valid JPEGs. Also confirm `pycuda` is properly installed with `python3 -c "import pycuda.autoinit"`.
Engine build takes forever (30+ minutes) – Normal for the first build, especially with INT8. TensorRT tries thousands of kernel configurations. The resulting engine is cached, so subsequent loads are instant. Use `--timingCacheFile=timing.cache` to speed up rebuilds.
Tips for Production
Serialize your engine on the target device during a one-time setup step, not at every application startup. Engine deserialization takes 1-2 seconds versus 5-30 minutes for a full build.
Pin your TensorRT and JetPack versions. A JetPack upgrade will invalidate all your engine files because the CUDA compute kernels change. Track your JetPack version in your deployment manifest.
Use DLA (Deep Learning Accelerator) cores on Jetson Orin when available. They free up the GPU for other tasks:
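A DLA build is a small variation on the earlier trtexec invocation (file names are examples; DLA requires FP16 or INT8 precision):

```shell
# Target DLA core 0, falling back to the GPU for unsupported layers
trtexec \
  --onnx=resnet50.onnx \
  --saveEngine=resnet50_dla.engine \
  --fp16 \
  --useDLACore=0 \
  --allowGPUFallback
```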
Not all layers are DLA-compatible, so `--allowGPUFallback` is mandatory. But for models like ResNet or MobileNet, DLA handles 80%+ of the layers and cuts GPU utilization significantly.
Related Guides
- How to Optimize Model Inference with ONNX Runtime
- How to Compile and Optimize PyTorch Models with torch.compile
- How to Serve ML Models with NVIDIA Triton Inference Server
- How to Build a Model Serving Cluster with Ray Serve and Docker
- How to Run LLMs Locally with Ollama and llama.cpp
- How to Optimize Docker Images for ML Model Serving
- How to Deploy DeepSeek R1 on NVIDIA Blackwell with vLLM’s Disaggregated Serving
- How to Set Up Multi-GPU Training with PyTorch
- How to Build a Model Training Pipeline with Lightning Fabric
- How to Set Up Distributed Training with DeepSpeed and ZeRO