ONNX Runtime takes your PyTorch or Hugging Face model, applies graph-level optimizations like operator fusion and constant folding, and runs inference through hardware-specific backends. The result: 2-4x speedup on CPU, lower latency on GPU, and no PyTorch dependency at serving time.
Here is the fastest path from a Hugging Face checkpoint to an optimized ONNX model running inference.
Export a Hugging Face Model to ONNX
The optimum-onnx library wraps the export and inference pipeline. Install it alongside ONNX Runtime:
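A minimal install, assuming pip; the `onnxruntime` extra pulls in both Optimum's ONNX support and ONNX Runtime itself:

```shell
pip install "optimum[onnxruntime]"

# For GPU serving, use the GPU build of ONNX Runtime instead:
# pip install "optimum[onnxruntime-gpu]"
```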
Export any supported Transformers model with the CLI:
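A sketch of the CLI export, using the public distilbert-base-uncased-finetuned-sst-2-english checkpoint as a stand-in for your own model:

```shell
optimum-cli export onnx \
  --model distilbert-base-uncased-finetuned-sst-2-english \
  --task text-classification \
  ./onnx_distilbert/
```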
This creates model.onnx plus tokenizer files in ./onnx_distilbert/. The CLI handles dynamic axes (variable batch size and sequence length) automatically.
For programmatic export, load with ORTModelForSequenceClassification and set export=True:
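A minimal programmatic sketch; the checkpoint name is an example, swap in your own:

```python
from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer

model_id = "distilbert-base-uncased-finetuned-sst-2-english"

# export=True converts the PyTorch checkpoint to ONNX on the fly
model = ORTModelForSequenceClassification.from_pretrained(model_id, export=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Persist the ONNX model and tokenizer files for serving
model.save_pretrained("./onnx_distilbert")
tokenizer.save_pretrained("./onnx_distilbert")

# Inference looks just like a regular transformers model
inputs = tokenizer("ONNX export worked.", return_tensors="pt")
logits = model(**inputs).logits
```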
The ORTModel* classes are drop-in replacements for AutoModel*. Your existing inference code barely changes.
Configure Graph Optimizations
ONNX Runtime applies graph transformations at session creation time. These fuse operations, eliminate redundant nodes, and rewrite subgraphs for the target hardware. You control this through SessionOptions:
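A sketch of session configuration, assuming the export path from the previous step:

```python
import onnxruntime as ort

sess_options = ort.SessionOptions()
sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL

# Optional: write the optimized graph to disk so you can inspect the fusions
sess_options.optimized_model_filepath = "model_optimized.onnx"

session = ort.InferenceSession(
    "onnx_distilbert/model.onnx",
    sess_options,
    providers=["CPUExecutionProvider"],
)
```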
The four optimization levels:
- ORT_DISABLE_ALL – no optimizations, useful for debugging
- ORT_ENABLE_BASIC – constant folding, redundant node elimination, Conv+BatchNorm fusion
- ORT_ENABLE_EXTENDED – GELU fusion, layer normalization fusion, attention fusion (runs after graph partitioning)
- ORT_ENABLE_ALL – everything above plus layout optimizations like NCHW to NCHWc on CPU
For transformer models, ORT_ENABLE_ALL is almost always what you want. The extended optimizations fuse multi-node attention patterns into single optimized kernels, which is where most of the speedup comes from.
Pick the Right Execution Provider
Execution providers are ONNX Runtime’s abstraction for hardware backends. The provider list is a priority order – ORT tries each one and falls back to the next:
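A sketch of provider selection; if CUDA is unavailable, ORT silently falls back to CPU:

```python
import onnxruntime as ort

session = ort.InferenceSession(
    "onnx_distilbert/model.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

# get_providers() reports which providers the session actually resolved
print(session.get_providers())
```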
If you install onnxruntime-gpu but CUDAExecutionProvider doesn’t show up, you have a CUDA version mismatch. ONNX Runtime 1.19+ requires CUDA 12.x and cuDNN 9.x. Check your versions:
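One way to check, assuming the NVIDIA tools are on your PATH:

```shell
python -c "import onnxruntime as ort; print(ort.__version__)"
python -c "import onnxruntime as ort; print(ort.get_available_providers())"
nvidia-smi       # driver and the maximum CUDA version it supports
nvcc --version   # installed CUDA toolkit version
```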
Quantize for Even Faster Inference
Dynamic quantization converts float32 weights to int8 ahead of time and quantizes activations on the fly during inference. It needs no calibration data and works well for transformer models:
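A sketch using ONNX Runtime's built-in dynamic quantizer; the paths assume the export from earlier:

```python
from onnxruntime.quantization import QuantType, quantize_dynamic

quantize_dynamic(
    model_input="onnx_distilbert/model.onnx",
    model_output="onnx_distilbert/model_quantized.onnx",
    weight_type=QuantType.QInt8,  # int8 weights; activations quantized at runtime
)
```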
Load the quantized model identically:
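Loading is the same call with a different path:

```python
import onnxruntime as ort

session = ort.InferenceSession(
    "onnx_distilbert/model_quantized.onnx",
    providers=["CPUExecutionProvider"],
)
```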
Expect 1.5-2x additional speedup on modern CPUs with AVX-512 or VNNI instructions. On GPU, INT8 quantization benefits require Tensor Core hardware (T4, A100, or newer). Older GPUs like V100 won’t see gains from quantization.
Benchmark Before and After
Always measure. Here is a minimal benchmark comparing PyTorch eager, ONNX Runtime, and ONNX Runtime quantized:
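A minimal benchmark sketch, assuming the float32 and quantized ONNX files from the earlier steps exist on disk:

```python
import time

import onnxruntime as ort
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_id)
np_inputs = dict(tokenizer("ONNX Runtime makes inference fast.", return_tensors="np"))

def bench(fn, warmup=10, iters=100):
    """Average latency in milliseconds over `iters` calls."""
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - start) / iters * 1000

# PyTorch eager baseline
pt_model = AutoModelForSequenceClassification.from_pretrained(model_id).eval()
pt_inputs = {k: torch.from_numpy(v) for k, v in np_inputs.items()}
with torch.no_grad():
    print(f"pytorch eager: {bench(lambda: pt_model(**pt_inputs)):.2f} ms")

# ONNX Runtime, float32 and int8
for path in ("onnx_distilbert/model.onnx", "onnx_distilbert/model_quantized.onnx"):
    sess = ort.InferenceSession(path, providers=["CPUExecutionProvider"])
    print(f"{path}: {bench(lambda: sess.run(None, np_inputs)):.2f} ms")
```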
On an Intel Xeon with AVX-512 you’ll typically see 2-3x speedup for the base ONNX model, and 3-4x after INT8 quantization. On GPU, expect 20-40% improvement over PyTorch eager for small batch sizes where kernel launch overhead matters most.
Common Errors and Fixes
Wrong input data type
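The failure typically looks something like this (exact wording varies by version):

```
InvalidArgument: [ONNXRuntimeError] : 2 : INVALID_ARGUMENT : Unexpected input data type.
Actual: (tensor(double)) , expected: (tensor(float))
```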
ONNX Runtime expects float32 inputs. Cast your numpy arrays explicitly:
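A minimal sketch of the fix:

```python
import numpy as np

x = np.random.rand(1, 128)   # np.random.rand returns float64
x = x.astype(np.float32)     # ONNX "tensor(float)" means float32
```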
Mismatched input names
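ORT reports the name it did not recognize; the message looks roughly like:

```
InvalidArgument: [ONNXRuntimeError] : 2 : INVALID_ARGUMENT : Invalid Feed Input Name:inputs
```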
Input names must match exactly. Inspect them with:
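A quick inspection sketch, assuming the export path from earlier:

```python
import onnxruntime as ort

session = ort.InferenceSession(
    "onnx_distilbert/model.onnx", providers=["CPUExecutionProvider"]
)
for inp in session.get_inputs():
    print(inp.name, inp.shape, inp.type)
# Typical transformer exports expose input_ids and attention_mask
```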
CUDA provider not found
If ort.get_available_providers() only shows CPUExecutionProvider despite installing onnxruntime-gpu, you likely have conflicting packages. Fix it:
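Uninstall both packages first, then reinstall only the one you need:

```shell
pip uninstall -y onnxruntime onnxruntime-gpu
pip install onnxruntime-gpu
```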
You cannot have both onnxruntime and onnxruntime-gpu installed simultaneously – they conflict. Pick one.
Unsupported opset version
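The error names the offending opset; it reads roughly like:

```
Fail: [ONNXRuntimeError] : 1 : FAIL : ... Opset 21 is under development and support
for this is limited.
```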
Your ONNX Runtime version is too old for the model’s opset. Either upgrade ONNX Runtime or re-export with a lower opset:
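Either path works; optimum-cli exposes an --opset flag for the re-export (opset 14 here is an example target):

```shell
pip install --upgrade onnxruntime

# or re-export targeting an older opset
optimum-cli export onnx \
  --model distilbert-base-uncased-finetuned-sst-2-english \
  --opset 14 \
  ./onnx_distilbert/
```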
When ONNX Runtime Is Worth It
Use ONNX Runtime when you need to serve models in production without a PyTorch dependency, when you’re running on CPU and need every millisecond, or when you want a single model format that runs on CUDA, TensorRT, OpenVINO, CoreML, and DirectML without code changes.
Skip it if you’re doing research and retraining constantly – the export-optimize-validate loop adds friction. Also skip it for models with lots of custom operators or dynamic control flow that won’t export cleanly to ONNX.
For the common case of deploying a fine-tuned transformer to a CPU-based API server, ONNX Runtime with INT8 quantization is one of the highest-impact optimizations you can make with the least amount of work.
Related Guides
- How to Compile and Optimize PyTorch Models with torch.compile
- How to Deploy Models to Edge Devices with ONNX and TensorRT
- How to Optimize Docker Images for ML Model Serving
- How to Build a Model Serving Cluster with Ray Serve and Docker
- How to Scale ML Training and Inference with Ray
- How to Build a Model Inference Cost Tracking Pipeline with OpenTelemetry
- How to Serve ML Models with NVIDIA Triton Inference Server
- How to Build a Model Inference Queue with Celery and Redis
- How to Build a Model Training Checkpoint Pipeline with PyTorch
- How to Profile and Optimize GPU Memory for LLM Training