ONNX Runtime takes your PyTorch or Hugging Face model, applies graph-level optimizations like operator fusion and constant folding, and runs inference through hardware-specific backends. The result: 2-4x speedup on CPU, lower latency on GPU, and no PyTorch dependency at serving time.
Here is the fastest path from a Hugging Face checkpoint to an optimized ONNX model running inference.
Export a Hugging Face Model to ONNX
The optimum-onnx library wraps the export and inference pipeline. Install it alongside ONNX Runtime:
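A minimal install, assuming pip; the `onnxruntime` extra pulls in both Optimum's ONNX support and ONNX Runtime itself:

```shell
pip install "optimum[onnxruntime]"

# For GPU serving, use the GPU build of ONNX Runtime instead:
# pip install "optimum[onnxruntime-gpu]"
```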
Export any supported Transformers model with the CLI:
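A sketch of the CLI export, using the public distilbert-base-uncased-finetuned-sst-2-english checkpoint as a stand-in for your own model:

```shell
optimum-cli export onnx \
  --model distilbert-base-uncased-finetuned-sst-2-english \
  --task text-classification \
  ./onnx_distilbert/
```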
This creates model.onnx plus tokenizer files in ./onnx_distilbert/. The CLI handles dynamic axes (variable batch size and sequence length) automatically.
For programmatic export, load with ORTModelForSequenceClassification and set export=True:
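A minimal programmatic sketch; the checkpoint name is an example, swap in your own:

```python
from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer

model_id = "distilbert-base-uncased-finetuned-sst-2-english"

# export=True converts the PyTorch checkpoint to ONNX on the fly
model = ORTModelForSequenceClassification.from_pretrained(model_id, export=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Persist the ONNX model and tokenizer files for serving
model.save_pretrained("./onnx_distilbert")
tokenizer.save_pretrained("./onnx_distilbert")

# Inference looks just like a regular transformers model
inputs = tokenizer("ONNX export worked.", return_tensors="pt")
logits = model(**inputs).logits
```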
The ORTModel* classes are drop-in replacements for AutoModel*. Your existing inference code barely changes.
Configure Graph Optimizations
ONNX Runtime applies graph transformations at session creation time. These fuse operations, eliminate redundant nodes, and rewrite subgraphs for the target hardware. You control this through SessionOptions:
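A sketch of session configuration, assuming the export path from the previous step:

```python
import onnxruntime as ort

sess_options = ort.SessionOptions()
sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL

# Optional: write the optimized graph to disk so you can inspect the fusions
sess_options.optimized_model_filepath = "model_optimized.onnx"

session = ort.InferenceSession(
    "onnx_distilbert/model.onnx",
    sess_options,
    providers=["CPUExecutionProvider"],
)
```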
The four optimization levels:
- ORT_DISABLE_ALL – no optimizations, useful for debugging
- ORT_ENABLE_BASIC – constant folding, redundant node elimination, Conv+BatchNorm fusion
- ORT_ENABLE_EXTENDED – GELU fusion, layer normalization fusion, attention fusion (runs after graph partitioning)
- ORT_ENABLE_ALL – everything above plus layout optimizations like NCHW to NCHWc on CPU
For transformer models, ORT_ENABLE_ALL is almost always what you want. The extended optimizations fuse multi-node attention patterns into single optimized kernels, which is where most of the speedup comes from.
Pick the Right Execution Provider
Execution providers are ONNX Runtime’s abstraction for hardware backends. The provider list is a priority order – ORT tries each one and falls back to the next:
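A sketch of provider selection; if CUDA is unavailable, ORT silently falls back to CPU:

```python
import onnxruntime as ort

session = ort.InferenceSession(
    "onnx_distilbert/model.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

# get_providers() reports which providers the session actually resolved
print(session.get_providers())
```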
If you install onnxruntime-gpu but CUDAExecutionProvider doesn’t show up, you have a CUDA version mismatch. ONNX Runtime 1.19+ requires CUDA 12.x and cuDNN 9.x. Check your versions:
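One way to check, assuming the NVIDIA tools are on your PATH:

```shell
python -c "import onnxruntime as ort; print(ort.__version__)"
python -c "import onnxruntime as ort; print(ort.get_available_providers())"
nvidia-smi       # driver and the maximum CUDA version it supports
nvcc --version   # installed CUDA toolkit version
```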
Quantize for Even Faster Inference
Dynamic quantization converts float32 weights to int8 ahead of time and quantizes activations on the fly during inference. It needs no calibration data and works well for transformer models:
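A sketch using ONNX Runtime's built-in dynamic quantizer; the paths assume the export from earlier:

```python
from onnxruntime.quantization import QuantType, quantize_dynamic

quantize_dynamic(
    model_input="onnx_distilbert/model.onnx",
    model_output="onnx_distilbert/model_quantized.onnx",
    weight_type=QuantType.QInt8,  # int8 weights; activations quantized at runtime
)
```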
Load the quantized model identically:
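Loading is the same call with a different path:

```python
import onnxruntime as ort

session = ort.InferenceSession(
    "onnx_distilbert/model_quantized.onnx",
    providers=["CPUExecutionProvider"],
)
```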
Expect 1.5-2x additional speedup on modern CPUs with AVX-512 or VNNI instructions. On GPU, INT8 quantization benefits require Tensor Core hardware (T4, A100, or newer). Older GPUs like V100 won’t see gains from quantization.
Benchmark Before and After
Always measure. Here is a minimal benchmark comparing PyTorch eager, ONNX Runtime, and ONNX Runtime quantized:
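A minimal benchmark sketch, assuming the float32 and quantized ONNX files from the earlier steps exist on disk:

```python
import time

import onnxruntime as ort
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_id)
np_inputs = dict(tokenizer("ONNX Runtime makes inference fast.", return_tensors="np"))

def bench(fn, warmup=10, iters=100):
    """Average latency in milliseconds over `iters` calls."""
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - start) / iters * 1000

# PyTorch eager baseline
pt_model = AutoModelForSequenceClassification.from_pretrained(model_id).eval()
pt_inputs = {k: torch.from_numpy(v) for k, v in np_inputs.items()}
with torch.no_grad():
    print(f"pytorch eager: {bench(lambda: pt_model(**pt_inputs)):.2f} ms")

# ONNX Runtime, float32 and int8
for path in ("onnx_distilbert/model.onnx", "onnx_distilbert/model_quantized.onnx"):
    sess = ort.InferenceSession(path, providers=["CPUExecutionProvider"])
    print(f"{path}: {bench(lambda: sess.run(None, np_inputs)):.2f} ms")
```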
On an Intel Xeon with AVX-512 you’ll typically see 2-3x speedup for the base ONNX model, and 3-4x after INT8 quantization. On GPU, expect 20-40% improvement over PyTorch eager for small batch sizes where kernel launch overhead matters most.
Common Errors and Fixes
Wrong input data type
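The failure typically looks something like this (exact wording varies by version):

```
InvalidArgument: [ONNXRuntimeError] : 2 : INVALID_ARGUMENT : Unexpected input data type.
Actual: (tensor(double)) , expected: (tensor(float))
```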
ONNX Runtime expects float32 inputs. Cast your numpy arrays explicitly:
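A minimal sketch of the fix:

```python
import numpy as np

x = np.random.rand(1, 128)   # np.random.rand returns float64
x = x.astype(np.float32)     # ONNX "tensor(float)" means float32
```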
Mismatched input names
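ORT reports the name it did not recognize; the message looks roughly like:

```
InvalidArgument: [ONNXRuntimeError] : 2 : INVALID_ARGUMENT : Invalid Feed Input Name:inputs
```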
Input names must match exactly. Inspect them with:
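A quick inspection sketch, assuming the export path from earlier:

```python
import onnxruntime as ort

session = ort.InferenceSession(
    "onnx_distilbert/model.onnx", providers=["CPUExecutionProvider"]
)
for inp in session.get_inputs():
    print(inp.name, inp.shape, inp.type)
# Typical transformer exports expose input_ids and attention_mask
```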
CUDA provider not found
If ort.get_available_providers() only shows CPUExecutionProvider despite installing onnxruntime-gpu, you likely have conflicting packages. Fix it:
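Uninstall both packages first, then reinstall only the one you need:

```shell
pip uninstall -y onnxruntime onnxruntime-gpu
pip install onnxruntime-gpu
```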
You cannot have both onnxruntime and onnxruntime-gpu installed simultaneously – they conflict. Pick one.
Unsupported opset version
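The error names the offending opset; it reads roughly like:

```
Fail: [ONNXRuntimeError] : 1 : FAIL : ... Opset 21 is under development and support
for this is limited.
```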
Your ONNX Runtime version is too old for the model’s opset. Either upgrade ONNX Runtime or re-export with a lower opset:
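Either path works; optimum-cli exposes an --opset flag for the re-export (opset 14 here is an example target):

```shell
pip install --upgrade onnxruntime

# or re-export targeting an older opset
optimum-cli export onnx \
  --model distilbert-base-uncased-finetuned-sst-2-english \
  --opset 14 \
  ./onnx_distilbert/
```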
When ONNX Runtime Is Worth It
Use ONNX Runtime when you need to serve models in production without a PyTorch dependency, when you’re running on CPU and need every millisecond, or when you want a single model format that runs on CUDA, TensorRT, OpenVINO, CoreML, and DirectML without code changes.
Skip it if you’re doing research and retraining constantly – the export-optimize-validate loop adds friction. Also skip it for models with lots of custom operators or dynamic control flow that won’t export cleanly to ONNX.
For the common case of deploying a fine-tuned transformer to a CPU-based API server, ONNX Runtime with INT8 quantization is one of the highest-impact optimizations you can make with the least amount of work.
Related Guides
- How to Compile and Optimize PyTorch Models with torch.compile
- How to Deploy Models to Edge Devices with ONNX and TensorRT
- How to Optimize Docker Images for ML Model Serving
- How to Build a Model Serving Cluster with Ray Serve and Docker
- How to Scale ML Training and Inference with Ray
- How to Build a Model Inference Cost Tracking Pipeline with OpenTelemetry
- How to Serve ML Models with NVIDIA Triton Inference Server
- How to Build a Model Inference Queue with Celery and Redis
- How to Build a Model Training Checkpoint Pipeline with PyTorch
- How to Profile and Optimize GPU Memory for LLM Training