onnxruntime inference is way slower than pytorch on GPU

I was comparing the inference times for the same input using PyTorch and onnxruntime, and I find that onnxruntime is actually slower on GPU while being significantly faster on CPU.

I was trying this on Windows 10.

  • ONNX Runtime installed from source - ONNX Runtime version: 1.11.0 (onnx version 1.10.1)
  • Python version - 3.8.12
  • CUDA/cuDNN version - cuda version 11.5, cudnn version 8.2
  • GPU model and memory - Quadro M2000M, 4 GB

Relevant code -

import torch
from torchvision import models
import onnxruntime    # to run inference on ONNX models, we use ONNX Runtime
import onnx
import os
import time

batch_size = 1
total_samples = 1000
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
    
def convert_to_onnx(resnet):
   resnet.eval()
   dummy_input = (torch.randn(batch_size, 3, 224, 224, device=device)).to(device=device)
   input_names = [ 'input' ]
   output_names = [ 'output' ]
   torch.onnx.export(resnet, 
               dummy_input,
               "resnet18.onnx",
               verbose=True,
               opset_version=13,
               input_names=input_names,
               output_names=output_names,
               export_params=True,
               do_constant_folding=True,
               dynamic_axes={
                  'input': {0: 'batch_size'},  # variable length axes
                  'output': {0: 'batch_size'}}        
               )
                  
def infer_pytorch(resnet):
   print('Pytorch Inference')
   print('==========================')
   print()

   x = torch.randn((batch_size, 3, 224, 224))
   x = x.to(device=device)

   latency = []
   for i in range(total_samples):
      t0 = time.time()
      resnet.eval()
      with torch.no_grad():
         out = resnet(x)
      latency.append(time.time() - t0)

   print('Number of runs:', len(latency))
   print("Average PyTorch {} Inference time = {} ms".format(device.type, format(sum(latency) * 1000 / len(latency), '.2f')))  

def to_numpy(tensor):
   return tensor.detach().cpu().numpy() if tensor.requires_grad else tensor.cpu().numpy()

def infer_onnxruntime():
   print('Onnxruntime Inference')
   print('==========================')
   print()

   onnx_model = onnx.load("resnet18.onnx")
   onnx.checker.check_model(onnx_model)

   # Input
   x = torch.randn((batch_size, 3, 224, 224))
   x = x.to(device=device)
   x = to_numpy(x)

   so = onnxruntime.SessionOptions()
   so.execution_mode = onnxruntime.ExecutionMode.ORT_SEQUENTIAL
   so.graph_optimization_level = onnxruntime.GraphOptimizationLevel.ORT_ENABLE_ALL
   
   exproviders = ['CUDAExecutionProvider', 'CPUExecutionProvider']

   model_onnx_path = os.path.join(".", "resnet18.onnx")
   ort_session = onnxruntime.InferenceSession(model_onnx_path, so, providers=exproviders)

   options = ort_session.get_provider_options()
   cuda_options = options['CUDAExecutionProvider']
   cuda_options['cudnn_conv_use_max_workspace'] = '1'
   ort_session.set_providers(['CUDAExecutionProvider'], [cuda_options])

   #IOBinding
   input_names = ort_session.get_inputs()[0].name
   output_names = ort_session.get_outputs()[0].name
   io_binding = ort_session.io_binding()

   io_binding.bind_cpu_input(input_names, x)
   io_binding.bind_output(output_names, device)
   
   #warm up run
   ort_session.run_with_iobinding(io_binding)
   ort_outs = io_binding.copy_outputs_to_cpu()

   latency = []

   for i in range(total_samples):
      t0 = time.time()
      ort_session.run_with_iobinding(io_binding)
      latency.append(time.time() - t0)
      ort_outs = io_binding.copy_outputs_to_cpu()
   print('Number of runs:', len(latency))
   print("Average onnxruntime {} Inference time = {} ms".format(device.type, format(sum(latency) * 1000 / len(latency), '.2f')))   

if __name__ == '__main__':
   torch.cuda.empty_cache()
   resnet = (models.resnet18(pretrained=True)).to(device=device)
   convert_to_onnx(resnet)
   infer_onnxruntime()
   infer_pytorch(resnet)

Output

If run on CPU,

Average onnxruntime cpu Inference time = 18.48 ms
Average PyTorch cpu Inference time = 51.74 ms

but, if run on GPU, I see

Average onnxruntime cuda Inference time = 47.89 ms
Average PyTorch cuda Inference time = 8.94 ms

If I change the graph optimization level to onnxruntime.GraphOptimizationLevel.ORT_DISABLE_ALL, I see some improvement in inference time on GPU, but it's still slower than PyTorch.
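For reference, that experiment only changes the optimization level; everything else in infer_onnxruntime() stays the same (minimal sketch):

so = onnxruntime.SessionOptions()
so.execution_mode = onnxruntime.ExecutionMode.ORT_SEQUENTIAL
# only this line differs from the code above
so.graph_optimization_level = onnxruntime.GraphOptimizationLevel.ORT_DISABLE_ALL
ort_session = onnxruntime.InferenceSession("resnet18.onnx", so,
                                           providers=['CUDAExecutionProvider', 'CPUExecutionProvider'])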

I use IO binding for the input numpy array, and the nodes of the model are on the GPU.

Further, during onnxruntime processing, I print device usage stats and I see this:

Using device: cuda:0
GPU Device name: Quadro M2000M
Memory Usage:
Allocated: 0.1 GB
Cached:    0.1 GB

So, the GPU device is being used.
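To double-check that the CUDA execution provider (and not a CPU fallback) is active in the session, I also print what onnxruntime itself reports (small sketch; ort_session is the session created in infer_onnxruntime()):

print(onnxruntime.get_device())                  # 'GPU' for a CUDA-enabled build
print(onnxruntime.get_available_providers())     # should list 'CUDAExecutionProvider'
print(ort_session.get_providers())               # providers actually enabled on this session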

Further, I have used the resnet18.onnx model from the ModelZoo to see if it is an issue with my converted model, but I get the same results.

What am I doing wrong or missing here?

Jetta answered 17/1, 2022 at 11:03 Comment(1)
You should not use torch.cuda.empty_cache(), as it will slow down your code for no gain: discuss.pytorch.org/t/… – Protestant

When calculating inference time, exclude all code that should run only once, like resnet.eval(), from the loop.
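For example, roughly like this (a sketch of your loop with the one-time calls hoisted out of the timed section):

resnet.eval()                                   # one-time setup, outside the timed loop
x = torch.randn((batch_size, 3, 224, 224)).to(device=device)

latency = []
with torch.no_grad():
   for i in range(total_samples):
      t0 = time.time()
      out = resnet(x)
      latency.append(time.time() - t0)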

Please include the imports in your example:

import torch
from torchvision import models
import onnxruntime    # to run inference on ONNX models, we use ONNX Runtime
import onnx
import os
import time

After running your example on GPU only, I found that the times differ only by about 2x, so the speed difference may be caused by framework characteristics. For more details, explore ONNX conversion optimization.

Onnxruntime Inference
==========================

Number of runs: 1000
Average onnxruntime cuda Inference time = 4.76 ms
Pytorch Inference
==========================

Number of runs: 1000
Average PyTorch cuda Inference time = 2.27 ms
Protestant answered 17/1, 2022 at 14:54 Comment(3)
Thanks for the reply. – Jetta
1) I have included the imports now, sorry about missing that earlier. 2) resnet.eval() would anyway affect only the PyTorch inference time, and since PyTorch is already faster for GPU inference, excluding it won't explain why onnxruntime is slower. 3) For your run, did you use Windows 10 or Linux? Also, with optimizations enabled via so.graph_optimization_level = onnxruntime.GraphOptimizationLevel.ORT_ENABLE_ALL, do you actually see that onnxruntime is faster? – Jetta
Regarding the link you shared, I had gone through it earlier and tried most of the suggestions there, but was not seeing much change. But seeing better performance on your machine for the same code means maybe something is wrong in my setup. – Jetta

For CPU you don't need to use IO binding; it is only needed for GPU. Also, don't change the session options, as onnxruntime selects the best options by default.

The following things may help to speed up GPU inference:

  1. Make sure to install onnxruntime-gpu, which comes with prebuilt CUDA EP and TensorRT EP.
  2. You are currently binding the inputs and outputs to the CPU. When using onnxruntime with the CUDA EP, you should bind them to the GPU (to avoid copying inputs/outputs between CPU and GPU); refer here.

I suggest you use the io_binding.bind_input() method instead of io_binding.bind_cpu_input().
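For example, something along these lines (a rough sketch, assuming the same input name 'input' and output name 'output' as in your script; the OrtValue places the array on the GPU once, before the timed loop):

import numpy as np

# place the input on the GPU up front
x_ortvalue = onnxruntime.OrtValue.ortvalue_from_numpy(x, 'cuda', 0)

io_binding = ort_session.io_binding()
io_binding.bind_input(name='input', device_type='cuda', device_id=0,
                      element_type=np.float32, shape=x_ortvalue.shape(),
                      buffer_ptr=x_ortvalue.data_ptr())
# let onnxruntime allocate the output on the GPU as well
io_binding.bind_output(name='output', device_type='cuda', device_id=0)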

   for i in range(total_samples):
      t0 = time.time()
      ort_session.run_with_iobinding(io_binding)
      latency.append(time.time() - t0)
   -> ort_outs = io_binding.copy_outputs_to_cpu()

Copying the output from GPU to CPU on every one of the 1000 iterations drops performance.
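So the measurement loop would look roughly like this (sketch), copying the result back to the host only once, after timing:

latency = []
for i in range(total_samples):
   t0 = time.time()
   ort_session.run_with_iobinding(io_binding)
   latency.append(time.time() - t0)

# copy the GPU-resident output back to the CPU once, after the loop
ort_outs = io_binding.copy_outputs_to_cpu()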

Majestic answered 18/1, 2022 at 2:57 Comment(2)
1) I am using onnxruntime-gpu, since I built it from source along with the CUDA execution provider. 2) I had referred to that page before, but 'Scenario 1' there states 'A graph is executed on a device other than CPU, for instance CUDA. Users can use IOBinding to put input on CUDA as the follows', and its example uses io_binding.bind_cpu_input('input', X) for this scenario, so I did it that way since for GPU I want to move the input from CPU to GPU. – Jetta
Anyway, I also tried what you suggested by changing the code to data = onnxruntime.OrtValue.ortvalue_from_numpy(x, device.type, 0) and io_binding.bind_input(input_names, device, 0, np.float32, [batch_size, 3, 224, 224], data.data_ptr()), but the inference time for onnxruntime is pretty much the same (around 48 ms, and way slower than PyTorch). 3) As for the output copying, I am not including that time in the performance measurement. – Jetta
