Inference with TensorRT .engine file on python

Asked 11/12, 2019 at 7:29 Answered 4/7, 2021 at 4:37

python tensorflow deep-learning computer-vision tensorrt

I used Nvidia's Transfer Learning Toolkit(TLT) to train and then used the tlt-converter to convert the .etlt model into an .engine file.

I want to use this .engine file for inference in python. But since I trained using TLT I dont have any frozen graphs or pb files which is what all the TensorRT inference tutorials need.

I would like to know if python inference is possible on .engine files. If not, what are the supported conversions(UFF,ONNX) to make this possible?

Viglione answered 11/12, 2019 at 7:29 Comment(4)

You mean running the .engine file with TensorRT from within Python? Does this documentation explain what you need? – Hush 7/2, 2020 at 13:29

Yes but I am having trouble figuring out the values in the h_output. Since I used TLT(detectnetv2), I dont know the model specifications and so I am not able to make sense of the list of floating point numbers that it predicts. – Viglione 8/2, 2020 at 6:54

@Sharan, hi, did you resolve this issue? i also want to inference detectnetv2 using just tensor rt, but don't know how to prepare input and handle output correctly... – Folks 20/11, 2020 at 15:40

@Folks No I was not able to resolve it. I even put a bounty on this question but still got no answers. – Viglione 22/11, 2020 at 13:25

Python inference is possible via .engine files. Example below loads a .trt file (literally same thing as an .engine file) from disk and performs single inference.

In this project, I've converted an ONNX model to TRT model using onnx2trt executable before using it. You can even convert a PyTorch model to TRT using ONNX as a middleware.


import tensorrt as trt
import numpy as np
import os

import pycuda.driver as cuda
import pycuda.autoinit



class HostDeviceMem(object):
    def __init__(self, host_mem, device_mem):
        self.host = host_mem
        self.device = device_mem

    def __str__(self):
        return "Host:\n" + str(self.host) + "\nDevice:\n" + str(self.device)

    def __repr__(self):
        return self.__str__()

class TrtModel:
    
    def __init__(self,engine_path,max_batch_size=1,dtype=np.float32):
        
        self.engine_path = engine_path
        self.dtype = dtype
        self.logger = trt.Logger(trt.Logger.WARNING)
        self.runtime = trt.Runtime(self.logger)
        self.engine = self.load_engine(self.runtime, self.engine_path)
        self.max_batch_size = max_batch_size
        self.inputs, self.outputs, self.bindings, self.stream = self.allocate_buffers()
        self.context = self.engine.create_execution_context()

                
                
    @staticmethod
    def load_engine(trt_runtime, engine_path):
        trt.init_libnvinfer_plugins(None, "")             
        with open(engine_path, 'rb') as f:
            engine_data = f.read()
        engine = trt_runtime.deserialize_cuda_engine(engine_data)
        return engine
    
    def allocate_buffers(self):
        
        inputs = []
        outputs = []
        bindings = []
        stream = cuda.Stream()
        
        for binding in self.engine:
            size = trt.volume(self.engine.get_binding_shape(binding)) * self.max_batch_size
            host_mem = cuda.pagelocked_empty(size, self.dtype)
            device_mem = cuda.mem_alloc(host_mem.nbytes)
            
            bindings.append(int(device_mem))

            if self.engine.binding_is_input(binding):
                inputs.append(HostDeviceMem(host_mem, device_mem))
            else:
                outputs.append(HostDeviceMem(host_mem, device_mem))
        
        return inputs, outputs, bindings, stream
       
            
    def __call__(self,x:np.ndarray,batch_size=2):
        
        x = x.astype(self.dtype)
        
        np.copyto(self.inputs[0].host,x.ravel())
        
        for inp in self.inputs:
            cuda.memcpy_htod_async(inp.device, inp.host, self.stream)
        
        self.context.execute_async(batch_size=batch_size, bindings=self.bindings, stream_handle=self.stream.handle)
        for out in self.outputs:
            cuda.memcpy_dtoh_async(out.host, out.device, self.stream) 
            
        
        self.stream.synchronize()
        return [out.host.reshape(batch_size,-1) for out in self.outputs]


        
        
if __name__ == "__main__":
 
    batch_size = 1
    trt_engine_path = os.path.join("..","models","main.trt")
    model = TrtModel(trt_engine_path)
    shape = model.engine.get_binding_shape(0)

    
    data = np.random.randint(0,255,(batch_size,*shape[1:]))/255
    result = model(data,batch_size)

Stay safe y'all!

Omidyar answered 11/5, 2021 at 18:35 Comment(0)

You can use the python to infer the .engine file. There are two ways to do that,

You need to install the Tensorrt and its compatible cuda on your system. On the same environment you need to convert the .etlt file into .engine file. Later you can use the python script to perform the inference. Reproducing the same activity on different system is quite tedious.
Another and simple approach is run the inference script from the tlt docker itself. This way, the docker provides all the necessary compatible packages. And this approach is efficient to scale. You can refer below article for detailed information. https://azmoosa.medium.com/deploying-nvidia-tlt-and-tensorrt-applications-using-docker-containers-ecb7abd6366f

Bookmaker answered 4/7, 2021 at 4:37 Comment(0)

Recommended topics

Hot tags