How to optimize cudaHostAlloc and cudaLaunchKernel times in pytorch training

I am trying to profile my model with pytorch profiler. I used the below code to profile

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA], record_shapes=True) as prof:
    with record_function("model_inference"):
        output_batch = self.model(input_batch)
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))

The profiler output is as follows

-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                        model_inference         3.17%      83.011ms        63.97%        1.675s        1.675s       0.000us         0.00%     373.844ms     373.844ms             1  
                                            aten::copy_         0.24%       6.333ms        39.93%        1.046s       1.504ms      28.758ms         7.69%      29.035ms      41.777us           695  
                                          cudaHostAlloc        36.02%     943.053ms        36.02%     943.053ms      30.421ms       0.000us         0.00%       0.000us       0.000us            31  
                                       cudaLaunchKernel        35.93%     940.773ms        35.93%     940.773ms      86.619us       0.000us         0.00%       0.000us       0.000us         10861  
                                           aten::repeat         0.04%     979.000us        33.77%     884.170ms      30.489ms       0.000us         0.00%     204.000us       7.034us            29  
                                           aten::conv2d         0.06%       1.481ms         8.71%     228.183ms     695.680us       0.000us         0.00%     145.688ms     444.171us           328  
                                      aten::convolution         0.05%       1.391ms         8.66%     226.702ms     691.165us       0.000us         0.00%     145.688ms     444.171us           328  
                                     aten::_convolution         0.10%       2.742ms         8.61%     225.311ms     686.924us       0.000us         0.00%     145.688ms     444.171us           328  
                                aten::cudnn_convolution         0.53%      13.803ms         8.33%     218.051ms     664.790us     137.822ms        36.87%     137.822ms     420.189us           328  
                                               cudaFree         7.46%     195.373ms         7.46%     195.373ms      48.843ms       0.000us         0.00%       0.000us       0.000us             4  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 2.618s
Self CUDA time total: 373.844ms

I notice that a large part of the time (self CPU) is taken by cudaHostAlloc and cudaLaunchKernel. What are these cudaHostAlloc and cudaLaunchKernel? Is it possible to reduce this time? If yes how? Are there any standard operations that I'm missing which is leading to this high time consumption?

PS: I'm new to profiling as such. Kindly let me know if any other information is needed.

I am not an expert but I think cudaLaunchKernel is called at every operation done with cuda. So I think you cannot optimize it.

If you plot a detailed tracing https://pytorch.org/tutorials/recipes/recipes/profiler_recipe.html#using-tracing-functionality, you see that it is called everytime you do a cuda operation like here for a linear layer.

One note on you profiler output: aten::copy_ cudaHostAlloc cudaLaunchKernel and aten::repeat all take roughly 40% of the CPU total time. I think it may be related to ProfilerActivity.CUDA that records CUDA operation but it also add a lot of CPU time on your first CUDA operation that is profiled. In my case a simple torch.ones(1000, device="cuda") took a full second of CPU time because it was the first cuda operation.

It may be the problem in your case, try to remove ProfilerActivity.CUDA and maybe aten::copy_ cudaHostAlloc cudaLaunchKernel and aten::repeat will have a much smaller CPU time and will disappear from the table.

Recommended topics

Hot tags