CUDA streams not overlapping

Asked 20/5, 2011 at 10:10 Answered 14/8, 2012 at 12:53

I have something very similar to the following code:

int k, no_streams = 4;
cudaStream_t stream[no_streams];
for(k = 0; k < no_streams; k++) cudaStreamCreate(&stream[k]);

cudaMalloc(&g_in,  size1*no_streams);
cudaMalloc(&g_out, size2*no_streams);

for (k = 0; k < no_streams; k++)
  cudaMemcpyAsync(g_in+k*size1/sizeof(float), h_ptr_in[k], size1, cudaMemcpyHostToDevice, stream[k]);

for (k = 0; k < no_streams; k++)
  mykernel<<<dimGrid, dimBlock, 0, stream[k]>>>(g_in+k*size1/sizeof(float), g_out+k*size2/sizeof(float));

for (k = 0; k < no_streams; k++)
  cudaMemcpyAsync(h_ptr_out[k], g_out+k*size2/sizeof(float), size2, cudaMemcpyDeviceToHost, stream[k]);

cudaThreadSynchronize();

cudaFree(g_in);
cudaFree(g_out);

'h_ptr_in' and 'h_ptr_out' are arrays of pointers allocated with cudaMallocHost (with no flags).

The problem is that the streams do not overlap. In the visual profiler I can see the kernel execution from the first stream overlapping with the copy (H2D) from the second stream but nothing else overlaps.

I may not have the resources to run two kernels concurrently (though I think I do), but at least kernel execution and copies should be overlapping, right? And if I put all three operations (H2D copy, kernel execution, D2H copy) within the same for-loop, none of them overlap...

Please HELP, what can be causing this?

I'm running on:

Ubuntu 10.04 x64

Device: "GeForce GTX 460" (CUDA Driver Version: 3.20, CUDA Runtime Version: 3.20, CUDA Capability Major/Minor version number: 2.1, Concurrent copy and execution: Yes, Concurrent kernel execution: Yes)

Rysler answered 20/5, 2011 at 10:10 Comment(2)

The profiling mechanism in CUDA causes serialization within streams under some circumstances. You can't use the profiler to judge overlap of asynchronous API operations. – Bannister 20/5, 2011 at 10:29

Thanks. Is there any other way to know for sure if the overlapping is correct? Judging by the timings it doesn't seem to be... – Rysler 20/5, 2011 at 10:36

According to this post on the NVIDIA forums, the profiler will serialize streaming to get accurate timing data. If you think your timings are off, make sure you're using CUDA events...
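As a rough illustration of the event-based timing suggested above, here is a minimal sketch. It is not the asker's code: the kernel `scale`, the helper `time_pipeline_ms`, and the buffer sizes are hypothetical stand-ins. It also assumes the default (legacy) behaviour of stream 0, where an event recorded in stream 0 waits for work previously issued to streams created with cudaStreamCreate. Comparing the two timings gives an indication of overlap without involving the profiler.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                    \
    do {                                                               \
        cudaError_t err_ = (call);                                     \
        if (err_ != cudaSuccess) {                                     \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,         \
                    cudaGetErrorString(err_));                         \
            exit(EXIT_FAILURE);                                        \
        }                                                              \
    } while (0)

// Hypothetical placeholder kernel standing in for mykernel.
__global__ void scale(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * in[i];
}

// Issues H2D copy, kernel, and D2H copy for n_streams chunks of n floats and
// returns the elapsed GPU time in milliseconds, measured with CUDA events.
// If use_streams is false, all work goes to stream 0 and is serialized.
float time_pipeline_ms(int n_streams, int n, bool use_streams) {
    size_t bytes = (size_t)n * sizeof(float);
    float *h_in, *h_out, *d_in, *d_out;
    CHECK(cudaMallocHost(&h_in, bytes * n_streams));   // pinned host memory
    CHECK(cudaMallocHost(&h_out, bytes * n_streams));
    CHECK(cudaMalloc(&d_in, bytes * n_streams));
    CHECK(cudaMalloc(&d_out, bytes * n_streams));
    for (size_t i = 0; i < (size_t)n * n_streams; ++i) h_in[i] = 1.0f;

    cudaStream_t *streams = new cudaStream_t[n_streams];
    for (int k = 0; k < n_streams; ++k) CHECK(cudaStreamCreate(&streams[k]));

    cudaEvent_t start, stop;
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));

    CHECK(cudaEventRecord(start, 0));
    for (int k = 0; k < n_streams; ++k) {
        cudaStream_t s = use_streams ? streams[k] : 0;
        size_t off = (size_t)k * n;
        CHECK(cudaMemcpyAsync(d_in + off, h_in + off, bytes,
                              cudaMemcpyHostToDevice, s));
        scale<<<(n + 255) / 256, 256, 0, s>>>(d_in + off, d_out + off, n);
        CHECK(cudaMemcpyAsync(h_out + off, d_out + off, bytes,
                              cudaMemcpyDeviceToHost, s));
    }
    // Assumption: legacy default-stream semantics, so this event completes
    // only after the work issued to the other streams above has finished.
    CHECK(cudaEventRecord(stop, 0));
    CHECK(cudaEventSynchronize(stop));

    float ms = 0.0f;
    CHECK(cudaEventElapsedTime(&ms, start, stop));

    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));
    for (int k = 0; k < n_streams; ++k) CHECK(cudaStreamDestroy(streams[k]));
    delete[] streams;
    CHECK(cudaFree(d_in));
    CHECK(cudaFree(d_out));
    CHECK(cudaFreeHost(h_in));
    CHECK(cudaFreeHost(h_out));
    return ms;
}

int main() {
    const int n_streams = 4, n = 1 << 22;  // arbitrary sizes for illustration
    time_pipeline_ms(n_streams, n, true);  // warm-up run, result discarded
    float serial = time_pipeline_ms(n_streams, n, false);
    float streamed = time_pipeline_ms(n_streams, n, true);
    printf("single stream: %.3f ms, %d streams: %.3f ms\n",
           serial, n_streams, streamed);
    return 0;
}
```

If the streamed time is clearly lower than the single-stream time, some overlap is taking place; if the two are about the same, the operations are probably being serialized. How much overlap to expect depends on the device and on the order in which the operations are issued.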

I've been experimenting with streaming lately, and I found the "simpleMultiCopy" example from the SDK to be really helpful, particularly with the appropriate logic and synchronizations.

Sunwise answered 20/5, 2011 at 14:42 Comment(0)

If you want to see kernels overlapping with other kernels (concurrent kernels), you need to use the CUDA Visual Profiler 5.0 that comes with the CUDA 5.0 Toolkit. I don't think previous profilers are capable of this. It should also show kernel and memcpy overlap.

Upcountry answered 14/8, 2012 at 12:53 Comment(1)

Is it really possible to observe streams overlap directly with the Visual Profiler 5.0? If so, how? I'm currently using the command-line profiler and importing the produced .csv file into the profiler, see NVIDIA post on overlapping streams. – Globe 6/6, 2013 at 10:42
