I have code very similar to the following:
const int no_streams = 4;
int k;
float *g_in, *g_out;  // device buffers, one slice per stream
cudaStream_t stream[no_streams];
for (k = 0; k < no_streams; k++)
    cudaStreamCreate(&stream[k]);
cudaMalloc(&g_in, size1*no_streams);
cudaMalloc(&g_out, size2*no_streams);
// Issue all H2D copies, then all kernels, then all D2H copies
for (k = 0; k < no_streams; k++)
    cudaMemcpyAsync(g_in+k*size1/sizeof(float), h_ptr_in[k], size1, cudaMemcpyHostToDevice, stream[k]);
for (k = 0; k < no_streams; k++)
    mykernel<<<dimGrid, dimBlock, 0, stream[k]>>>(g_in+k*size1/sizeof(float), g_out+k*size2/sizeof(float));
for (k = 0; k < no_streams; k++)
    cudaMemcpyAsync(h_ptr_out[k], g_out+k*size2/sizeof(float), size2, cudaMemcpyDeviceToHost, stream[k]);
cudaThreadSynchronize();
cudaFree(g_in);
cudaFree(g_out);
'h_ptr_in' and 'h_ptr_out' are arrays of pointers to pinned host buffers allocated with cudaMallocHost (with no flags).
The problem is that the streams do not overlap. In the Visual Profiler, I can see the kernel execution from the first stream overlapping with the H2D copy from the second stream, but nothing else overlaps.
I may not have the resources to run two kernels concurrently (I think I do), but at least the kernel execution and the copies should be overlapping, right? And if I put all three operations (H2D copy, kernel execution, D2H copy) within the same for-loop, none of them overlap at all.
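For reference, here is a minimal self-contained sketch of the single-loop variant mentioned above. Everything not in the original snippet is an assumption made up for illustration: the kernel body, the extra `n` parameter, the element count, the launch configuration, equal input and output sizes, the `CHECK` macro, and the helper name `run_single_loop`. It only shows the issue order, not the real workload.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Bail out of the helper with -1 if a runtime call fails (illustrative only;
// resources already allocated are not released on this path).
#define CHECK(call) do { if ((call) != cudaSuccess) return -1; } while (0)

// Placeholder kernel (assumption): the real mykernel is not shown in the post.
__global__ void mykernel(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * in[i];
}

// Hypothetical helper: runs the single-loop variant and returns the number
// of mismatched output elements, or -1 if a CUDA call failed.
int run_single_loop(void)
{
    const int no_streams = 4;
    const int n = 1 << 20;                  // elements per stream (assumed)
    const size_t size = n * sizeof(float);  // bytes per stream, same for in and out (assumed)

    cudaStream_t stream[no_streams];
    float *h_ptr_in[no_streams], *h_ptr_out[no_streams];
    float *g_in = NULL, *g_out = NULL;
    int k, i, errors = 0;

    for (k = 0; k < no_streams; k++) {
        CHECK(cudaStreamCreate(&stream[k]));
        CHECK(cudaMallocHost((void **)&h_ptr_in[k], size));   // pinned host memory
        CHECK(cudaMallocHost((void **)&h_ptr_out[k], size));
        for (i = 0; i < n; i++) h_ptr_in[k][i] = (float)i;
    }
    CHECK(cudaMalloc((void **)&g_in, size * no_streams));
    CHECK(cudaMalloc((void **)&g_out, size * no_streams));

    dim3 dimBlock(256);
    dim3 dimGrid((n + dimBlock.x - 1) / dimBlock.x);

    // Single loop: copy in, kernel, and copy out for stream k are all issued
    // before anything is issued for stream k+1.
    for (k = 0; k < no_streams; k++) {
        CHECK(cudaMemcpyAsync(g_in + k * n, h_ptr_in[k], size,
                              cudaMemcpyHostToDevice, stream[k]));
        mykernel<<<dimGrid, dimBlock, 0, stream[k]>>>(g_in + k * n, g_out + k * n, n);
        CHECK(cudaMemcpyAsync(h_ptr_out[k], g_out + k * n, size,
                              cudaMemcpyDeviceToHost, stream[k]));
    }

    // Wait for every stream to finish before reading the results.
    for (k = 0; k < no_streams; k++)
        CHECK(cudaStreamSynchronize(stream[k]));

    for (k = 0; k < no_streams; k++)
        for (i = 0; i < n; i++)
            if (h_ptr_out[k][i] != 2.0f * h_ptr_in[k][i]) errors++;

    for (k = 0; k < no_streams; k++) {
        cudaFreeHost(h_ptr_in[k]);
        cudaFreeHost(h_ptr_out[k]);
        cudaStreamDestroy(stream[k]);
    }
    cudaFree(g_in);
    cudaFree(g_out);
    return errors;
}

int main(void)
{
    int errors = run_single_loop();
    printf("%d mismatched elements\n", errors);
    return errors != 0;
}
```

The only difference from the three-loop version is the order in which the operations are issued to the device, which is what the profiler timeline reflects.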
Please help: what could be causing this?
I'm running on:
Ubuntu 10.04 x64
Device: "GeForce GTX 460" (CUDA Driver Version: 3.20, CUDA Runtime Version: 3.20, CUDA Capability Major/Minor version number: 2.1, Concurrent copy and execution: Yes, Concurrent kernel execution: Yes)