I have tried to overlap kernel execution with cudaMemcpyAsync, but it doesn't work. I follow all the recommendations in the Programming Guide: pinned memory, different streams, etc. Kernel executions do overlap with each other, but they do not overlap with memory transfers. I know my card has only one copy engine and one execution engine, but execution and transfers should still overlap, right?
It seems the copy engine and the execution engine always enforce the order in which I call the functions. The work consists of 4 streams, each performing [HtoD x2, kernel, DtoH]. If I issue the whole HtoD x2, kernel, DtoH sequence on each stream in turn (depth first), the profiler shows that the first HtoD of stream 2 does not start until the first DtoH has finished. If I instead issue the first HtoD on every stream, then the second HtoD, then the kernels, then the DtoH copies (breadth first), I see no overlap either, and the issue order is again enforced by the GPU. A simplified sketch of how I issue the work is shown below.
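For reference, this is roughly how I issue the work in the depth-first case. It is only a simplified sketch: the kernel body, buffer sizes, and launch configuration below are placeholders, not my real code.

```cpp
#include <cuda_runtime.h>

// Placeholder kernel, standing in for my real kernel.
__global__ void myKernel(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main()
{
    const int N = 1 << 20;                 // elements per stream (placeholder size)
    const int nStreams = 4;
    const size_t bytes = N * sizeof(float);

    // Pinned host buffers, as recommended for async transfers.
    float *hA, *hB, *hC;
    cudaMallocHost((void**)&hA, nStreams * bytes);
    cudaMallocHost((void**)&hB, nStreams * bytes);
    cudaMallocHost((void**)&hC, nStreams * bytes);

    float *dA, *dB, *dC;
    cudaMalloc((void**)&dA, nStreams * bytes);
    cudaMalloc((void**)&dB, nStreams * bytes);
    cudaMalloc((void**)&dC, nStreams * bytes);

    cudaStream_t stream[nStreams];
    for (int s = 0; s < nStreams; ++s)
        cudaStreamCreate(&stream[s]);

    // Depth first: issue the whole [HtoD x2, kernel, DtoH] sequence per stream.
    for (int s = 0; s < nStreams; ++s) {
        size_t off = (size_t)s * N;
        cudaMemcpyAsync(dA + off, hA + off, bytes, cudaMemcpyHostToDevice, stream[s]);
        cudaMemcpyAsync(dB + off, hB + off, bytes, cudaMemcpyHostToDevice, stream[s]);
        myKernel<<<(N + 255) / 256, 256, 0, stream[s]>>>(dA + off, dB + off, dC + off, N);
        cudaMemcpyAsync(hC + off, dC + off, bytes, cudaMemcpyDeviceToHost, stream[s]);
    }
    cudaDeviceSynchronize();

    for (int s = 0; s < nStreams; ++s)
        cudaStreamDestroy(stream[s]);
    cudaFreeHost(hA); cudaFreeHost(hB); cudaFreeHost(hC);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}
```

For the breadth-first case I simply split the issue loop into four separate loops over the streams, one per operation (first HtoD, second HtoD, kernel, DtoH).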
I have tried the simpleStreams example from the CUDA SDK and I see the same behavior there.
I attach some screen captures showing the issue in both Visual Profiler and Nsight for Visual Studio 2008.
P.S. I have not set the CUDA_LAUNCH_BLOCKING environment variable.
- simpleStreams, Visual Profiler
- My app, Nsight timeline, breadth first
- My app, Nsight timeline, depth first
Edit:
Adding 4 extra kernels (for a total of 2 HtoD, 5 kernels, 1 DtoH per stream): if I run nvprof with and without --concurrent-kernels-off, the elapsed time is the same. If I set the environment variable CUDA_LAUNCH_BLOCKING=1, I see a performance improvement (measured from the command line) of 7.5%!
System specification:
- Windows 7
- NVIDIA 6800 (display card) in the first PCI-E slot
- GTX 480 in the second PCI-E slot
- NVIDIA Driver: 306.94
- Visual Studio 2008
- CUDA v5.0
- Visual Profiler 5.0
- Nsight 3.0