I have tried to overlap kernel execution with cudaMemcpyAsync, but it doesn't work. I follow all the recommendations in the Programming Guide: pinned memory, different streams, etc. Kernel executions do overlap with each other, but they do not overlap with memory transfers. I know my card has only one copy engine and one execution engine, but execution and transfers should still overlap, right?
It seems the copy engine and the execution engine always enforce the order in which I call the functions. The work consists of 4 streams, each performing [HtoD x2, kernel, DtoH]. If I issue the whole HtoD x2, kernel, DtoH sequence on each stream in turn (depth first), the profiler shows that the first HtoD of stream 2 does not start until the first DtoH has finished. If I instead issue the first HtoD on every stream, then the second HtoD, then the kernels, then the DtoH copies (breadth first), I see no overlap either, and the issue order is again enforced by the GPU. A simplified sketch of how I issue the work is shown below.
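For reference, this is roughly how I issue the work in the depth-first case. It is only a simplified sketch: the kernel body, buffer sizes, and launch configuration below are placeholders, not my real code.

```cpp
#include <cuda_runtime.h>

// Placeholder kernel, standing in for my real kernel.
__global__ void myKernel(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main()
{
    const int N = 1 << 20;                 // elements per stream (placeholder size)
    const int nStreams = 4;
    const size_t bytes = N * sizeof(float);

    // Pinned host buffers, as recommended for async transfers.
    float *hA, *hB, *hC;
    cudaMallocHost((void**)&hA, nStreams * bytes);
    cudaMallocHost((void**)&hB, nStreams * bytes);
    cudaMallocHost((void**)&hC, nStreams * bytes);

    float *dA, *dB, *dC;
    cudaMalloc((void**)&dA, nStreams * bytes);
    cudaMalloc((void**)&dB, nStreams * bytes);
    cudaMalloc((void**)&dC, nStreams * bytes);

    cudaStream_t stream[nStreams];
    for (int s = 0; s < nStreams; ++s)
        cudaStreamCreate(&stream[s]);

    // Depth first: issue the whole [HtoD x2, kernel, DtoH] sequence per stream.
    for (int s = 0; s < nStreams; ++s) {
        size_t off = (size_t)s * N;
        cudaMemcpyAsync(dA + off, hA + off, bytes, cudaMemcpyHostToDevice, stream[s]);
        cudaMemcpyAsync(dB + off, hB + off, bytes, cudaMemcpyHostToDevice, stream[s]);
        myKernel<<<(N + 255) / 256, 256, 0, stream[s]>>>(dA + off, dB + off, dC + off, N);
        cudaMemcpyAsync(hC + off, dC + off, bytes, cudaMemcpyDeviceToHost, stream[s]);
    }
    cudaDeviceSynchronize();

    for (int s = 0; s < nStreams; ++s)
        cudaStreamDestroy(stream[s]);
    cudaFreeHost(hA); cudaFreeHost(hB); cudaFreeHost(hC);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}
```

For the breadth-first case I simply split the issue loop into four separate loops over the streams, one per operation (first HtoD, second HtoD, kernel, DtoH).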
I have tried the simpleStreams example from the CUDA SDK and I see the same behavior there.
I attach some screen captures showing the issue in both Visual Profiler and Nsight for Visual Studio 2008.
P.S. I have not set the CUDA_LAUNCH_BLOCKING environment variable.
- simpleStreams, Visual Profiler
- My app, Nsight timeline, breadth first
- My app, Nsight timeline, depth first
Edit:
Adding 4 extra kernels (for a total of 2 HtoD, 5 kernels, 1 DtoH per stream): if I run nvprof with and without --concurrent-kernels-off, the elapsed time is the same. If I set the environment variable CUDA_LAUNCH_BLOCKING=1, I see a performance improvement (measured from the command line) of 7.5%!
System specification:
- Windows 7
- NVIDIA 6800 (display card) in the first PCI-E slot
- GTX 480 in the second PCI-E slot
- NVIDIA Driver: 306.94
- Visual Studio 2008
- CUDA v5.0
- Visual Profiler 5.0
- Nsight 3.0