How to reduce CUDA synchronize latency / delay

This question is related to using CUDA streams to run many kernels.

In CUDA there are several synchronization commands: cudaStreamSynchronize, cudaDeviceSynchronize, and cudaThreadSynchronize, as well as cudaStreamQuery to check whether a stream is empty.

I noticed in the profiler that these synchronization commands introduce a large delay into the program. Does anyone know of a way to reduce this latency, apart from, of course, using as few synchronization commands as possible?

Also, are there any figures for judging the most efficient synchronization method? That is, consider three streams used in an application, two of which need to complete before I launch work in a fourth stream: should I use two cudaStreamSynchronize calls or just one cudaDeviceSynchronize, and which will incur less loss?
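For concreteness, here is a minimal sketch of that scenario (my own illustration; the kernel and stream names are placeholders):

```
#include <cuda_runtime.h>

__global__ void dummyKernel() { }

int main() {
    cudaStream_t streams[4];
    for (int i = 0; i < 4; ++i)
        cudaStreamCreate(&streams[i]);

    // Independent work in the first three streams.
    dummyKernel<<<1, 1, 0, streams[0]>>>();
    dummyKernel<<<1, 1, 0, streams[1]>>>();
    dummyKernel<<<1, 1, 0, streams[2]>>>();

    // Option A: wait only on the two streams that matter; stream 2 keeps running.
    cudaStreamSynchronize(streams[0]);
    cudaStreamSynchronize(streams[1]);

    // Option B: one coarse wait, but it also drains stream 2.
    // cudaDeviceSynchronize();

    // Work that depends on streams 0 and 1 having finished.
    dummyKernel<<<1, 1, 0, streams[3]>>>();

    cudaDeviceSynchronize();
    for (int i = 0; i < 4; ++i)
        cudaStreamDestroy(streams[i]);
    return 0;
}
```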

Fulminant answered 14/8, 2012 at 13:48 Comment(1)
cudaThreadSynchronize is deprecated. – Empiric

The main difference between the synchronization methods is whether they "poll" or "block."

"Polling" is the default mechanism for the driver to wait for the GPU - it waits for a 32-bit memory location to attain a certain value written by the GPU. It may return the wait more quickly after the wait is resolved, but while waiting, it burns a CPU core looking at that memory location.

"Blocking" can be requested by calling cudaSetDeviceFlags() with cudaDeviceScheduleBlockingSync, or calling cudaEventCreate() with cudaEventBlockingSync. Blocking waits cause the driver to insert a command into the DMA command buffer that signals an interrupt when all preceding commands in the buffer have been executed. The driver can then map the interrupt to a Windows event or a Linux file handle, enabling the synchronization commands to wait without constantly burning CPU, as do the default polling methods.

The queries are basically a manual check of that 32-bit memory location used for polling waits; so in most situations, they are very cheap. But if ECC is enabled, the query will dive into kernel mode to check if there are any ECC errors; and on Windows, any pending commands will be flushed to the driver (which requires a kernel thunk).
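A minimal sketch of using cudaStreamQuery this way, assuming the host has useful work to overlap with the GPU (doSomeCpuWork is a placeholder):

```
#include <cuda_runtime.h>

__global__ void work() { }

// Placeholder for useful host-side work done while the GPU is busy.
static void doSomeCpuWork() { }

int main() {
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    work<<<1, 1, 0, stream>>>();

    // cudaErrorNotReady means the stream still has pending commands;
    // cudaSuccess means it has drained.
    while (cudaStreamQuery(stream) == cudaErrorNotReady) {
        doSomeCpuWork();
    }

    cudaStreamDestroy(stream);
    return 0;
}
```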

Coatee answered 15/8, 2012 at 1:26 Comment(4)
It sounds like the difference between polling and blocking is that polling burns CPU time and blocking does not. However, there is no difference in the time taken for the sync itself to happen. In a situation where there is no work to be done by the CPU, they reduce to the same thing. Is that correct? – Fulminant
There may be time differences, because the interrupt handling adds latency. So in exchange for not burning CPU on the polling, you pay in the form of a longer time between the wait being resolved and the thread getting unblocked as a result. – Coatee
But what is the difference between cudaDeviceScheduleBlockingSync and cudaDeviceScheduleYield? cudaDeviceScheduleYield is documented as: "Instruct CUDA to yield its thread when waiting for results from the device. This can increase latency when waiting for the device, but can increase the performance of CPU threads performing work in parallel with the device." - i.e., it waits for the result without burning CPU in a spin, i.e., "blocking". And cudaDeviceScheduleBlockingSync also waits for the result without burning CPU in a spin. So what is the difference? – Counterplot
The yield option polls (repeatedly reads a memory location), but calls functions such as SwitchToThread during the polling to reduce the CPU overhead. msdn.microsoft.com/en-us/library/windows/desktop/… – Coatee
