This question is related to using cuda streams to run many kernels
In CUDA there are many synchronization commands cudaStreamSynchronize, CudaDeviceSynchronize, cudaThreadSynchronize, and also cudaStreamQuery to check if streams are empty.
I noticed when using the profiler that these synchronize commands introduce a large delay to the program. I was wondering if anyone knows any means to reduce this latency apart from of course using as few synchronisation commands as possible.
Also is there any figures to judge the most effecient synchronisation method. that is consider 3 streams used in an application and two of them need to complete for me to launch a forth streams should i use 2 cudaStreamSyncs or just one cudaDeviceSync what will incur less loss ?