Should CUDA stream be waited to be complete even if the output data are to be sent to OpenGL instead of CPU?

Asked 13/6, 2022 at 14:54 Answered 14/6, 2022 at 9:21

This is a general question, and although I use OpenCV as a framework, the question is broader than OpenCV's realm.

I am developing an image processing tool that will effectively get image from a webcam (yielding a host-memory located cv::Mat), upload it to a GPU device memory in CUDA (i.e. cv::GpuMat), do some processing using CUDA and get a result finalCudaMat, and finally send the result to OpenGL (i.e. cv::ogl::Buffer::mapDevice + finalCudaMat.copyTo(mappedOglBuffer)). Everything works as intended.

Since the whole process involves multiple steps, I use a CUDA stream object (cv::cuda::Stream) to be able to make CUDA calls asynchronous and not wait on every single operation to be finished on CPU side. Now if someone instead is to eventually copy the result to a CPU matrix (i.e. finalCudaMat.download(finalCpuMat)), as in a customary situation, typically a wait on the stream is required (cudaStream.waitForCompletion()) to ensure the result is ready before using the CPU side matrix.

In my case, the the result never gets back to the CPU as it continues to be rendered on the screen (a bit of OpenGL operations and shaders are also involved).

One way, it might be appropriate to wait for CUDA work to finish before starting to copy the GpuMat to OpenGL Buffer. So if I add the wait on stream, everything is working fine and the CUDA operations take ~2.5ms.
Another way, it feels like I don't need to wait for completion of the stream (all the results are consumed by the GPU anyway -- CPU is never invovled again). Therefore I can remove the cudaStream.waitForCompletion() call before performing finalCudaMat.copyTo(mappedOglBuffer), and everything seems to be working fine. The whole CUDA processing operation (basically any GPU task minus OpenGL related) apparently takes ~1.8ms for me.

In the past I have had bad experience of not properly synchronization GPU work if two different APIs were involved (e.g. do something on Direct3D 9, do not wait for it to finish, and then copy the resulting texture to a Direct3D 10 texture, and clearly on some frames the image becomes empty or torn).

At this point, the difference is tiny and doesn't affect my 60 FPS throughput. But I wonder if I am technically doing a correct work by removing the wait-on-stream operation. Any thoughts on this? Or maybe a document regarding OpenGL/CUDA interop that could help me?

Vaclava answered 13/6, 2022 at 14:54 Comment(0)

The rules are defined in this document: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#graphics-interoperability

In particular it says that

Accessing a resource through OpenGL, Direct3D, or another CUDA context while it is mapped produces undefined results.

That's a very strong hint that the needed synchronization is performed by cudaGraphicsUnmapResources, which is confirmed by its documentation:

This function provides the synchronization guarantee that any CUDA work issued in stream before cudaGraphicsUnmapResources() will complete before any subsequently issued graphics work begins.

So you won't need to make the CPU wait on CUDA completion, but you must call cudaGraphicsUnmapResources which will put the appropriate barrier in the asynchronous instruction stream. Note that unlike your CPU transfer code, this call goes after CUDA copies data into the OpenGL buffer.

Desiderate answered 13/6, 2022 at 16:24 Comment(3)

So I guess that answer of the title question is YES ? – Collard 14/6, 2022 at 8:23

@YunusTemurlenk: Not quite. There has to be a barrier / join / rendezvous between the different GPU kernels accessing the shared memory. There doesn't need to be a CPU wait (although waiting at the CPU might be the most portable and backward-compatible solution). – Desiderate 14/6, 2022 at 15:0

@BenVoigt Thanks for thorough explanation. I need to furthur see how and if this is handled internally in OpenCV during unmap... – Vaclava 14/6, 2022 at 17:46

As Ben Voigt already pointed out, CUDA requires explicit synchronization with OpenGL (or any other graphics API that interoperates with it). Now this used the be kind of a chore, where one had to submit callbacks to the compute stream and use them to manually work with e.g. OpenGL fences.

However due to the advent of Vulkan and with it the support for external resources (and OpenGL extensions for that) you can in fact synchonize between CUDA and OpenGL command streams, by having both sides import platform native semaphores (cudaImportExternalSemaphore, GL_EXT_semaphore) and use them for mutual synchronization. It usually still involves a whole round trip through the CPU side driver, but since that part has to manage the command streams anyway it's not really an issue of efficiency.

Aaronson answered 14/6, 2022 at 9:21 Comment(0)

Recommended topics

Hot tags