Memory barrier fails to sync between compute stage and data access by CUDA

I have the following pipeline:

Render into texture attachment to custom FBO.
Bind that texture attachment as image.
Run compute shader ,sampling from the image above using imageLoad/Store.
Write the results into SSBO or image.
Map the SSBO (or image) as CUDA CUgraphicsResource and process the data from that buffer using CUDA.

Now,the problem is in synchronizing data between the stages 4 and 5. Here are the sync solutions I have tried.

glFlush - doesn't really work as it doesn't guarantee the completeness of the execution of all the commands.

glFinish - this one works. But it is not recommended as it finalizes all the commands submitted to the driver.

ARB_sync Here it is said it is not recommended because it heavily impacts performance.

glMemoryBarrier This one is interesting. But it simply doesn't work.

Here is example of the code:

glMemoryBarrier(GL_ALL_BARRIER_BITS);

And also tried:

glTextureBarrierNV()

The code execution goes like this:

 //rendered into the fbo...
  glBindFramebuffer(GL_READ_FRAMEBUFFER, fbo);
  glBindImageTexture(imageUnit1, fboTex, 0, GL_FALSE, 0, GL_READ_ONLY,GL_RGBA8); 
  glBindImageTexture(imageUnit2, imageTex, 0, GL_FALSE, 0, GL_WRITE_ONLY, GL_RGBA8));
  glDispatchCompute(16, 16, 1);

  glFinish(); // <-- must sync here,otherwise cuda buffer doesn't receive all the data

 //cuda maps the image to CUDA buffer here..

Moreover, I tried unbinding FBOs and unbinding textures from the context before launching compute, I even tried to launch one compute after other with a glMemoryBarrier set between them, and fetching the target image from the first compute launch to CUDA. Still no synch. (Well,that makes sense as two computes also run out of sync with each other)

after the compute shader stage. It doesn't sync! Only when I replace with glFinish,or with any other operation which completely stall the pipeline. Like glMapBuffer(), for example.

~~So should I just use glFinish(), or I am missing something here? Why glMemoryBarrier() doesn't sync compute shader work before CUDA takes over the control?~~

UPDATE

I would like to refactor the question a little bit as the original one is pretty old. Nevertheless, even with the latest CUDA and Video Codec SDK (NVENC) the issue is still alive.So, I don't care about why glMemoryBarrier doesn't sync. What I want to know is:

If it is possible to synchronize OpenGL compute shader execution finish with CUDA's usage of that shared resource without stalling the whole rendering pipeline, which is in my case OpenGL image.
If the answer is 'yes', then how?

.... glBindFramebuffer(GL_READ_FRAMEBUFFER, fbo); glBindImageTexture(imageUnit1, fboTex, 0, GL_FALSE, 0, GL_READ_ONLY,GL_RGBA8); glBindImageTexture(imageUnit2, imageTex, 0, GL_FALSE, 0, GL_WRITE_ONLY, GL_RGBA8)); glDispatchCompute(16, 16, 1); GLsync fence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0); ... other work you might want to do that does not impact the buffer... GLenum res = glClientWaitSync(fence, GL_SYNC_FLUSH_COMMANDS_BIT, timeoutInNs); if(res == GL_TIMEOUT_EXPIRED || res == GL_WAIT_FAILED) { ...handle timeouts and failures } cudaGraphicsMapResources(1, &gfxResource, stream); ...

Recommended topics

Hot tags