Any better way to synchronize all invocations of a compute shader?
I'm implementing an algorithm like the following in a compute shader:

  1. for every pixel in the image
    • calculate something and store it to a temp image
  2. for every pixel in the image
    • wait until all its 8 nearby pixels have finished step 1
    • read the data from the temp image corresponding to its 8 nearby pixels
    • use them to calculate the result

My work group size is declared as layout (local_size_x = 256) in; and I dispatch with glDispatchCompute(1, 256, 1);
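For context, the host side looks roughly like this (a minimal sketch; the program and buffer names are placeholders, not from my real code):

glUseProgram(computeProgram);                    // the shader shown further below
glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 14, tagSSBO);
glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 15, tempSSBO);
glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 16, lockSSBO);
glDispatchCompute(1, 256, 1);                    // 256 work groups of 256 invocations = 256x256 pixels
glMemoryBarrier(GL_ALL_BARRIER_BITS);            // host-side barrier between dispatches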
Before reading the temp image in step 2, every pixel requires that all its 8 neighbours have finished step 1. So I put a memoryBarrier() between step 1 and step 2, since the OpenGL Programming Guide, 8th Edition says memory barrier functions apply globally, not just within the same local work group.
But this does not work as expected.

To demonstrate the result, consider a simplified but similar problem:

  1. draw a black rectangle on a white image
  2. for every pixel in the image
    • if it's black, store 1 to temp image
    • else store 0 to temp image
  3. for every pixel in the image
    • if it's black or at least one of its 8 nearby pixels is black, set itself to black

This should cause the black rectangle to grow larger and larger. But in fact, the rectangle goes out of shape as it grows.

So, does memoryBarrier() really wait until all invocations triggered by the same glDispatchCompute call have finished their memory accesses?

After I implemented a lock between steps 2 and 3, the result was as expected. (But later I found that it sometimes crashes the program by exceeding the Windows timeout limit! http://nvidia.custhelp.com/app/answers/detail/a_id/3007)
(p is the current location, and p+e[i] are the locations of its 8 nearby pixels. Instead of image variables, I use a shader storage buffer object, so I added a function posi() to convert an ivec2 to an array index.)

// spin until all 8 neighbours have published their lock flag
bool finished;
do
{
    finished = true;
    for(int i = 1; i < 9; i++)
    {
        if(!outOfBound(p+e[i]) && lock[posi(p+e[i])] != 1)
        {
            finished = false;
        }
    }
} while(!finished);

If I have misunderstood memoryBarrier() and it can't do what I want, is there any better way to synchronize the invocations of a compute shader?

Update: compute shader code added

Here is my compute shader code for the black rectangle example described above.
tag is actually an image used to tell whether the color of a pixel is black or white; it's initialized to a small black rectangle on a white background. temp is set to all zeros before I run this compute shader. The commented-out code is the lock described above; with that lock, the shader gives the desired output.

#version 430 core

layout (local_size_x = 256) in;

const ivec2 e[9] = {
    ivec2(0,0),
    ivec2(1,0), ivec2(0,1), ivec2(-1,0), ivec2(0,-1), 
    ivec2(1,1), ivec2(-1,1), ivec2(-1,-1), ivec2(1,-1)
};

layout(std430, binding = 14) coherent buffer tag_buff
{
    int tag[];
};
layout(std430, binding = 15) coherent buffer temp_buff
{
    int temp[];
};
layout(std430, binding = 16) coherent buffer lock_buff
{
    int lock[];
};

int posi(ivec2 point)
{
    return point.y * 256 + point.x;
}

bool outOfBound(ivec2 p)
{
    return p.x < 0 || p.x >= 256
        || p.y < 0 || p.y >= 256;
}

void main()
{
    ivec2 p = ivec2(gl_GlobalInvocationID.xy);

    int x = tag[posi(p)];
    temp[posi(p)] = x;      // publish this pixel's black/white flag
    //lock[posi(p)] = 1;

    memoryBarrier();        // intended to make the temp writes visible everywhere

    //bool finished;
    //do
    //{
    //    finished = true;
    //    for(int i = 1; i < 9; i++)
    //    {
    //        if(!outOfBound(p+e[i]) && lock[posi(p+e[i])] != 1)
    //        {
    //            finished = false;
    //        }
    //    }
    //}while(!finished);

    // if it's black or at least one of its 8 nearby pixel is black
    // set itself to black
    for(int i = 0; i < 9; i++)
    {
        if(!outOfBound(p+e[i]) && temp[posi(p+e[i])] == 1)
        {
            tag[posi(p)] = 1;
        }
    }
}
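For reference, tag and temp are initialized on the host roughly like this (a sketch; the buffer names and the rectangle position are placeholders):

static int tagInit[256 * 256];                   // zero-initialized: all white
for (int y = 96; y < 160; ++y)                   // a small black rectangle
    for (int x = 96; x < 160; ++x)
        tagInit[y * 256 + x] = 1;
glBindBuffer(GL_SHADER_STORAGE_BUFFER, tagSSBO);
glBufferData(GL_SHADER_STORAGE_BUFFER, sizeof(tagInit), tagInit, GL_DYNAMIC_COPY);

static int zeros[256 * 256];                     // temp starts all zero
glBindBuffer(GL_SHADER_STORAGE_BUFFER, tempSSBO);
glBufferData(GL_SHADER_STORAGE_BUFFER, sizeof(zeros), zeros, GL_DYNAMIC_COPY);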

Later I tried storing lock into another SSBO after setting its elements to 1 and calling memoryBarrier(), and then loading the new SSBO in a fragment shader and printing it to the screen. From that I found that some elements of lock had not been set to 1. I also tried image variables instead of SSBOs in the fragment shader and compute shader, only to find that memoryBarrier and coherent didn't change anything. It just seems that memoryBarrier or coherent doesn't work.

After reading several materials, it seems that I now know what's happening here; I post my understanding below. If it's not true, please correct me.

memoryBarrier can't synchronize invocations by synchronizing memory accesses. More specifically, all memoryBarrier does is wait for the completion of memory accesses that have already happened in the calling invocation. It does not wait for memory-accessing code that has not yet executed, even if that code comes before the memoryBarrier in the source.

The OpenGL Programming Guide says: When memoryBarrier() is called, it ensures that any writes to memory that have been performed by the shader invocation have been committed to memory rather than lingering in caches or being scheduled after the call to memoryBarrier().

That means, for example, assuming there are three invocations: if invocations A and B have both run imageStore() on a coherent image variable, then a following memoryBarrier in A or B guarantees that this imageStore() has changed the data in main memory, not just the cache. But if invocation C has not run its imageStore() when A or B calls memoryBarrier, the memoryBarrier call will not wait for C to run it. So memoryBarrier can't help me implement the algorithm.
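In terms of my shader above, I believe the situation is this (a sketch; the comments state my understanding):

temp[posi(p)] = x;             // this invocation's own write
memoryBarrier();               // guaranteed: the write above reaches memory
                               // NOT guaranteed: other invocations have already
                               // executed their own temp[posi(p)] = x;
int n = temp[posi(p + e[1])];  // so this can still read a stale 0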

Wacke answered 7/6, 2014 at 6:33 Comment(9)
You should declare your buffers coherent, because otherwise changes made in another invocation are not guaranteed to be visible. Each invocation may maintain its own separate cache without that qualifier. I could not begin to explain why your weird locking mechanism fixes the problem, but you definitely need coherency here. – Mischief
@AndonM.Coleman Sorry for my confusing words; that code is not in a separate shader, it's a barrier I made myself. – Wacke
If you add coherent to your SSB declarations (e.g. layout(std430, binding = 14) coherent buffer tag_buff), does that change anything with your memory barrier? SSBs are not coherent by default. – Mischief
@AndonM.Coleman It does not fix the problem. I'm wondering whether the effect of the last change to tag is simply not visible to subsequent shaders. But I have added glMemoryBarrier(GL_ALL_BARRIER_BITS) after glDispatchCompute. – Wacke
The problem is that (image16-c.poco.cn/mypoco/myphoto/20140607/16/…) the top edge of the rectangle stops being horizontal as it gets larger. – Wacke
@AndonM.Coleman Why add coherent to tag_buff but not to temp_buff? So temp_buff doesn't need to be declared coherent and memoryBarrier will make it visible to other compute shader invocations? – Wacke
No, I meant for you to add coherent to every one of your buffers. I just did not want to write a long comment that showed all that. – Mischief
Let us continue this discussion in chat. – Wacke
@AndonM.Coleman I added my understanding at the end of my question. Is it correct? – Wacke
I stumbled across a similar problem. I am no expert, but I believe I found a good solution.

You correctly identified memoryBarrier as necessary to ensure visibility of previous writes.

However, on its own memoryBarrier is nearly useless, because it does not ensure execution ordering. So although you have a memoryBarrier, some invocations may fully finish before others even start to run. memoryBarrier cannot make writes visible that have not yet happened.

We have barrier to remedy this:

For any given static instance of barrier in a compute shader, all invocations within a single work group must enter it before any are allowed to continue beyond it.

Note the emphasis: barrier does not help you synchronize across work groups within one glDispatchCompute call; it only synchronizes within a work group.
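To illustrate, within a single work group the usual pattern looks like this (a minimal sketch using shared memory; loadValue and out_buff are placeholders for whatever produces and consumes the data):

#version 430 core

layout (local_size_x = 256) in;

layout(std430, binding = 0) buffer out_buff
{
    int result[];
};

shared int row[256];

int loadValue(uint i) { return int(i); }   // placeholder producer

void main()
{
    uint i = gl_LocalInvocationID.x;
    row[i] = loadValue(i);                 // step 1: every invocation writes its slot
    memoryBarrierShared();                 // make the shared writes visible...
    barrier();                             // ...and wait until all 256 invocations arrive
    int left  = row[(i + 255u) & 255u];    // step 2: reading neighbours is now safe
    int right = row[(i + 1u) & 255u];
    result[gl_GlobalInvocationID.x] = left + right;
}

This only works because every neighbour lives in the same work group; as soon as a neighbour belongs to another group, barrier gives you nothing.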

Obviously, barrier does not help with your problem, so you introduced your own barrier, which has two disadvantages:

  1. The compiler/driver/scheduler does not know it is a barrier, so it can't optimize around it.
  2. Your barrier is a spin lock, which hogs the processor. This increases the running time until the watchdog timer triggers.

If the driver knew about the barrier, it could preferentially schedule the invocations that have not yet reached it. In your solution, the driver blindly schedules all invocations, wasting resources on the ones that are already waiting instead of running those that have not yet reached the barrier.

What to do instead?

Solution

To achieve a barrier across all invocations, just issue multiple glDispatchCompute calls interleaved with appropriate glMemoryBarrier calls.

The separation into multiple glDispatchCompute calls creates the execution barrier between them; glMemoryBarrier makes the writes of the previous dispatch visible to the later ones.
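In your case that means splitting the shader at the memoryBarrier() into two programs and dispatching them back to back, roughly like this (a sketch; the program names are placeholders):

glUseProgram(writeTempProgram);                  // step 1: temp[posi(p)] = tag[posi(p)]
glDispatchCompute(1, 256, 1);
glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT);  // temp writes visible to the next dispatch

glUseProgram(dilateProgram);                     // step 2: read temp neighbours, write tag
glDispatchCompute(1, 256, 1);
glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT);  // tag writes visible to whatever comes next

The end of each glDispatchCompute acts as the global execution barrier you were trying to build inside the shader; repeat the pair once per dilation step.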

Fag answered 9/7, 2015 at 9:44 Comment(0)
