I'm implementing an algorithm like the following in a compute shader:
- step 1: for every pixel in the image
  - calculate something and store it to a temp image
- step 2: for every pixel in the image
  - wait until all 8 of its nearby pixels have finished step 1
  - read the data from the temp image corresponding to its 8 nearby pixels
  - use them to calculate the result
My work group size is set with layout (local_size_x = 256) in;
and I dispatch with glDispatchCompute(1, 256, 1);
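To make these settings concrete, here is a minimal CPU-side sketch in Python (not GLSL) of how they map invocations to pixels. It relies on the GLSL rule that gl_GlobalInvocationID equals gl_WorkGroupID * gl_WorkGroupSize + gl_LocalInvocationID. The helper names global_invocation_id and work_group_of are hypothetical and exist only for this illustration.

```python
LOCAL_SIZE = (256, 1, 1)   # layout (local_size_x = 256) in;
NUM_GROUPS = (1, 256, 1)   # glDispatchCompute(1, 256, 1);


def global_invocation_id(work_group_id, local_invocation_id):
    """Per component: gl_WorkGroupID * gl_WorkGroupSize + gl_LocalInvocationID."""
    return tuple(
        g * s + l
        for g, s, l in zip(work_group_id, LOCAL_SIZE, local_invocation_id)
    )


def work_group_of(x, y):
    """Hypothetical helper: which work group (x, y) handles pixel (x, y)."""
    return (x // LOCAL_SIZE[0], y // LOCAL_SIZE[1])


# Total number of invocations launched by the dispatch.
total_invocations = 1
for groups, size in zip(NUM_GROUPS, LOCAL_SIZE):
    total_invocations *= groups * size

# Local invocation 17 of work group (0, 5, 0) handles pixel (17, 5).
pixel = global_invocation_id((0, 5, 0), (17, 0, 0))

# Work groups that own the 3x3 neighbourhood of that pixel.
neighbour_groups = {
    work_group_of(pixel[0] + dx, pixel[1] + dy)
    for dy in (-1, 0, 1)
    for dx in (-1, 0, 1)
}

print(total_invocations, pixel, sorted(neighbour_groups))
```

With these settings each work group covers one image row, so the pixels directly above and below a given pixel are handled by other work groups.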
Before reading the temp image in step 2, every pixel requires that all 8 of its neighbours have finished step 1. So I put a memoryBarrier() between step 1 and step 2, since the OpenGL Programming Guide, 8th Edition says memory barrier functions apply globally, not just within the same local work group.
But this does not work as expected.
To demonstrate the result, consider a simplified but similar problem:
- draw a black rectangle on a white image
- for every pixel in the image
  - if it's black, store 1 to temp image
  - else store 0 to temp image
- for every pixel in the image
  - if it's black or at least one of its 8 nearby pixels is black, set itself to black
This should make the black rectangle grow larger and larger. Instead, the rectangle loses its shape as it grows.
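For reference, the expected result of one pass can be sketched on the CPU. This is a minimal Python model, not the shader itself: it assumes a row-major list of 0/1 values (1 = black) and a hypothetical function name dilate_two_pass, and it finishes step 1 for every pixel before any pixel starts step 2.

```python
def dilate_two_pass(tag, width, height):
    """Reference result: all of step 1 completes before step 2 begins."""
    # Step 1: copy every pixel's colour (1 = black, 0 = white) into temp.
    temp = list(tag)

    # Step 2: a pixel turns black if it or any of its 8 neighbours
    # is black in temp.
    out = list(tag)
    for y in range(height):
        for x in range(width):
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    nx, ny = x + dx, y + dy
                    inside = 0 <= nx < width and 0 <= ny < height
                    if inside and temp[ny * width + nx] == 1:
                        out[y * width + x] = 1
    return out


# A single black pixel in the middle of a 5x5 white image.
image = [0] * 25
image[2 * 5 + 2] = 1

once = dilate_two_pass(image, 5, 5)
twice = dilate_two_pass(once, 5, 5)
print(sum(once), sum(twice))  # 9 25
```

Each pass grows the black region by exactly one pixel in every direction, so a rectangle stays a rectangle.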
So, does memoryBarrier() really wait until all invocations triggered by the same glDispatchCompute call finish their memory access?
After I implemented a lock between step 2 and 3, the result was as expected. (But later I found that it sometimes crashes the program by exceeding the Windows time-out limit: http://nvidia.custhelp.com/app/answers/detail/a_id/3007)
(p is the current location and p+e[i] are the locations of its 8 nearby pixels. Instead of image variables I use shader storage buffer objects, so I added a function posi() to convert an ivec2 to an array index.)
    bool finished;
    do
    {
        finished = true;
        for(int i = 1; i < 9; i++)
        {
            if(!outOfBound(p+e[i]) && lock[posi(p+e[i])] != 1)
            {
                finished = false;
            }
        }
    }while(!finished);
If I have misunderstood memoryBarrier() and it can't do what I want, is there a better way to synchronize the invocations of a compute shader?
Update: adding the compute shader code
Here is my compute shader code for the black rectangle example described above:
tag is the image data used to tell whether a pixel is black or white. It is initialized to a small black rectangle on a white background.
temp is set to zero before I run this compute shader.
The commented-out code is the lock described above. With this lock, the shader gives the desired output.
    #version 430 core

    layout (local_size_x = 256) in;

    // e[0] is the pixel itself, e[1..8] are its 8 nearby pixels
    const ivec2 e[9] = {
        ivec2(0,0),
        ivec2(1,0), ivec2(0,1), ivec2(-1,0), ivec2(0,-1),
        ivec2(1,1), ivec2(-1,1), ivec2(-1,-1), ivec2(1,-1)
    };

    layout(std430, binding = 14) coherent buffer tag_buff
    {
        int tag[];
    };

    layout(std430, binding = 15) coherent buffer temp_buff
    {
        int temp[];
    };

    layout(std430, binding = 16) coherent buffer lock_buff
    {
        int lock[];
    };

    // convert a pixel location to an index into the 256x256 buffers
    int posi(ivec2 point)
    {
        return point.y * 256 + point.x;
    }

    bool outOfBound(ivec2 p)
    {
        return p.x < 0 || p.x >= 256
            || p.y < 0 || p.y >= 256;
    }

    void main()
    {
        ivec2 p = ivec2(gl_GlobalInvocationID.xy);

        // store the pixel's own colour to temp
        int x = tag[posi(p)];
        temp[posi(p)] = x;
        //lock[posi(p)] = 1;

        memoryBarrier();

        //bool finished;
        //do
        //{
        //    finished = true;
        //    for(int i = 1; i < 9; i++)
        //    {
        //        if(!outOfBound(p+e[i]) && lock[posi(p+e[i])] != 1)
        //        {
        //            finished = false;
        //        }
        //    }
        //}while(!finished);

        // if it's black or at least one of its 8 nearby pixels is black,
        // set itself to black
        for(int i = 0; i < 9; i++)
        {
            if(!outOfBound(p+e[i]) && temp[posi(p+e[i])] == 1)
            {
                tag[posi(p)] = 1;
            }
        }
    }
Later I tried copying lock into another SSBO after setting its elements to 1 and calling memoryBarrier(), then loading the new SSBO in a fragment shader and drawing it to the screen. From that I found that some elements of lock had not been set to 1.
I also tried using image variables instead of SSBOs in the fragment shader and the compute shader, only to find that memoryBarrier() and coherent changed nothing. It seems as if memoryBarrier() and coherent simply don't work.
After reading several references, I think I know what's happening here. I post my understanding below; if it's wrong, please correct me.
memoryBarrier() can't synchronize invocations by synchronizing memory accesses. More specifically, all memoryBarrier() does is wait for the completion of the memory accesses that the calling invocation has already performed. It does not wait for memory accesses in other invocations that have not executed yet, even if they appear before memoryBarrier() in the source code. The OpenGL Programming Guide says: "When memoryBarrier() is called, it ensures that any writes to memory that have been performed by the shader invocation have been committed to memory rather than lingering in caches or being scheduled after the call to memoryBarrier()." That means, for example, with three invocations A, B and C: if both A and B have run imageStore() on a coherent image variable, then a following memoryBarrier() in A or B guarantees that this imageStore() has changed the data in main memory, not just in the cache. But if invocation C has not yet run its imageStore() when A or B calls memoryBarrier(), the call will not wait for C to run it. So memoryBarrier() can't help me implement the algorithm.
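This understanding can be illustrated with a small CPU model in Python. It is only a sketch under loud assumptions: invocations run one after another in row-major order (one possible schedule, not what a GPU necessarily does), each runs step 1 and step 2 back to back, and the function name dilate_unsynchronized is hypothetical. The barrier is modelled as making the invocation's own write visible and waiting for nobody else.

```python
def dilate_unsynchronized(tag, width, height, order):
    """Each invocation does step 1 then step 2 without waiting for others."""
    temp = [0] * (width * height)  # temp is zeroed before the dispatch
    out = list(tag)
    for x, y in order:
        # Step 1 of this invocation only: publish its own colour.
        temp[y * width + x] = tag[y * width + x]
        # memoryBarrier() would sit here: the write above is now visible,
        # but invocations later in `order` have not written temp yet.
        # Step 2: read the 3x3 neighbourhood from temp.
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                nx, ny = x + dx, y + dy
                inside = 0 <= nx < width and 0 <= ny < height
                if inside and temp[ny * width + nx] == 1:
                    out[y * width + x] = 1
    return out


# A single black pixel in the middle of a 5x5 white image.
image = [0] * 25
image[2 * 5 + 2] = 1

row_major = [(x, y) for y in range(5) for x in range(5)]
result = dilate_unsynchronized(image, 5, 5, row_major)
print(sum(result))  # 5 black pixels instead of the full 3x3 block of 9
```

Pixels that ran before the black pixel, such as (1, 1), still saw a zero in temp and stayed white, while pixels that ran after it, such as (3, 3), turned black. The grown shape is lopsided, which matches the distorted rectangle.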
… coherent, because otherwise changes made in another invocation are not guaranteed to be visible. Each invocation may maintain its own separate cache without that qualifier. I could not begin to explain why your weird locking mechanism fixes the problem, but you definitely need coherency here. – Mischief
… coherent to your SSB declarations (e.g. layout(std430, binding = 14) coherent buffer tag_buff) does that change anything with your memory barrier? SSBs are not coherent by default. – Mischief
… coherent to every one of your buffers. I just did not want to write a long comment that showed all that. – Mischief