Example code
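Here is a bare-bones compute shader to illustrate my question:

```glsl
#version 430
layout(local_size_x = 64) in;

// Persistent LIFO structure with a count of elements
layout(std430, binding = 0) restrict buffer SMyBuffer
{
    uint count;
    float data[];
} MyBuffer;

// Hypothetical: loop bound supplied by the application
uniform uint some_end_condition;

// Defined elsewhere (e.g. in another shader object linked into the program)
bool AddDataElement(uint i);
float ComputeDataElement(uint i);

void main()
{
    // Stride by the total number of invocations so that each i is visited
    // exactly once, whether one or several workgroups are dispatched
    uint stride = gl_NumWorkGroups.x * gl_WorkGroupSize.x;
    for (uint i = gl_GlobalInvocationID.x; i < some_end_condition; i += stride)
    {
        if (AddDataElement(i))
        {
            // We want to store this data piece in the next available free space
            uint dataIndex = atomicAdd(MyBuffer.count, 1u);
            // [1] memoryBarrierBuffer() ?
            MyBuffer.data[dataIndex] = ComputeDataElement(i);
        }
    }
}
```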
Explanation
`SMyBuffer` is a stack of elements (`data[]`) with a `count` of the current number of elements. When a certain condition is met, the compute shader increments `count` atomically. `atomicAdd()` returns the counter's previous value, which is used as the index into `data[]` at which the new element is stored. This guarantees that no two shader invocations overwrite each other's elements.
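The slot-reservation argument can be checked outside the GPU with a CPU-side analogue in C++. This is not GLSL and not the shader's actual behavior. `std::atomic<uint32_t>::fetch_add` stands in for `atomicAdd(MyBuffer.count, 1u)`, threads stand in for shader invocations, and the names `AppendStack` and `countDistinctStored` are hypothetical. Each `fetch_add` returns a different previous value, so every thread writes to a slot that no other thread receives, and all pushed values should survive.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// CPU analogue of the append pattern (not GLSL): fetch_add plays the role of
// atomicAdd(MyBuffer.count, 1u); each caller writes only the slot it reserved.
struct AppendStack {
    std::atomic<uint32_t> count{0};
    std::vector<float> data;
    explicit AppendStack(std::size_t capacity) : data(capacity) {}

    void push(float value) {
        uint32_t slot = count.fetch_add(1);    // previous value = this caller's slot
        assert(slot < data.size() && "capacity exceeded");
        data[slot] = value;                    // no other caller receives this slot
    }
};

// Pushes numThreads * perThread distinct values (0, 1, 2, ...) from concurrent
// threads and returns how many of them appear exactly once in the stack.
std::size_t countDistinctStored(unsigned numThreads, unsigned perThread) {
    AppendStack stack(std::size_t(numThreads) * perThread);
    std::vector<std::thread> threads;
    for (unsigned t = 0; t < numThreads; ++t)
        threads.emplace_back([&stack, t, perThread] {
            for (unsigned k = 0; k < perThread; ++k)
                stack.push(static_cast<float>(t * perThread + k));
        });
    for (auto& th : threads) th.join();

    // Tally each stored value; an overwrite would leave some value missing.
    std::vector<int> seen(stack.data.size(), 0);
    for (float v : stack.data) ++seen[static_cast<std::size_t>(v)];
    std::size_t stored = 0;
    for (int s : seen)
        if (s == 1) ++stored;
    return stored;
}
```

For example, `countDistinctStored(8, 1000)` should return 8000. The capacity `assert` has no counterpart in the original shader. A GPU version would need a similar check against the length of `data[]`.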
Another compute shader later pops values from this stack and uses them. `glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT)` is of course required between the two compute shader dispatches.
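The two-dispatch structure can be modelled the same way, again as a CPU analogue in C++ rather than GL code. Producer threads push. Joining them plays roughly the role that `glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT)` plays between the dispatches, because every producer write is visible to threads started afterwards. Consumer threads then pop with `fetch_sub`. `LifoBuffer` and `produceThenConsume` are hypothetical names. The pop assumes that no push runs concurrently, which the phase separation guarantees.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// CPU analogue of two dispatches separated by a barrier (not GL code).
struct LifoBuffer {
    std::atomic<int32_t> count{0};
    std::vector<float> data;
    explicit LifoBuffer(std::size_t capacity) : data(capacity) {}

    void push(float v) {
        int32_t slot = count.fetch_add(1);
        assert(slot >= 0 && static_cast<std::size_t>(slot) < data.size());
        data[slot] = v;
    }

    // Only valid in a consume-only phase: no push may run concurrently.
    bool pop(float& out) {
        int32_t old = count.fetch_sub(1);
        if (old <= 0) return false;            // stack was already empty
        out = data[old - 1];
        return true;
    }
};

// Phase 1 pushes 0 .. threads*perThread-1; phase 2 pops everything and sums it.
double produceThenConsume(unsigned threads, unsigned perThread) {
    LifoBuffer buf(std::size_t(threads) * perThread);
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < threads; ++t)
        pool.emplace_back([&buf, t, perThread] {
            for (unsigned k = 0; k < perThread; ++k)
                buf.push(static_cast<float>(t * perThread + k));
        });
    for (auto& th : pool) th.join();           // stands in for the barrier
    pool.clear();

    std::vector<double> partial(threads, 0.0);
    for (unsigned t = 0; t < threads; ++t)
        pool.emplace_back([&buf, &partial, t] {
            float v;
            while (buf.pop(v)) partial[t] += v;
        });
    for (auto& th : pool) th.join();

    double total = 0.0;
    for (double p : partial) total += p;
    return total;
}
```

Here 4 threads push 1000 values each (0 to 3999). `produceThenConsume(4, 1000)` should return their sum, 7998000.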
Question
All of this works fine, but I'm wondering whether I'm just getting lucky with timing, so I want to validate my usage of the API.
So, is anything else required to make sure that the counter stored in the SSBO works correctly (see `[1]` in the code)? I expect `atomicAdd()` to take care of the memory synchronization, because otherwise it would make little sense: what would be the point of an atomic operation whose effect is only visible to a single thread?
Regarding memory barriers, the OpenGL wiki states:

> Note that atomic counters are different functionally from atomic image/buffer variable operations. The latter still need coherent qualifiers, barriers, and the like.
This leaves me wondering whether I have misunderstood something and a `memoryBarrierBuffer()` is actually required. But if that is the case, what is to stop two invocations from both performing `atomicAdd()` before one of them reaches the subsequent `memoryBarrierBuffer()`?
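The scenario in the last question can be checked exhaustively with a small sequential model in C++. In that scenario, two invocations both perform their increment before either performs its write. The names `Invocation`, `Result` and `checkAllInterleavings` are hypothetical. Each invocation does an indivisible Reserve step (the `atomicAdd`) followed by a Write step (the store into `data[]`), and the sketch enumerates all six interleavings of two invocations.

The model assumes what the GLSL specification describes for atomic memory functions: no other invocation can modify the target between the atomic read and write. It checks only that atomicity. It says nothing about when the stored data becomes visible to other invocations. That visibility is what the coherent qualifiers and barriers in the wiki quote concern.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstdint>

// Sequential model of two shader invocations (not GLSL). Each invocation runs
// Reserve (the atomicAdd) and then Write (the store into data[]).
struct Invocation {
    uint32_t slot = 0;   // index returned by the modelled atomicAdd
    int step = 0;        // 0 = Reserve next, 1 = Write next
    float value = 0.0f;  // value this invocation wants to store
};

struct Result {
    int interleavings = 0;  // schedules examined
    int safe = 0;           // schedules in which both values survive
};

Result checkAllInterleavings() {
    // A schedule lists which invocation takes the next step; every ordering
    // of {0, 0, 1, 1} is a valid interleaving of two two-step invocations.
    std::array<int, 4> schedule{0, 0, 1, 1};
    Result r;
    do {
        uint32_t counter = 0;
        std::array<float, 2> data{-1.0f, -1.0f};
        std::array<Invocation, 2> inv{};
        inv[0].value = 10.0f;
        inv[1].value = 20.0f;
        for (int who : schedule) {
            Invocation& me = inv[who];
            if (me.step == 0) {
                me.slot = counter++;            // modelled as one indivisible step
            } else {
                assert(me.slot < data.size());
                data[me.slot] = me.value;
            }
            ++me.step;
        }
        ++r.interleavings;
        if (std::count(data.begin(), data.end(), 10.0f) == 1 &&
            std::count(data.begin(), data.end(), 20.0f) == 1)
            ++r.safe;
    } while (std::next_permutation(schedule.begin(), schedule.end()));
    return r;
}
```

`checkAllInterleavings()` should report 6 interleavings, all 6 safe. That includes the four in which both Reserve steps run before either Write.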
Also, does the answer depend on whether `glDispatchCompute()` dispatches a single workgroup or several?
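Separately from the synchronization question, the number of workgroups matters for the loop itself. A grid-stride loop visits each `i` exactly once only when its stride equals the total number of invocations, `gl_NumWorkGroups.x * gl_WorkGroupSize.x`. A stride of `gl_WorkGroupSize.x` is correct only for a single workgroup. With more, workgroups revisit each other's indices and would push duplicate elements. The C++ sketch below (hypothetical names `visitCounts` and `coversEachOnce`) counts visits per index for a given dispatch size and stride.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Counts how many times each index in [0, end) is visited by
//   for (i = gl_GlobalInvocationID.x; i < end; i += stride)
// when numGroups workgroups of groupSize invocations are dispatched.
std::vector<uint32_t> visitCounts(uint32_t numGroups, uint32_t groupSize,
                                  uint32_t end, uint32_t stride) {
    assert(stride > 0 && "a zero stride would never terminate");
    std::vector<uint32_t> visits(end, 0);
    for (uint32_t g = 0; g < numGroups; ++g)
        for (uint32_t l = 0; l < groupSize; ++l)
            for (uint32_t i = g * groupSize + l; i < end; i += stride)
                ++visits[i];
    return visits;
}

// True if every index is processed exactly once.
bool coversEachOnce(const std::vector<uint32_t>& visits) {
    for (uint32_t v : visits)
        if (v != 1) return false;
    return true;
}
```

With 4 workgroups of 64 invocations, a stride of 64 visits index 64 twice and index 128 three times. A stride of 256 visits every index once.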