TL;DR
I found evidence that supports both answers regarding coherent.
Current score:
- Requiring coherent with atomics: 1.5
- Omitting coherent with atomics: 5.75
Bottom line: still not sure despite the score. Inside a single workgroup, I'm mostly convinced coherent is not required in practice. I'm not so sure in these cases:
- more than one workgroup in a glDispatchCompute call
- multiple glDispatchCompute calls that all access the same memory location (atomically) without any glMemoryBarrier between them
However, is there a performance cost to declaring SSBOs (or individual struct members) coherent when you only access them through atomic operations? Based on what is below, I don't believe there is, because coherent adds "visibility" instructions or instruction flags at the variable's read or write operations. If a variable is only accessed through atomic operations, the compiler should hopefully:
- ignore coherent when generating the atomic instructions, because it has no effect there
- use the appropriate mechanism to make sure the result of the atomic operation is visible outside the shader invocation, warp, workgroup or rendering command.
Note that atomic counters are functionally different from atomic image/buffer variable operations. The latter still need coherent qualifiers, barriers, and the like. (removed on 2020-04-12)
However, if memory has been modified in an incoherent fashion, any subsequent reads from that memory are not automatically guaranteed to see these changes.
+1 for requiring coherent
// Fragment shader used for ACB gets output color from a texture
#version 430 core
uniform sampler2D texUnit;
layout(binding = 0) uniform atomic_uint acb[ s(nCounters) ];
smooth in vec2 texcoord;
layout(location = 0) out vec4 fragColor;
void main()
{
    for (int i = 0; i < s(nCounters); ++i) atomicCounterIncrement(acb[i]);
    fragColor = texture(texUnit, texcoord);
}
// Fragment shader used for SSBO gets output color from a texture
#version 430 core
uniform sampler2D texUnit;
smooth in vec2 texcoord;
layout(location = 0) out vec4 fragColor;
layout(std430, binding = 0) buffer ssbo_data
{
uint v[ s(nCounters) ];
};
void main()
{
    for (int i = 0; i < s(nCounters); ++i) atomicAdd(v[i], 1);
    fragColor = texture(texUnit, texcoord);
}
Notice that ssbo_data in the second shader is not declared coherent.
The article also states:
The OpenGL foundation recommends using [atomic counter buffers] over SSBOs for various reasons; however improved performance is not one of them. This is because ACBs are internally implemented as SSBO atomic operations; therefore there are no real performance benefits from utilizing ACBs.
So atomic counters are apparently just SSBO atomics under the hood. (But what are those "various reasons" and where are those recommendations? Is Intel hinting at a conspiracy in favor of atomic counters...?)
+1 for omitting coherent
GLSL specification
The GLSL spec uses different wording when describing coherent and atomic operations:
(4.10) When accessing memory using variables not declared as coherent, the memory accessed by a shader may be cached by the implementation to service future accesses to the same address. Memory stores may be cached in such a way that the values written might not be visible to other shader invocations accessing the same memory. The implementation may cache the values fetched by memory reads and return the same values to any shader invocation accessing the same memory, even if the underlying memory has been modified since the first memory read.
(8.11) Atomic memory functions perform atomic operations on an individual signed or unsigned integer stored in buffer-object or shared-variable storage. All of the atomic memory operations read a value from memory, compute a new value using one of the operations described below, write the new value to memory, and return the original value read. The contents of the memory being updated by the atomic operation are guaranteed not to be modified by any other assignment or atomic memory function in any shader invocation between the time the original value is read and the time the new value is written.
All the built-in functions in this section accept arguments with combinations of restrict, coherent, and volatile memory qualification, despite not having them listed in the prototypes. The atomic operation will operate as required by the calling argument’s memory qualification, not by the built-in function’s formal parameter memory qualification.
So on the one hand, atomic operations are supposed to work directly with the storage's memory (does that imply bypassing possible caches?). On the other hand, it seems that memory qualifications (e.g. coherent) play a role in what the atomic operation does.
+0.5 for requiring coherent
OpenGL specification
The OpenGL 4.6 spec sheds more light on this issue in section 7.13.1 "Shader Memory Access Ordering":
The built-in atomic memory transaction and atomic counter functions may be used to read and write a given memory address atomically. While built-in atomic functions issued by multiple shader invocations are executed in undefined order relative to each other, these functions perform both a read and a write of a memory address and guarantee that no other memory transaction will write to the underlying memory between the read and write. Atomics allow shaders to use shared global addresses for mutual exclusion or as counters, among other uses.
The intent of atomic operations, then, clearly seems to be to be, well, atomic all the time, and not dependent on a coherent qualifier. Indeed, why would one want an atomic operation that isn't somehow combined between different shader invocations? Incrementing a locally cached value from multiple invocations and having all of them eventually write back completely independent values makes no sense.
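As a small illustration (a sketch I made up for this write-up, not from any test code): a luminance histogram only works because every invocation's atomicAdd is combined on the same memory; a result based on privately cached copies of the bins would be useless.

#version 460
layout(local_size_x = 8, local_size_y = 8) in;
// Sketch only: bins is intentionally NOT declared coherent; every access to it
// goes through atomicAdd.
layout(std430, binding = 0) buffer Histogram { uint bins[256]; };
layout(binding = 0, r8) readonly uniform image2D luminance;

void main()
{
    float l = imageLoad(luminance, ivec2(gl_GlobalInvocationID.xy)).r;
    atomicAdd(bins[clamp(uint(l * 255.0), 0u, 255u)], 1u);
}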
+1 for omitting coherent
OpenGL spec issue #14
OpenGL 4.6: Do atomic counter buffers require the use of glMemoryBarrier calls to be able to access the counter?
We discussed this again in the OpenGL|ES meeting. Based on feedback from IHVs and their implementation of atomic counters we're planning to treat them like we treat other resources like image atomic, image load/store, buffer variables, etc. in that they require explicit synchronization from the application. The spec will be changed to add "atomic counters" to the places where the other resources are enumerated.
The described spec change occurred between OpenGL 4.5 and 4.6, but it relates to glMemoryBarrier, which plays no part inside a single glDispatchCompute.
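For reference, the explicit synchronization the issue talks about lives on the application side, between dispatch commands. A minimal sketch (program and buffer setup omitted; computeProgram is an assumed name):

// Two dispatches that atomically update the same SSBO. The barrier makes the
// first dispatch's writes visible to the second one; with atomic counter
// buffers, GL_ATOMIC_COUNTER_BARRIER_BIT would be the relevant bit instead.
glUseProgram(computeProgram);
glDispatchCompute(64, 1, 1);
glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT);
glDispatchCompute(64, 1, 1);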
No effect on the score.
Example Shader
Let's inspect the assembly produced by two simple shaders to see what happens in practice.
#version 460
layout(local_size_x = 512) in;
// Non-coherent qualified SSBO
layout(binding=0) restrict buffer Buf { uint count; } buf;
// Coherent qualified SSBO
layout(binding=1) coherent restrict buffer Buf_coherent { uint count; } buf_coherent;
void main()
{
    // First shader with atomics (v1)
    uint read_value1 = atomicAdd(buf.count, 2);
    uint read_value2 = atomicAdd(buf_coherent.count, 4);

    // Second shader with non-atomic add (v2)
    buf.count += 2;
    buf_coherent.count += 4;
}
The second shader is used to compare the effects of the coherent qualifier between atomic operations and non-atomic operations.
AMD
AMD publishes Instruction Set Architecture (ISA) documents which, coupled with the Radeon GPU Analyzer, give insight into how the GPUs actually implement this.
Shader v1 (Vega gfx900)
s_getpc_b64 s[0:1] BE801C80
s_mov_b32 s0, s2 BE800002
s_mov_b64 s[2:3], exec BE82017E
s_ff1_i32_b64 s4, exec BE84117E
s_lshl_b64 s[4:5], 1, s4 8E840481
s_and_b64 s[4:5], s[4:5], exec 86847E04
s_and_saveexec_b64 s[4:5], s[4:5] BE842004
s_cbranch_execz label_0010 BF880008
s_load_dwordx4 s[8:11], s[0:1], 0x00 C00A0200 00000000
s_bcnt1_i32_b64 s2, s[2:3] BE820D02
s_mulk_i32 s2, 0x0002 B7820002
v_mov_b32 v0, s2 7E000202
s_waitcnt lgkmcnt(0) BF8CC07F
buffer_atomic_add v0, v0, s[8:11], 0 E1080000 80020000
label_0010:
s_mov_b64 exec, s[4:5] BEFE0104
s_mov_b64 s[2:3], exec BE82017E
s_ff1_i32_b64 s4, exec BE84117E
s_lshl_b64 s[4:5], 1, s4 8E840481
s_and_b64 s[4:5], s[4:5], exec 86847E04
s_and_saveexec_b64 s[4:5], s[4:5] BE842004
s_cbranch_execz label_001F BF880008
s_load_dwordx4 s[8:11], s[0:1], 0x20 C00A0200 00000020
s_bcnt1_i32_b64 s0, s[2:3] BE800D02
s_mulk_i32 s0, 0x0004 B7800004
v_mov_b32 v0, s0 7E000200
s_waitcnt lgkmcnt(0) BF8CC07F
buffer_atomic_add v0, v0, s[8:11], 0 E1080000 80020000
label_001F:
s_endpgm BF810000
(I wasn't sure at first why the exec mask and branching are used here; it appears to be a per-wavefront reduction of the atomic: s_bcnt1_i32_b64 counts the active lanes, the count is multiplied by the addend, and a single lane then issues one buffer_atomic_add for the whole wavefront.)
We can see that both atomic operations (on coherent and non-coherent buffers) result in the same instruction on all supported architectures of the Radeon GPU Analyzer:
buffer_atomic_add v0, v0, s[8:11], 0 E1080000 80020000
Decoding this instruction shows that the GLC (Globally Coherent) flag is set to 0, which for atomic operations means: "Previous data value is not returned. No L1 persistence across wavefronts". Modifying the shader to use the returned values changes the GLC flag of both atomic instructions to 1, which means: "Previous data value is returned. No L1 persistence across wavefronts".
The documents dating from 2013 (Sea Islands, etc.) have an interesting description of the BUFFER_ATOMIC_<op> instructions:
Buffer object atomic operation. Always globally coherent.
So on AMD hardware, it appears coherent has no effect for atomic operations.
Shader v2 (Vega gfx900)
s_getpc_b64 s[0:1] BE801C80
s_mov_b32 s0, s2 BE800002
s_load_dwordx4 s[4:7], s[0:1], 0x00 C00A0100 00000000
s_waitcnt lgkmcnt(0) BF8CC07F
buffer_load_dword v0, v0, s[4:7], 0 E0500000 80010000
s_load_dwordx4 s[0:3], s[0:1], 0x20 C00A0000 00000020
s_waitcnt vmcnt(0) BF8C0F70
v_add_u32 v0, 2, v0 68000082
buffer_store_dword v0, v0, s[4:7], 0 glc E0704000 80010000
s_waitcnt lgkmcnt(0) BF8CC07F
buffer_load_dword v0, v0, s[0:3], 0 glc E0504000 80000000
s_waitcnt vmcnt(0) BF8C0F70
v_add_u32 v0, 4, v0 68000084
buffer_store_dword v0, v0, s[0:3], 0 glc E0704000 80000000
s_endpgm BF810000
The buffer_load_dword operation on the coherent buffer uses the glc flag and the one on the non-coherent buffer does not, as expected.
On AMD: +1 for omitting coherent
NVIDIA
It's possible to get the assembly of a shader by inspecting the blob returned by glGetProgramBinary(). The instructions are described in NV_gpu_program4, NV_gpu_program5 and NV_gpu_program5_mem_extended.
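Retrieving the blob is standard OpenGL; the human-readable assembly sits as plain text inside it. A minimal sketch, error handling omitted:

// Fetch the program binary and scan it for the NV assembly text section.
GLint length = 0;
glGetProgramiv(program, GL_PROGRAM_BINARY_LENGTH, &length);
char *blob = (char *)malloc(length);
GLenum format = 0;
glGetProgramBinary(program, length, NULL, &format, blob);
// The listing starts at the "!!NVcp5.0" marker inside blob.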
Shader v1
!!NVcp5.0
OPTION NV_internal;
OPTION NV_shader_storage_buffer;
OPTION NV_bindless_texture;
GROUP_SIZE 512;
STORAGE sbo_buf0[] = { program.storage[0] };
STORAGE sbo_buf1[] = { program.storage[1] };
STORAGE sbo_buf2[] = { program.storage[2] };
TEMP R0;
TEMP T;
ATOMB.ADD.U32 R0.x, {2, 0, 0, 0}, sbo_buf0[0];
ATOMB.ADD.U32 R0.x, {4, 0, 0, 0}, sbo_buf1[0];
END
There is no difference whether coherent is present or not.
Shader v2
!!NVcp5.0
OPTION NV_internal;
OPTION NV_shader_storage_buffer;
OPTION NV_bindless_texture;
GROUP_SIZE 512;
STORAGE sbo_buf0[] = { program.storage[0] };
STORAGE sbo_buf1[] = { program.storage[1] };
STORAGE sbo_buf2[] = { program.storage[2] };
TEMP R0;
TEMP T;
LDB.U32 R0.x, sbo_buf0[0];
ADD.U R0.x, R0, {2, 0, 0, 0};
STB.U32 R0, sbo_buf0[0];
LDB.U32.COH R0.x, sbo_buf1[0];
ADD.U R0.x, R0, {4, 0, 0, 0};
STB.U32 R0, sbo_buf1[0];
END
The LDB.U32 operation on the coherent buffer uses the COH modifier, which means "Make LOAD and STORE operations use coherent caching".
On NVIDIA: +1 for omitting coherent
SPIR-V (with Vulkan target)
Let's see what SPIR-V code is generated by the glslang SPIR-V generator.
Shader v1
// Generated with glslangValidator.exe -H --target-env vulkan1.1
// Module Version 10300
// Generated by (magic number): 80008
// Id's are bound by 30
Capability Shader
1: ExtInstImport "GLSL.std.450"
MemoryModel Logical GLSL450
EntryPoint GLCompute 4 "main"
ExecutionMode 4 LocalSize 512 1 1
Source GLSL 460
Name 4 "main"
Name 8 "read_value1"
Name 9 "Buf"
MemberName 9(Buf) 0 "count"
Name 11 "buf"
Name 20 "read_value2"
Name 21 "Buf_coherent"
MemberName 21(Buf_coherent) 0 "count"
Name 23 "buf_coherent"
MemberDecorate 9(Buf) 0 Restrict
MemberDecorate 9(Buf) 0 Offset 0
Decorate 9(Buf) Block
Decorate 11(buf) DescriptorSet 0
Decorate 11(buf) Binding 0
MemberDecorate 21(Buf_coherent) 0 Coherent
MemberDecorate 21(Buf_coherent) 0 Restrict
MemberDecorate 21(Buf_coherent) 0 Offset 0
Decorate 21(Buf_coherent) Block
Decorate 23(buf_coherent) DescriptorSet 0
Decorate 23(buf_coherent) Binding 1
Decorate 29 BuiltIn WorkgroupSize
2: TypeVoid
3: TypeFunction 2
6: TypeInt 32 0
7: TypePointer Function 6(int)
9(Buf): TypeStruct 6(int)
10: TypePointer StorageBuffer 9(Buf)
11(buf): 10(ptr) Variable StorageBuffer
12: TypeInt 32 1
13: 12(int) Constant 0
14: TypePointer StorageBuffer 6(int)
16: 6(int) Constant 2
17: 6(int) Constant 1
18: 6(int) Constant 0
21(Buf_coherent): TypeStruct 6(int)
22: TypePointer StorageBuffer 21(Buf_coherent)
23(buf_coherent): 22(ptr) Variable StorageBuffer
25: 6(int) Constant 4
27: TypeVector 6(int) 3
28: 6(int) Constant 512
29: 27(ivec3) ConstantComposite 28 17 17
4(main): 2 Function None 3
5: Label
8(read_value1): 7(ptr) Variable Function
20(read_value2): 7(ptr) Variable Function
15: 14(ptr) AccessChain 11(buf) 13
19: 6(int) AtomicIAdd 15 17 18 16
Store 8(read_value1) 19
24: 14(ptr) AccessChain 23(buf_coherent) 13
26: 6(int) AtomicIAdd 24 17 18 25
Store 20(read_value2) 26
Return
FunctionEnd
The only difference between buf and buf_coherent is the decoration of the latter with MemberDecorate 21(Buf_coherent) 0 Coherent. Their usage afterwards is identical.
Adding #pragma use_vulkan_memory_model to the shader enables the Vulkan memory model and produces these (abbreviated) changes:
Capability Shader
+ Capability VulkanMemoryModelKHR
+ Extension "SPV_KHR_vulkan_memory_model"
1: ExtInstImport "GLSL.std.450"
- MemoryModel Logical GLSL450
+ MemoryModel Logical VulkanKHR
EntryPoint GLCompute 4 "main"
Decorate 11(buf) Binding 0
- MemberDecorate 21(Buf_coherent) 0 Coherent
MemberDecorate 21(Buf_coherent) 0 Restrict
which means... I don't quite know, because I'm not versed in Vulkan's intricacies. I did find this informative section of the "Memory Model" appendix in the Vulkan 1.2 spec:
While GLSL (and legacy SPIR-V) applies the “coherent” decoration to variables (for historical reasons), this model treats each memory access instruction as having optional implicit availability/visibility operations. GLSL to SPIR-V compilers should map all (non-atomic) operations on a coherent variable to Make{Pointer,Texel}{Available}{Visible} flags in this model.
Atomic operations implicitly have availability/visibility operations, and the scope of those operations is taken from the atomic operation’s scope.
Shader v2
(skipping full output)
The only difference between buf and buf_coherent is again MemberDecorate 18(Buf_coherent) 0 Coherent.
Adding #pragma use_vulkan_memory_model to the shader enables the Vulkan memory model and produces these (abbreviated) changes:
- MemberDecorate 18(Buf_coherent) 0 Coherent
- 23: 6(int) Load 22
- 24: 6(int) IAdd 23 21
- 25: 13(ptr) AccessChain 20(buf_coherent) 11
- Store 25 24
+ 23: 6(int) Load 22 MakePointerVisibleKHR NonPrivatePointerKHR 24
+ 25: 6(int) IAdd 23 21
+ 26: 13(ptr) AccessChain 20(buf_coherent) 11
+ Store 26 25 MakePointerAvailableKHR NonPrivatePointerKHR 24
Notice the addition of MakePointerVisibleKHR and MakePointerAvailableKHR, which control operation coherency at the instruction level instead of the variable level.
+1 for omitting coherent (maybe?)
CUDA
The Parallel Thread Execution ISA section of the CUDA Toolkit documentation has this information:
8.5. Scope
Each strong operation must specify a scope, which is the set of threads that may interact directly with that operation and establish any of the relations described in the memory consistency model. There are three scopes:
Table 18. Scopes
- .cta: The set of all threads executing in the same CTA as the current thread.
- .gpu: The set of all threads in the current program executing on the same compute device as the current thread. This also includes other kernel grids invoked by the host program on the same compute device.
- .sys: The set of all threads in the current program, including all kernel grids invoked by the host program on all compute devices, and all threads constituting the host program itself.
Note that the warp is not a scope; the CTA is the smallest collection of threads that qualifies as a scope in the memory consistency model.
Regarding CTA:
A cooperative thread array (CTA) is a set of concurrent threads that execute the same kernel program. A grid is a set of CTAs that execute independently.
So in GLSL terms, CTA == work group and grid == glDispatchCompute call.
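CUDA exposes these scopes directly on device-side atomics, which makes the distinction concrete (a sketch for illustration only; the kernel and variable names are invented):

// atomicAdd()       -> .gpu scope by default: combined across the whole grid.
// atomicAdd_block() -> .cta scope: only guaranteed within the block, i.e. the
//                      GLSL work group equivalent (requires sm_60 or newer).
__global__ void count(unsigned int *grid_counter, unsigned int *block_counters)
{
    atomicAdd(grid_counter, 1u);
    atomicAdd_block(&block_counters[blockIdx.x], 1u);
}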
The atom instruction description:
9.7.12.4. Parallel Synchronization and Communication Instructions: atom
Atomic reduction operations for thread-to-thread communication.
[...]
The optional .scope qualifier specifies the set of threads that can directly observe the memory synchronizing effect of this operation, as described in the Memory Consistency Model.
[...]
If no scope is specified, the atomic operation is performed with .gpu scope.
So by default, all shader invocations of a glDispatchCompute would see the result of an atomic operation... unless the GLSL compiler generates something that uses the cta scope, in which case it would only be visible inside the work group. This latter case however corresponds to shared GLSL variables, so perhaps it's only used for those and not for SSBO operations. NVIDIA isn't very open about this process so I haven't found a way to tell for sure (perhaps with glGetProgramBinary). However, since the semantics of cta map to a work group and those of gpu map to buffers (i.e. SSBOs, images, etc.), I declare:
+0.5 for omitting coherent
Empirical evidence
I have written a particle system compute shader that uses an SSBO-backed variable as an operand to atomicAdd() and it works. Usage of coherent was not necessary even with a work group size of 512. However, there was never more than one work group. This was tested mainly on an NVIDIA GTX 1080, so as seen above, atomic operations on NVIDIA seem to always be at least visible inside the work group.
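The relevant part of that shader looked roughly like this (a simplified sketch with invented names, not the actual code):

#version 460
layout(local_size_x = 512) in;
// aliveCount is only ever touched through atomicAdd and is not declared coherent.
layout(std430, binding = 0) buffer Counter { uint aliveCount; };
layout(std430, binding = 1) readonly buffer Input { vec4 particlesIn[]; };
layout(std430, binding = 2) writeonly buffer Output { vec4 particlesOut[]; };

void main()
{
    vec4 p = particlesIn[gl_GlobalInvocationID.x];
    if (p.w > 0.0) // w = remaining lifetime in this sketch
    {
        uint slot = atomicAdd(aliveCount, 1u); // unique output slot per surviving particle
        particlesOut[slot] = p;
    }
}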
+0.25 for omitting coherent