GLSL: about coherent qualifier

I don't clearly understand how the coherent qualifier and atomic operations work together.

I perform some accumulating operation on the same SSBO location with this code:

uint prevValue, newValue;
uint readValue = ssbo[index];
do
{
    prevValue = readValue;
    newValue = F(readValue);
}
while((readValue = atomicCompSwap(ssbo[index], prevValue, newValue)) != prevValue);

This code works fine for me, but do I still need to declare the SSBO (or image) with the coherent qualifier in this case?

And do I need to use coherent when I only call atomicAdd?

When exactly do I need the coherent qualifier? Is it only needed for direct writes such as ssbo[index] = value;?

Habitue answered 28/5, 2019 at 10:36 Comment(1)
related: #57115120 – Coaction

TL;DR

I found evidence that supports both answers regarding coherent.

Current score:

  • Requiring coherent with atomics: 1.5
  • Omitting coherent with atomics: 5.75

Bottom line, still not sure despite the score. Inside a single workgroup, I'm mostly convinced coherent is not required in practice. I'm not so sure in these cases:

  1. more than 1 workgroup in glDispatchCompute
  2. multiple glDispatchCompute calls that all access the same memory location (atomically) without any glMemoryBarrier between them

However, is there a performance cost to declaring SSBOs (or individual struct members) coherent when you only access them through atomic operations? Based on what is below, I don't believe there is because coherent adds "visibility" instructions or instruction flags at the variable's read or write operations. If a variable is only accessed through atomic operations, the compiler should hopefully:

  1. ignore coherent when generating the atomic instructions because it has no effect
  2. use the appropriate mechanism to make sure the result of the atomic operation is visible outside the shader invocation, warp, workgroup or rendering command.
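
As a minimal sketch of the case being discussed (a coherent-qualified SSBO that is only ever touched through atomic built-ins), consider the compute shader below. The block name Counters, the member total and the work group size are hypothetical and chosen only for illustration; nothing here shows what any particular driver generates.

```glsl
#version 430
layout(local_size_x = 64) in;

// Hypothetical block and member names. `coherent` is the qualifier in question.
layout(std430, binding = 0) coherent restrict buffer Counters
{
    uint total;
} counters;

void main()
{
    // The only access to `total` is through an atomic built-in, so the shader
    // contains no plain load or store of this member for `coherent` to affect.
    atomicAdd(counters.total, 1u);
}
```

If the reasoning above holds, removing coherent from this declaration should not change the code generated for the atomicAdd call; the assembly comparisons further down test exactly that.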

From the OpenGL wiki's "Memory Model" page:

Note that atomic counters are different functionally from atomic image/buffer variable operations. The latter still need coherent qualifiers, barriers, and the like. (removed on 2020-04-12)

However, if memory has been modified in an incoherent fashion, any subsequent reads from that memory are not automatically guaranteed to see these changes.

+1 for requiring coherent

The code from Intel's article "OpenGL Performance Tips: Atomic Counter Buffers versus Shader Storage Buffer Objects"

// Fragment shader used for ACB gets output color from a texture
#version 430 core

uniform sampler2D texUnit;
layout(binding = 0) uniform atomic_uint acb[ s(nCounters) ];
smooth in vec2 texcoord;
layout(location = 0) out vec4 fragColor;

void main()
{
    for (int i=0; i<  s(nCounters) ; ++i) atomicCounterIncrement(acb[i]);
    fragColor = texture(texUnit, texcoord);
}

// Fragment shader used for SSBO gets output color from a texture
#version 430 core

uniform sampler2D texUnit;
smooth in vec2 texcoord;
layout(location = 0) out vec4 fragColor;
layout(std430, binding = 0) buffer ssbo_data
{
    uint v[ s(nCounters) ];
};

void main()
{
    for (int i=0; i< s(nCounters) ; ++i) atomicAdd(v[i], 1);
    fragColor = texture(texUnit, texcoord);
}

Notice that ssbo_data in the second shader is not declared coherent.

The article also states:

The OpenGL foundation recommends using [atomic counter buffers] over SSBOs for various reasons; however improved performance is not one of them. This is because ACBs are internally implemented as SSBO atomic operations; therefore there are no real performance benefits from utilizing ACBs.

So atomic counters are actually the same thing as SSBOs apparently. (But what are those "various reasons" and where are those recommendations? Is Intel hinting at a conspiracy in favor of atomic counters...?)

+1 for omitting coherent

GLSL specification

The GLSL spec uses different wording when describing coherent and atomic operations (emphasis mine):

(4.10) When accessing memory using variables not declared as coherent, the memory accessed by a shader may be cached by the implementation to service future accesses to the same address. Memory stores may be cached in such a way that the values written might not be visible to other shader invocations accessing the same memory. The implementation may cache the values fetched by memory reads and return the same values to any shader invocation accessing the same memory, even if the underlying memory has been modified since the first memory read.

(8.11) Atomic memory functions perform atomic operations on an individual signed or unsigned integer stored in buffer-object or shared-variable storage. All of the atomic memory operations read a value from memory, compute a new value using one of the operations described below, write the new value to memory, and return the original value read. The contents of the memory being updated by the atomic operation are guaranteed not to be modified by any other assignment or atomic memory function in any shader invocation between the time the original value is read and the time the new value is written.

All the built-in functions in this section accept arguments with combinations of restrict, coherent, and volatile memory qualification, despite not having them listed in the prototypes. The atomic operation will operate as required by the calling argument’s memory qualification, not by the built-in function’s formal parameter memory qualification.

So on the one hand atomic operations are supposed to work directly with the storage's memory (does that imply bypassing possible caches?). On the other hand, it seems that memory qualifications (e.g. coherent) play a role in what the atomic operation does.
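
To make the second quoted paragraph concrete, here is a small sketch in which the same atomic built-in is called once with a coherent-qualified argument and once with an unqualified one. The block name Stats and its members hits and misses are hypothetical; the sketch only shows that both calls are legal GLSL and that the qualification comes from the calling argument, not that the two calls behave differently on any given implementation.

```glsl
#version 430
layout(local_size_x = 64) in;

// Hypothetical buffer block: one member carries a memory qualifier, one does not.
layout(std430, binding = 0) buffer Stats
{
    coherent uint hits;   // member-level memory qualifier
    uint misses;          // no memory qualifier
} stats;

void main()
{
    // Same built-in, two differently qualified arguments. Per section 8.11,
    // the operation follows the memory qualification of the argument passed in.
    atomicAdd(stats.hits, 1u);
    atomicAdd(stats.misses, 1u);
}
```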

+0.5 for requiring coherent

OpenGL specification

The OpenGL 4.6 spec sheds more light on this issue in section 7.13.1 "Shader Memory Access Ordering"

The built-in atomic memory transaction and atomic counter functions may be used to read and write a given memory address atomically. While built-in atomic functions issued by multiple shader invocations are executed in undefined order relative to each other, these functions perform both a read and a write of a memory address and guarantee that no other memory transaction will write to the underlying memory between the read and write. Atomics allow shaders to use shared global addresses for mutual exclusion or as counters, among other uses.

The intent of atomic operations then clearly seems to be, well, atomic all the time and not depending on a coherent qualifier. Indeed, why would one want an atomic operation that isn't somehow combined between different shader invocations? Incrementing a locally cached value from multiple invocations and having all of them eventually write a completely independent value makes no sense.

+1 for omitting coherent

OpenGL spec issue #14

OpenGL 4.6: Do atomic counter buffers require the use of glMemoryBarrier calls to be able to access the counter?

We discussed this again in the OpenGL|ES meeting. Based on feedback from IHVs and their implementation of atomic counters we're planning to treat them like we treat other resources like image atomic, image load/store, buffer variables, etc. in that they require explicit synchronization from the application. The spec will be changed to add "atomic counters" to the places where the other resources are enumerated.

The described spec change occurred between OpenGL 4.5 and 4.6, but it relates to glMemoryBarrier, which plays no part inside a single glDispatchCompute call.

no effect

Example Shader

Let's inspect the assembly produced by two simple shaders to see what happens in practice.

#version 460
layout(local_size_x = 512) in;

// Non-coherent qualified SSBO
layout(binding=0) restrict buffer Buf { uint count; } buf;

// Coherent qualified SSBO
layout(binding=1) coherent restrict buffer Buf_coherent { uint count; } buf_coherent;

void main()
{
  // First shader with atomics (v1)
  uint read_value1 = atomicAdd(buf.count, 2);
  uint read_value2 = atomicAdd(buf_coherent.count, 4);

  // Second shader with non-atomic add (v2)
  buf.count += 2;
  buf_coherent.count += 4;
}

The second shader is used to compare the effects of the coherent qualifier between atomic operations and non-atomic operations.

AMD

AMD publishes Instruction Set Architecture (ISA) documents which, coupled with the Radeon GPU Analyzer, give insight into how GPUs actually implement this.

Shader v1 (Vega gfx900)

s_getpc_b64           s[0:1]                   BE801C80
s_mov_b32             s0, s2                   BE800002
s_mov_b64             s[2:3], exec             BE82017E
s_ff1_i32_b64         s4, exec                 BE84117E
s_lshl_b64            s[4:5], 1, s4            8E840481
s_and_b64             s[4:5], s[4:5], exec     86847E04
s_and_saveexec_b64    s[4:5], s[4:5]           BE842004
s_cbranch_execz       label_0010               BF880008
s_load_dwordx4        s[8:11], s[0:1], 0x00    C00A0200 00000000
s_bcnt1_i32_b64       s2, s[2:3]               BE820D02
s_mulk_i32            s2, 0x0002               B7820002
v_mov_b32             v0, s2                   7E000202
s_waitcnt             lgkmcnt(0)               BF8CC07F
buffer_atomic_add     v0, v0, s[8:11], 0       E1080000 80020000
label_0010:
s_mov_b64             exec, s[4:5]             BEFE0104
s_mov_b64             s[2:3], exec             BE82017E
s_ff1_i32_b64         s4, exec                 BE84117E
s_lshl_b64            s[4:5], 1, s4            8E840481
s_and_b64             s[4:5], s[4:5], exec     86847E04
s_and_saveexec_b64    s[4:5], s[4:5]           BE842004
s_cbranch_execz       label_001F               BF880008
s_load_dwordx4        s[8:11], s[0:1], 0x20    C00A0200 00000020
s_bcnt1_i32_b64       s0, s[2:3]               BE800D02
s_mulk_i32            s0, 0x0004               B7800004
v_mov_b32             v0, s0                   7E000200
s_waitcnt             lgkmcnt(0)               BF8CC07F
buffer_atomic_add     v0, v0, s[8:11], 0       E1080000 80020000
label_001F:
s_endpgm                                       BF810000

(Don't know why the exec mask and branching is used here...)

We can see that both atomic operations (on coherent and non-coherent buffers) result in the same instruction on all supported architectures of the Radeon GPU Analyzer:

buffer_atomic_add     v0, v0, s[8:11], 0       E1080000 80020000

Decoding this instruction shows that the GLC (Globally Coherent) flag is set to 0 which means for atomic operations: "Previous data value is not returned. No L1 persistence across wavefronts". Modifying the shader to use the returned values changes the GLC flag of both atomic instructions to 1 which means: "Previous data value is returned. No L1 persistence across wavefronts".

The documents dating from 2013 (Sea Islands, etc.) have an interesting description of the BUFFER_ATOMIC_<op> instructions:

Buffer object atomic operation. Always globally coherent.

So on AMD hardware, it appears coherent has no effect for atomic operations.

Shader v2 (Vega gfx900)

s_getpc_b64           s[0:1]                   BE801C80
s_mov_b32             s0, s2                   BE800002
s_load_dwordx4        s[4:7], s[0:1], 0x00     C00A0100 00000000
s_waitcnt             lgkmcnt(0)               BF8CC07F
buffer_load_dword     v0, v0, s[4:7], 0        E0500000 80010000
s_load_dwordx4        s[0:3], s[0:1], 0x20     C00A0000 00000020
s_waitcnt             vmcnt(0)                 BF8C0F70
v_add_u32             v0, 2, v0                68000082
buffer_store_dword    v0, v0, s[4:7], 0 glc    E0704000 80010000
s_waitcnt             lgkmcnt(0)               BF8CC07F
buffer_load_dword     v0, v0, s[0:3], 0 glc    E0504000 80000000
s_waitcnt             vmcnt(0)                 BF8C0F70
v_add_u32             v0, 4, v0                68000084
buffer_store_dword    v0, v0, s[0:3], 0 glc    E0704000 80000000
s_endpgm                                       BF810000

The buffer_load_dword operation on the coherent buffer uses the glc flag and the one on the non-coherent buffer does not, as expected.

On AMD: +1 for omitting coherent

NVIDIA

It's possible to get the assembly of a shader by inspecting the blob returned by glGetProgramBinary(). The instructions are described in NV_gpu_program4, NV_gpu_program5 and NV_gpu_program5_mem_extended.

Shader v1

!!NVcp5.0
OPTION NV_internal;
OPTION NV_shader_storage_buffer;
OPTION NV_bindless_texture;
GROUP_SIZE 512;
STORAGE sbo_buf0[] = { program.storage[0] };
STORAGE sbo_buf1[] = { program.storage[1] };
STORAGE sbo_buf2[] = { program.storage[2] };
TEMP R0;
TEMP T;
ATOMB.ADD.U32 R0.x, {2, 0, 0, 0}, sbo_buf0[0];
ATOMB.ADD.U32 R0.x, {4, 0, 0, 0}, sbo_buf1[0];
END

There is no difference whether coherent is present or not.

Shader v2

!!NVcp5.0
OPTION NV_internal;
OPTION NV_shader_storage_buffer;
OPTION NV_bindless_texture;
GROUP_SIZE 512;
STORAGE sbo_buf0[] = { program.storage[0] };
STORAGE sbo_buf1[] = { program.storage[1] };
STORAGE sbo_buf2[] = { program.storage[2] };
TEMP R0;
TEMP T;
LDB.U32 R0.x, sbo_buf0[0];
ADD.U R0.x, R0, {2, 0, 0, 0};
STB.U32 R0, sbo_buf0[0];
LDB.U32.COH R0.x, sbo_buf1[0];
ADD.U R0.x, R0, {4, 0, 0, 0};
STB.U32 R0, sbo_buf1[0];
END

The LDB.U32 operation on the coherent buffer uses the COH modifier which means "Make LOAD and STORE operations use coherent caching".

On NVIDIA: +1 for omitting coherent

SPIR-V (with Vulkan target)

Let's see what SPIR-V code is generated by the glslang SPIR-V generator.

Shader v1

// Generated with glslangValidator.exe -H --target-env vulkan1.1
// Module Version 10300
// Generated by (magic number): 80008
// Id's are bound by 30

                              Capability Shader
               1:             ExtInstImport  "GLSL.std.450"
                              MemoryModel Logical GLSL450
                              EntryPoint GLCompute 4  "main"
                              ExecutionMode 4 LocalSize 512 1 1
                              Source GLSL 460
                              Name 4  "main"
                              Name 8  "read_value1"
                              Name 9  "Buf"
                              MemberName 9(Buf) 0  "count"
                              Name 11  "buf"
                              Name 20  "read_value2"
                              Name 21  "Buf_coherent"
                              MemberName 21(Buf_coherent) 0  "count"
                              Name 23  "buf_coherent"
                              MemberDecorate 9(Buf) 0 Restrict
                              MemberDecorate 9(Buf) 0 Offset 0
                              Decorate 9(Buf) Block
                              Decorate 11(buf) DescriptorSet 0
                              Decorate 11(buf) Binding 0
                              MemberDecorate 21(Buf_coherent) 0 Coherent
                              MemberDecorate 21(Buf_coherent) 0 Restrict
                              MemberDecorate 21(Buf_coherent) 0 Offset 0
                              Decorate 21(Buf_coherent) Block
                              Decorate 23(buf_coherent) DescriptorSet 0
                              Decorate 23(buf_coherent) Binding 1
                              Decorate 29 BuiltIn WorkgroupSize
               2:             TypeVoid
               3:             TypeFunction 2
               6:             TypeInt 32 0
               7:             TypePointer Function 6(int)
          9(Buf):             TypeStruct 6(int)
              10:             TypePointer StorageBuffer 9(Buf)
         11(buf):     10(ptr) Variable StorageBuffer
              12:             TypeInt 32 1
              13:     12(int) Constant 0
              14:             TypePointer StorageBuffer 6(int)
              16:      6(int) Constant 2
              17:      6(int) Constant 1
              18:      6(int) Constant 0
21(Buf_coherent):             TypeStruct 6(int)
              22:             TypePointer StorageBuffer 21(Buf_coherent)
23(buf_coherent):     22(ptr) Variable StorageBuffer
              25:      6(int) Constant 4
              27:             TypeVector 6(int) 3
              28:      6(int) Constant 512
              29:   27(ivec3) ConstantComposite 28 17 17
         4(main):           2 Function None 3
               5:             Label
  8(read_value1):      7(ptr) Variable Function
 20(read_value2):      7(ptr) Variable Function
              15:     14(ptr) AccessChain 11(buf) 13
              19:      6(int) AtomicIAdd 15 17 18 16
                              Store 8(read_value1) 19
              24:     14(ptr) AccessChain 23(buf_coherent) 13
              26:      6(int) AtomicIAdd 24 17 18 25
                              Store 20(read_value2) 26
                              Return
                              FunctionEnd

The only difference between buf and buf_coherent is the decoration of the latter with MemberDecorate 21(Buf_coherent) 0 Coherent. Their usage afterwards is identical.

Adding #pragma use_vulkan_memory_model to the shader enables the Vulkan memory model and produces these (abbreviated) changes:

                              Capability Shader
+                             Capability VulkanMemoryModelKHR
+                             Extension  "SPV_KHR_vulkan_memory_model"
               1:             ExtInstImport  "GLSL.std.450"
-                             MemoryModel Logical GLSL450
+                             MemoryModel Logical VulkanKHR
                              EntryPoint GLCompute 4  "main"
                              
                              Decorate 11(buf) Binding 0
-                             MemberDecorate 21(Buf_coherent) 0 Coherent
                              MemberDecorate 21(Buf_coherent) 0 Restrict

which means... I don't quite know because I'm not versed in Vulkan's intricacies. I did find this informative section of the "Memory Model" appendix in the Vulkan 1.2 spec:

While GLSL (and legacy SPIR-V) applies the “coherent” decoration to variables (for historical reasons), this model treats each memory access instruction as having optional implicit availability/visibility operations. GLSL to SPIR-V compilers should map all (non-atomic) operations on a coherent variable to Make{Pointer,Texel}{Available}{Visible} flags in this model.

Atomic operations implicitly have availability/visibility operations, and the scope of those operations is taken from the atomic operation’s scope.

Shader v2

(skipping full output)

The only difference between buf and buf_coherent is again MemberDecorate 18(Buf_coherent) 0 Coherent.

Adding #pragma use_vulkan_memory_model to the shader enables the Vulkan memory model and produces these (abbreviated) changes:

-                             MemberDecorate 18(Buf_coherent) 0 Coherent

-             23:      6(int) Load 22
-             24:      6(int) IAdd 23 21
-             25:     13(ptr) AccessChain 20(buf_coherent) 11
-                             Store 25 24
+             23:      6(int) Load 22 MakePointerVisibleKHR NonPrivatePointerKHR 24
+             25:      6(int) IAdd 23 21
+             26:     13(ptr) AccessChain 20(buf_coherent) 11
+                             Store 26 25 MakePointerAvailableKHR NonPrivatePointerKHR 24

Notice the addition of MakePointerVisibleKHR and MakePointerAvailableKHR that control operation coherency at the instruction level instead of the variable level.

+1 for omitting coherent (maybe?)

CUDA

The Parallel Thread Execution ISA section of the CUDA Toolkit documentation has this information:

8.5. Scope

Each strong operation must specify a scope, which is the set of threads that may interact directly with that operation and establish any of the relations described in the memory consistency model. There are three scopes:

Table 18. Scopes

  • .cta: The set of all threads executing in the same CTA as the current thread.
  • .gpu: The set of all threads in the current program executing on the same compute device as the current thread. This also includes other kernel grids invoked by the host program on the same compute device.
  • .sys: The set of all threads in the current program, including all kernel grids invoked by the host program on all compute devices, and all threads constituting the host program itself.

Note that the warp is not a scope; the CTA is the smallest collection of threads that qualifies as a scope in the memory consistency model.

Regarding CTA:

A cooperative thread array (CTA) is a set of concurrent threads that execute the same kernel program. A grid is a set of CTAs that execute independently.

So in GLSL terms, CTA == work group and grid == glDispatchCompute call.

The atom instruction description:

9.7.12.4. Parallel Synchronization and Communication Instructions: atom

Atomic reduction operations for thread-to-thread communication.

[...]

The optional .scope qualifier specifies the set of threads that can directly observe the memory synchronizing effect of this operation, as described in the Memory Consistency Model.

[...]

If no scope is specified, the atomic operation is performed with .gpu scope.

So by default, all shader invocations of a glDispatchCompute would see the result of an atomic operation... unless the GLSL compiler generates something that uses the cta scope in which case it would only be visible inside the workgroup. This latter case however corresponds to shared GLSL variables so perhaps it's only used for those and not for SSBO operations. NVIDIA isn't very open about this process so I haven't found a way to tell for sure (perhaps with glGetProgramBinary). However, since the semantics of cta map to a work group and gpu to buffers (i.e. SSBO, images, etc), I declare:
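
On the GLSL side, the two kinds of storage that this mapping distinguishes can be sketched as follows: an atomic on a shared variable (work group only) followed by a single atomic on an SSBO member (visible to the whole dispatch). The names Result, globalCount and localCount are hypothetical, and the sketch makes no claim about which PTX scope NVIDIA's compiler actually selects for either call.

```glsl
#version 430
layout(local_size_x = 256) in;

// Hypothetical SSBO holding a counter shared by every work group.
layout(std430, binding = 0) buffer Result
{
    uint globalCount;
} result;

// Shared storage exists only within one work group.
shared uint localCount;

void main()
{
    if (gl_LocalInvocationIndex == 0u)
        localCount = 0u;
    memoryBarrierShared();
    barrier();

    // Atomic on shared storage: only this work group can observe it.
    atomicAdd(localCount, 1u);

    memoryBarrierShared();
    barrier();

    // Atomic on buffer storage: one update per work group.
    if (gl_LocalInvocationIndex == 0u)
        atomicAdd(result.globalCount, localCount);
}
```

Neither variable is declared coherent here; whether the buffer atomic still needs it across work groups is the open question stated at the top of this answer.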

+0.5 for omitting coherent

Empirical evidence

I have written a particle system compute shader that uses an SSBO-backed variable as an operand to atomicAdd() and it works. Use of coherent was not necessary even with a work group size of 512. However, there was never more than 1 work group. This was tested mainly on an NVIDIA GTX 1080 so, as seen above, atomic operations on NVIDIA seem to always be at least visible inside the work group.

+0.25 for omitting coherent

Coaction answered 27/2, 2020 at 13:27 Comment(0)
