Weak guarantees for non-atomic writes on GPUs?

OpenCL and CUDA have included atomic operations for several years now (although obviously not every CUDA or OpenCL device supports these). But - my question is about the possibility of "living with" races due to non-atomic writes.

Suppose several threads in a grid all write to the same location in global memory. Are we guaranteed that, when kernel execution has concluded, the results of one of these writes will be present in that location, rather than some junk?

Relevant parameters for this question (choose any combination(s), edit except nVIDIA+CUDA which already got an answer):

Memory space: Global memory only; this question is not about local/shared/private memory.
Alignment: Within a single memory write width (e.g. 128 bits on nVIDIA GPUs)
GPU Manufacturer: AMD / nVIDIA
Programming framework: CUDA / OpenCL
Position of store instruction in code: Same line of code for all threads / different lines of code.
Write destination: Fixed address / fixed offset from the address of a function parameter / completely dynamic
Write width: 8 / 32 / 64 bits.

Are we guaranteed that, when kernel execution has concluded, the results of one of these writes will be present in that location, rather than some junk?

For current CUDA GPUs, and I'm pretty sure for NVIDIA GPUs with OpenCL, the answer is yes. Most of my terminology below will have CUDA in view. If you require an exhaustive answer for both CUDA and OpenCL, let me know, and I'll delete this answer. Very similar questions to this one have been asked, and answered, before anyway. Here's another, and I'm sure there are others.

When multiple "simultaneous" writes occur to the same location, one of them will win, intact.

Which one will win is undefined. The behavior of the non-winning writes is also undefined (they may occur, but be replaced by the winner, or they may not occur at all.) The actual contents of the memory location may transit through various values (such as the original value, plus any of the valid written values), but the transit will not pass through "junk" values (i.e. values that were not already there and were not written by any thread.) The transit ends up at the "winner", eventually.

Example 1:

Location X contains zero. Threads 1,5,32, 30000, and 450000 all write one to that location. If there is no other write traffic to that location, that location will eventually contain the value of one (at kernel termination, or earlier).

Example 2:

Location X contains 5. Thread 32 writes 1 to X. Thread 90303 writes 7 to X. Thread 432322 writes 972 to X. If there is no other write traffic to that location, upon kernel termination, or earlier, location X will contain either 1, 7 or 972. It will not contain any other value, including 5.

I'm assuming X is in global memory, and all traffic to it is naturally aligned to it, and all traffic to it is of the same size, although these principles apply to shared memory as well. I'm also assuming you have not violated CUDA programming principles, such as the requirement for naturally aligned traffic to device memory locations. The transactions I have in view here are those transactions that originate from a single SASS instruction (per thread) Such transactions can have a width of 1,2,4,or 8bytes. The claims I've made here apply whether the writes are originating from "the same line of code" or "different lines".

These claims are based on the PTX memory consistency model, and therefore the "correctness" is ensured by the GPU hardware, not by the compiler, the CUDA programming model, or the C++ standard that CUDA is based on.

This is a fairly complex topic (especially when we factor in cache behavior, and what to expect when we throw reads in the mix), but "junk" values should never occur. The only values that should occur in global memory are those values that were there to begin with, or those values that were written by some thread, somewhere.

Recommended topics

Hot tags