Are we guaranteed that, when kernel execution has concluded, the results of one of these writes will be present in that location, rather than some junk?
For current CUDA GPUs, and I'm pretty sure for NVIDIA GPUs with OpenCL, the answer is yes. Most of my terminology below will have CUDA in view. If you require an exhaustive answer for both CUDA and OpenCL, let me know, and I'll delete this answer. Very similar questions to this one have been asked, and answered, before anyway. Here's another, and I'm sure there are others.
When multiple "simultaneous" writes occur to the same location, one of them will win, intact.
Which one will win is undefined. The behavior of the non-winning writes is also undefined (they may occur but be replaced by the winner, or they may not occur at all). The actual contents of the memory location may transit through various values (such as the original value, plus any of the valid written values), but the transit will not pass through "junk" values (i.e., values that were not already there and were not written by any thread). The transit ends up at the "winner", eventually.
Example 1:
Location X contains zero. Threads 1, 5, 32, 30000, and 450000 all write one to that location. If there is no other write traffic to that location, that location will eventually contain the value of one (at kernel termination, or earlier).
Example 2:
Location X contains 5. Thread 32 writes 1 to X. Thread 90303 writes 7 to X. Thread 432322 writes 972 to X. If there is no other write traffic to that location, upon kernel termination, or earlier, location X will contain either 1, 7 or 972. It will not contain any other value, including 5.
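Here is a minimal sketch of Example 2 as a CUDA program, in case you want to try it yourself. The kernel name `racingWrites`, the block size of 256, and the use of a global thread index to stand in for "thread 32", "thread 90303", and "thread 432322" are my own assumptions for illustration; they are not part of the example as stated.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Three specific threads store different values to the same global location.
// Each store is a plain (non-atomic) 4-byte write to a naturally aligned int.
__global__ void racingWrites(int *x)
{
    unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid == 32)     *x = 1;
    if (tid == 90303)  *x = 7;
    if (tid == 432322) *x = 972;
}

int main()
{
    const unsigned int numThreads = 432323;  // enough to include thread 432322
    const unsigned int blockSize  = 256;     // assumed launch configuration
    const unsigned int numBlocks  = (numThreads + blockSize - 1) / blockSize;

    int h_x = 5;                             // initial contents of location X
    int *d_x = nullptr;
    if (cudaMalloc(&d_x, sizeof(int)) != cudaSuccess) {
        printf("cudaMalloc failed\n");
        return 1;
    }
    cudaMemcpy(d_x, &h_x, sizeof(int), cudaMemcpyHostToDevice);

    racingWrites<<<numBlocks, blockSize>>>(d_x);
    cudaError_t err = cudaDeviceSynchronize();  // wait for kernel termination
    if (err != cudaSuccess) {
        printf("kernel failed: %s\n", cudaGetErrorString(err));
        cudaFree(d_x);
        return 1;
    }

    cudaMemcpy(&h_x, d_x, sizeof(int), cudaMemcpyDeviceToHost);
    bool ok = (h_x == 1 || h_x == 7 || h_x == 972);
    printf("X = %d (%s)\n", h_x, ok ? "one of the written values" : "unexpected");

    cudaFree(d_x);
    return ok ? 0 : 1;
}
```

Which of the three values gets printed may differ between runs, GPUs, or compiler versions; the only expectation is that it is one of the three.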
I'm assuming X is in global memory, that all traffic to it is naturally aligned, and that all traffic to it is of the same size, although these principles apply to shared memory as well. I'm also assuming you have not violated CUDA programming principles, such as the requirement for naturally aligned traffic to device memory locations. The transactions I have in view here are those that originate from a single SASS instruction (per thread). Such transactions can have a width of 1, 2, 4, or 8 bytes. The claims I've made here apply whether the writes originate from "the same line of code" or "different lines".
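To make the "single instruction" scope concrete, here is a hedged sketch that contrasts two cases. The struct `Triple` and the kernels `writeInt` and `writeTriple` are hypothetical names of my own. A 4-byte `int` store falls within the claims above. A 12-byte struct with 4-byte alignment is, as far as I know, typically written with several separate store instructions, so the claims would apply to each field individually but not to the struct as a whole. I have not inspected the SASS for this exact code, so treat that as an assumption.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical 12-byte type with 4-byte alignment.
struct Triple { int a, b, c; };

// Every thread stores its own index with one naturally aligned 4-byte write.
__global__ void writeInt(int *x)
{
    *x = blockIdx.x * blockDim.x + threadIdx.x;
}

// Every thread stores {tid, tid, tid}. The struct copy is assumed to compile
// to more than one store instruction, so fields from different threads may mix.
__global__ void writeTriple(Triple *t)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    Triple v = {tid, tid, tid};
    *t = v;
}

int main()
{
    const int blocks = 1024, threads = 256;  // assumed launch configuration
    int *d_x = nullptr;
    Triple *d_t = nullptr;
    if (cudaMalloc(&d_x, sizeof(int)) != cudaSuccess ||
        cudaMalloc(&d_t, sizeof(Triple)) != cudaSuccess) {
        printf("cudaMalloc failed\n");
        return 1;
    }
    cudaMemset(d_x, 0, sizeof(int));
    cudaMemset(d_t, 0, sizeof(Triple));

    writeInt<<<blocks, threads>>>(d_x);
    writeTriple<<<blocks, threads>>>(d_t);
    if (cudaDeviceSynchronize() != cudaSuccess) {
        printf("kernel failed\n");
        return 1;
    }

    int h_x = 0;
    Triple h_t = {0, 0, 0};
    cudaMemcpy(&h_x, d_x, sizeof(int), cudaMemcpyDeviceToHost);
    cudaMemcpy(&h_t, d_t, sizeof(Triple), cudaMemcpyDeviceToHost);

    // The int should be some thread's index, i.e. in [0, blocks * threads).
    printf("int: %d (in range: %s)\n", h_x,
           (h_x >= 0 && h_x < blocks * threads) ? "yes" : "no");
    // The fields may or may not agree; no particular outcome is promised.
    printf("struct: {%d, %d, %d} (fields agree: %s)\n", h_t.a, h_t.b, h_t.c,
           (h_t.a == h_t.b && h_t.b == h_t.c) ? "yes" : "no");

    cudaFree(d_x);
    cudaFree(d_t);
    return 0;
}
```

Seeing matching fields on a given run would not show that the struct write is safe; it would only mean no mixing was observed that time.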
These claims are based on the PTX memory consistency model, and therefore the "correctness" is ensured by the GPU hardware, not by the compiler, the CUDA programming model, or the C++ standard that CUDA is based on.
This is a fairly complex topic (especially when we factor in cache behavior, and what to expect when we throw reads in the mix), but "junk" values should never occur. The only values that should occur in global memory are those values that were there to begin with, or those values that were written by some thread, somewhere.