Atomic operations in CUDA kernels on mapped pinned host memory: to do or not to do?

In the CUDA programming guide it is stated that atomic operations on mapped pinned host memory "are not atomic from the point of view of the host or other devices." What I take from this sentence is that if the host memory region is accessed by only one GPU, it is fine to do atomics on the mapped pinned host memory (even from within multiple simultaneous kernels).

On the other hand, in the book The CUDA Handbook by Nicholas Wilt, on page 128, it is stated that:

Do not try to use atomics on mapped pinned host memory, either for the host (locked compare-exchange) or the device (atomicAdd()). On the CPU side, the facilities to enforce mutual exclusion for locked operations are not visible to peripherals on the PCI express bus. Conversely, on the GPU side, atomic operations only work on local device memory locations because they are implemented using the GPU's local memory controller.

Is it safe to do atomics from inside a CUDA kernel on mapped pinned host memory? Can we rely on the PCI-e bus to preserve the atomicity of the atomics' read-modify-write operations?
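
For concreteness, here is a minimal sketch of the pattern I am asking about (single GPU; hypothetical names, error checking omitted):

    // Minimal sketch: one GPU doing atomicAdd() on a mapped pinned host
    // allocation. Error checking omitted for brevity.
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void incrementCounter(unsigned int *counter)
    {
        atomicAdd(counter, 1u);   // every thread bumps the host-resident word
    }

    int main()
    {
        unsigned int *hostCounter = nullptr, *devAlias = nullptr;

        cudaSetDeviceFlags(cudaDeviceMapHost);               // allow mapped pinned memory
        cudaHostAlloc(&hostCounter, sizeof(*hostCounter),
                      cudaHostAllocMapped);                  // pinned + mapped
        *hostCounter = 0;
        cudaHostGetDevicePointer(&devAlias, hostCounter, 0); // device-side alias

        incrementCounter<<<64, 256>>>(devAlias);
        cudaDeviceSynchronize();

        printf("counter = %u (expected %u)\n", *hostCounter, 64u * 256u);
        cudaFreeHost(hostCounter);
        return 0;
    }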

Nappie answered 21/4, 2014 at 7:46 Comment(3)
I don't believe it can be made to work. Nick Wilt posts on SO as archaeasoftware. I hope he finds this question and answers it. I am pretty sure he will reiterate what is in the book. – Toname
Well, my simple tests don't fail, and using atomics is essential to getting a correct answer for my test. I'd be interested to see a counter example (something that fails on atomic updates of mapped pinned memory from a single GPU). – Kenton
You haven't said whether you are doing CPU or GPU atomics on the mapped pinned host memory. – Benne

The caution is intended for people who are using mapped pinned memory to coordinate execution between the CPU and GPU, or between multiple GPUs. When I wrote that, I did not expect anyone to use such a mechanism in the single-GPU case because CUDA provides so many other, better ways to coordinate execution between the CPU(s) and a single GPU.

If there is strictly a producer/consumer relationship between the CPU and GPU (i.e. the producer is updating the memory location and the consumer is passively reading it), that can be expected to work under certain circumstances.

If the GPU is the producer, the CPU would see updates to the memory location as they get posted out of the GPU’s L2 cache. But the GPU code may have to execute memory barriers to force that to happen; and even if that code works on x86, it’d likely break on ARM without heroic measures because ARM does not snoop bus traffic.
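
For illustration, a hedged sketch of that GPU-producer pattern (hypothetical kernel; assumes the single-GPU, x86 caveats above):

    // Sketch: GPU produces into mapped pinned memory; the CPU polls.
    __global__ void produce(volatile int *data, volatile int *flag)
    {
        if (blockIdx.x == 0 && threadIdx.x == 0) {
            *data = 42;               // write the payload
            __threadfence_system();   // push the write out so the host can see it
            *flag = 1;                // then publish the ready flag
        }
    }

    // Host side, while the kernel runs:
    //     while (*(volatile int *)flag == 0)
    //         ;                      // spin until the GPU publishes
    //     int value = *data;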

If the CPU is the producer, the GPU would have to bypass the L2 cache because it is not coherent with CPU memory.
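
A corresponding sketch for the CPU-producer direction (hypothetical; the volatile qualifier keeps the GPU from reusing a stale value):

    // Sketch: CPU produces, GPU consumes from mapped pinned memory.
    __global__ void consume(volatile int *flag, volatile int *data, int *out)
    {
        while (*flag == 0)
            ;                // spin until the CPU publishes
        *out = *data;        // then read the payload
    }

    // Host side, before/while the kernel runs:
    //     *data = 42;      // write the payload first
    //     *flag = 1;       // then set the flag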

If the CPU and GPU are trying to update the same memory location concurrently, there is no mechanism to ensure atomicity between the two. Doing CPU atomics will ensure that the update is atomic with respect to CPU code, and doing GPU atomics will ensure that the update is atomic with respect to the GPU that is doing the update.
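
To make the hazard concrete, a hypothetical sketch: each side below is atomic with respect to itself, yet running them concurrently on the same mapped word can still lose updates.

    // GPU side: atomic only with respect to this GPU.
    __global__ void gpuIncrement(unsigned int *counter)
    {
        atomicAdd(counter, 1u);
    }

    // CPU side, running concurrently (GCC builtin shown): a locked RMW
    // that is atomic only with respect to other CPU threads. The GPU
    // cannot observe the CPU's bus lock, so increments can be lost.
    void cpuIncrement(unsigned int *counter)
    {
        __sync_fetch_and_add(counter, 1u);
    }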

All of the foregoing discussion assumes there is only one GPU; if multiple GPUs are involved, all bets are off. Although atomics are provided for in the PCI Express 3.0 bus specification, I don’t believe they are supported by NVIDIA GPUs. And support in the underlying platform also is not guaranteed.

It seems to me that whatever a developer may be trying to accomplish by doing atomics on mapped pinned memory, there’s probably a method that is faster, more likely to work, or both.

Benne answered 22/4, 2014 at 17:19 Comment(4)
The reason to ask such a question was to find out whether cuckoo hashing is possible when the device cannot hold the hash table. Cuckoo hashing mainly relies on 64-bit atomic exchange on the table entries. – Nappie
If you are just trying to work around the GPU's limit of physical memory, it may work; but I expect it will be slow. If you build it, I'd love to hear about the results! – Benne
We're actually proposing a new hashing approach appropriate for GPUs, and AFAIK cuckoo hashing is the state-of-the-art GPU solution out there to compare against. Being slow when exceeding global memory capacity seems to be one of the limitations of cuckoo hashing on GPUs. – Nappie
Any algorithm that spills out of global memory will take a big performance hit. – Benne
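
For reference, a hedged sketch of the 64-bit exchange a cuckoo insertion step relies on (hypothetical entry packing):

    // Sketch: one cuckoo insertion step. Entries are hypothetically packed
    // as (key << 32) | value in a 64-bit slot; atomicExch returns the
    // evicted occupant, which the caller re-inserts at its alternate slot.
    __device__ unsigned long long cuckooStep(unsigned long long *slot,
                                             unsigned long long entry)
    {
        return atomicExch(slot, entry);
    }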

Yes, this works atomically from a single GPU. So if no other CPU or GPU is accessing the memory, it will be atomic. Atomics are implemented in the L2 cache and the CROP (depending on the GPU), and both can handle system memory accesses.

It will be slow, though. This memory is not cached on the GPU.
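
A rough, hypothetical micro-benchmark to see that cost (names are made up; numbers will vary by GPU and bus):

    // Compare atomicAdd() time on device memory versus mapped pinned host
    // memory. Error checking omitted for brevity.
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void hammer(unsigned int *counter, int iters)
    {
        for (int i = 0; i < iters; ++i)
            atomicAdd(counter, 1u);   // every iteration hits the target memory
    }

    static float timeKernel(unsigned int *ptr, int iters)
    {
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);
        cudaEventRecord(start);
        hammer<<<32, 128>>>(ptr, iters);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        return ms;
    }

    int main()
    {
        unsigned int *dev = nullptr, *host = nullptr, *hostDev = nullptr;
        cudaSetDeviceFlags(cudaDeviceMapHost);
        cudaMalloc(&dev, sizeof(*dev));
        cudaMemset(dev, 0, sizeof(*dev));
        cudaHostAlloc(&host, sizeof(*host), cudaHostAllocMapped);
        *host = 0;
        cudaHostGetDevicePointer(&hostDev, host, 0);

        printf("device memory: %8.3f ms\n", timeKernel(dev, 1000));
        printf("mapped pinned: %8.3f ms\n", timeKernel(hostDev, 1000));
        return 0;
    }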

When Nick says, "the facilities to enforce mutual exclusion for locked operations are not visible to peripherals on the PCI express bus", it makes me think he's referring to the lack of atomicity when accessing that memory from both processors, which is correct.

Tryck answered 21/4, 2014 at 17:21 Comment(2)
Yeah, my guess is he is referring to the lack of coherency across the PCI-e bus during a running kernel. – Toname
System memory accesses aren't cached on the GPU? That is news to me. I believe if there's reuse, the L2 will happily service reads from system memory just as it does device memory. In fact, I can think of at least one public demo that relied heavily on that feature. – Benne
