Are one-sided RDMA reads atomic for single cache lines?

My group (a project called Isis2) is experimenting with RDMA. We're puzzled by the lack of documentation for the atomicity guarantees of one-sided RDMA reads. I've spent the past hour and a half hunting for any kind of information at all on this to no avail. This includes close reading of the blog at rdmamojo.com, famous for having answers to every RDMA question...

In the case we are focused on, writers on machine A perform atomic writes to objects that always fit within a single cache line. A reader on machine B then issues one-sided RDMA reads that fetch chunks of A's memory spanning many of these objects (again, no object is ever written non-atomically, and each fits within a single cache line). So B reads X, Y and Z; each of those objects lives in one cache line on A and was written with atomic writes.

Thus the atomic writes will be local, but the RDMA reads will arrive from remote machines and are done with no local CPU involvement.

Are our one-sided reads "semantically equivalent" to atomic local reads despite being initiated on the remote machine? (I suspect so: otherwise, one-sided RDMA reads would be useless for data that is ever modified...). And where are the "rules" documented?

Estreat answered 11/11, 2015 at 13:46 Comment(1)
At best, it seems to be implementation-specific. (Meadow)

Ok, meanwhile I seem to have found the correct answer, and I believe that Roland's response is not quite right -- partly right but not entirely.

In http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-manual-325462.pdf, which is the Intel architecture manual (I'll need to check again for AMD...), I found this: "Atomic memory operation in Intel 64 and IA-32 architecture is guaranteed only for a subset of memory operand sizes and alignment scenarios. The list of guaranteed atomic operations are described in Section 8.1.1 of IA-32 Intel® Architecture Software Developer’s Manual, Volumes 3A."

That section, which sits in the chapter entitled MULTIPLE-PROCESSOR MANAGEMENT, has a lot of information about guaranteed atomic operations (page 2210). In particular, Intel guarantees that its memory subsystem handles natively sized operands atomically: bytes, and 16-, 32- and 64-bit values such as integers and floats. These objects must be aligned so that they fit within a cache line (64 bytes on the current Intel platforms) without crossing a cache line boundary. Given that, Intel guarantees that stores and fetches will be atomic no matter what device is using the memory bus.
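To make the alignment condition concrete, here is a minimal C++ sketch of a check that an object does not straddle a cache line. The 64-byte line size is the figure quoted above; `kCacheLine`, `fits_in_one_cache_line` and `Counter` are names made up for this illustration and do not come from the Intel manual or any RDMA library.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumption: 64-byte cache lines, as on the Intel platforms discussed here.
constexpr std::size_t kCacheLine = 64;

// True if the byte range [addr, addr + size) lies inside one cache line.
constexpr bool fits_in_one_cache_line(std::uintptr_t addr, std::size_t size) {
    return size != 0 && size <= kCacheLine &&
           (addr / kCacheLine) == ((addr + size - 1) / kCacheLine);
}

// Hypothetical record: a naturally aligned 64-bit value can never cross
// a 64-byte boundary, because 64 is a multiple of its 8-byte alignment.
struct alignas(8) Counter {
    std::uint64_t value;
};
static_assert(sizeof(Counter) == 8 && alignof(Counter) == 8,
              "Counter is expected to be a naturally aligned 64-bit value");
```

For example, an 8-byte object at offset 56 of a line fits, while the same object at offset 60 would span two lines and falls outside the guarantee.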

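For more complex objects, locking is required if you want to be sure you will get a safe execution. Further, if you are doing multicore operations you have to use the locked (atomic) variants of the Intel instructions to be sure of coherency for concurrent read-modify-write updates. Note that marking a variable volatile is not enough for this: in C++ volatile provides neither atomicity nor ordering (use std::atomic), and in C# and Java it orders plain reads and writes but does not make an update such as an increment atomic (use Interlocked or java.util.concurrent.atomic for that).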
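On the writer side, a minimal C++ sketch of local atomic updates to a slot that occupies exactly one cache line might look like the following. `Slot`, `publish`, `increment` and `local_read` are hypothetical names. The sketch assumes that `std::atomic<std::uint64_t>` is lock-free and has the same size as a plain 64-bit integer, which holds on mainstream x86-64 toolchains but is not promised by the C++ standard. It shows only the CPU side and says nothing about what the RDMA adapter does.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical slot type; in a real setup it would live inside the memory
// region registered with the RDMA adapter.
struct alignas(64) Slot {
    std::atomic<std::uint64_t> value{0};
};

static_assert(sizeof(Slot) == 64, "one slot per 64-byte cache line");
static_assert(sizeof(std::atomic<std::uint64_t>) == sizeof(std::uint64_t),
              "assumed: the atomic has the same size as the raw integer");
static_assert(ATOMIC_LLONG_LOCK_FREE == 2,
              "assumed: 64-bit atomics are always lock-free on this target");

// Plain atomic store: a single aligned 64-bit write.
inline void publish(Slot& s, std::uint64_t v) {
    s.value.store(v, std::memory_order_release);
}

// Atomic read-modify-write; on x86 this compiles to a locked instruction.
inline std::uint64_t increment(Slot& s) {
    return s.value.fetch_add(1, std::memory_order_acq_rel) + 1;
}

// Local atomic read, for comparison with what a remote reader would fetch.
inline std::uint64_t local_read(const Slot& s) {
    return s.value.load(std::memory_order_acquire);
}
```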

What this adds up to is that local writes to native types can be paired with remotely initiated RDMA reads safely.

But notice that strings, byte arrays -- those would not be atomic because they could easily cross a cache line. Also, operations on complex objects with more than one data field might not be atomic -- for such things you would need a more complex approach, such as the one in the FaRM paper (Fast Remote Memory) by MSR. My own need is simpler and won't require the elaborate version numbering scheme FaRM implements...
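For objects that span several cache lines, the reader-side half of a per-line versioning scheme can be sketched as below. This is only loosely inspired by the FaRM idea and is not FaRM's actual layout or protocol; `Line`, `kLineBytes` and `snapshot_is_consistent` are hypothetical names. It assumes that each 64-byte line is fetched atomically (the very property under discussion), and that the writer marks every line with an odd version before touching its payload and with the same new even version afterwards, with stores becoming visible in program order.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kLineBytes = 64;  // assumed cache line size

// Hypothetical layout: every line of the object starts with a version word.
struct alignas(64) Line {
    std::uint64_t version;
    unsigned char payload[kLineBytes - sizeof(std::uint64_t)];
};
static_assert(sizeof(Line) == kLineBytes, "one Line per cache line");

// `lines` is the reader's local copy, e.g. the buffer filled by one RDMA read.
// Returns true only if all lines carry the same even version, meaning no
// write was in progress on any line and no two lines come from different
// writes. On false, the caller would simply reissue the read.
inline bool snapshot_is_consistent(const Line* lines, std::size_t n) {
    if (n == 0) return false;
    const std::uint64_t v = lines[0].version;
    if (v % 2 != 0) return false;  // odd: writer was mid-update
    for (std::size_t i = 1; i < n; ++i) {
        if (lines[i].version != v) return false;  // torn across lines
    }
    return true;
}
```

The cost is one version word per line plus a retry loop on the reader, which is why a single native-sized value per cache line is the simpler option when it suffices.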

Estreat answered 12/11, 2015 at 20:56 Comment(0)

The cache coherence protocol implemented in the PCIe controller should guarantee atomicity for single cache line RDMA reads. The PCIe controller has to snoop the caches of CPU cores and take ownership of the cache line (RFO) before returning data to the RDMA adapter. So it should see some snapshot of the cache line.

Invertase answered 22/11, 2015 at 18:42 Comment(0)

I don't know of any such guarantee of atomicity. RDMA reads are executed by the remote adapter, whereas cache line size is a CPU concept. I don't believe anything ensures that the granularity of the reads issued by the remote RDMA adapter matches the size of the writes performed by the remote CPU.

In practice it is likely to work, since the remote adapter will probably issue a single PCI transaction for the read, but I don't think there is anything architectural that guarantees you don't get "torn" data.

Wrangler answered 12/11, 2015 at 17:17 Comment(0)
