Per-element atomicity of vector load/store and gather/scatter?
Asked Answered
B

0

40

Consider an array like atomic<int32_t> shared_array[]. What if you want to SIMD vectorize for(...) sum += shared_array[i].load(memory_order_relaxed)?. Or to search an array for the first non-zero element, or zero a range of it? It's probably rare, but consider any use-case where tearing within an element is not allowed, but reordering between elements is fine. (Perhaps a search to find a candidate for a CAS).

I think x86 aligned vector loads/stores would be safe in practice to use on for SIMD with mo_relaxed operations, because any tearing will only happen at 8B boundaries at worst on current hardware (because that's what makes naturally-aligned 8B accesses atomic1). Unfortunately Intel's manuals only say:

"An x87 instruction or an SSE instructions that accesses data larger than a quadword may be implemented using multiple memory accesses."

There's no guarantee that those component accesses are naturally aligned, non-overlapping, or anything else. (Fun fact: x87 10-byte fld m80 loads done with 2 load uops and 2 ALU uops on Haswell, according to Agner Fog, presumably qword + word.)

If you wanted to vectorize in a future-proof way that current x86 manuals say will work on all future x86 CPUs, you could load / store in 8B chunks with movq / movhps.

Or maybe you could use 256b vpmaskmovd with an all-true mask, because the Operation section of the manual defines it in terms of multiple separate 32-bit loads, like Load_32(mem + 4). Does that mean each element acts as a separate 32-bit access, guaranteeing atomicity within that element?

(On real hardware, it's 1 load and 2 port5 uops on Haswell, or on Ryzen just 1 or 2 load+ALU uops (128 / 256). I assume that's for the case where no exceptions need to be suppressed from from elements that go into an unmapped page, since that can be slower (but IDK if it needs a microcode assist). Anyway, this tells us it's at least as atomic as a normal vmovdqa load on Haswell, but that tells us nothing about an x86 Deathstation 9000 where 16B / 32B vector accesses are broken into single-byte accesses so there can be tearing within each element.

I think in reality it's safe to assume that you won't see tearing within a 16, 32 or 64-bit element for aligned vector loads/stores on any real x86 CPU, because that wouldn't make sense for an efficient implementation that already has to keep naturally-aligned 64-bit scalar stores atomic, but it's interesting to know how far the guarantees in the manuals actually go.)


Gather (AVX2,AVX512) / Scatter (AVX512)

Instructions like vpgatherdd are more obviously composed of multiple separate 32b or 64b accesses. The AVX2 form is documented as doing multiple FETCH_32BITS(DATA_ADDR); so presumably this is covered by the usual atomicity guarantees, and each element will be gathered atomically if it doesn't cross a boundary.

AVX512 gathers are documented in Intel's PDF insn ref manual as
DEST[i+31:i] <- MEM[BASE_ADDR + SignExtend(VINDEX[i+31:i]) * SCALE + DISP]), 1) for each element separately. (Ordering: Elements may be gathered in any order, but faults must be delivered in a right-to-left order. Memory ordering with other instructions follows the Intel- 64 memory-ordering model.)

AVX512 scatters are documented (page 1802 of the prev link) the same way. Atomicity isn't mentioned, but they do cover some interesting corner cases:

  • If two or more destination indices completely overlap, the “earlier” write(s) may be skipped.

  • Elements may be scattered in any order, but faults must be delivered in a right-to left order

  • If this instruction overwrites itself and then takes a fault, only a subset of elements may be completed before the fault is delivered (as described above). If the fault handler completes and attempts to re-execute this instruction, the new instruction will be executed, and the scatter will not complete.

  • Only writes to overlapping vector indices are guaranteed to be ordered with respect to each other (from LSB to MSB of the source registers). Note that this also include partially overlapping vector indices. Writes that are not overlapped may happen in any order. Memory ordering with other instructions follows the Intel-64 memory ordering model. Note that this does not account for non-overlapping indices that map into the same physical address locations.

(i.e. because the same physical page is mapped into virtual memory at two different virtual addresses. So overlap detection is allowed to happen before (or in parallel with) address translation without rechecking after.)

I included the last two because they're interesting corner cases that I hadn't even thought to wonder about. The self-modifying case is hilarious, though I think rep stosd would have the same issue (it's also interruptible, using rcx to track progress).

I think atomicity is part of the Intel-64 memory ordering model, so the fact that they mention it and don't say anything else seems to imply that the per-element accesses are atomic. (Gathering two adjacent 4B elements almost certainly does not count as a single 8B access.)


Which vector load/store instructions are guaranteed by x86 manuals to be atomic on a per-element basis?

Experimental testing on real hardware would almost certainly tell me that everything is atomic on my Skylake CPU, and that's not what this question is about. I'm asking if my interpretation of the manuals is correct for vmaskmov / vpmaskmov loads, and for gather/scatter.

(If there's any reason to doubt that real hardware will continue to be element-wise atomic for simple movdqa loads, that would be a useful answer, too.)


  1. Footnote: x86 atomicity basics:

In x86, naturally-aligned loads and stores of 8B or narrower are guaranteed to be atomic, according to Intel and AMD manuals. In fact, for cached accesses, any access that doesn't cross an 8B boundary is also atomic. (On Intel P6 and later give a stronger guarantee than AMD: unaligned within a cache line (e.g. 64B) is atomic for cached accesses).

Vector loads/stores of 16B or wider are not guaranteed to be atomic. They are on some CPUs (at least for cached accesses when the observers are other CPUs), but even 16B-wide atomic access to L1D cache doesn't make it atomic. For example, the HyperTransport coherency protocol between sockets for AMD K10 Opterons introduces tearing between halves of an aligned 16B vector, even though testing on threads in the same socket (physical CPU) shows no tearing.

(If you need a full 16B atomic load or store, you can hack one up with lock cmpxchg16b like gcc does for std::atomic<T>, but that's terrible for performance. See also Atomic double floating point or SSE/AVX vector load/store on x86_64.)

Benjaminbenji answered 2/9, 2017 at 9:56 Comment(7)
Could also tag [avx2], [atomicity], and maybe [assembly]. Relevant for people using intrinsics, but not about it. The 5-tag limit is a problem sometimes.Benjaminbenji
I don't think we're meant to infer such details from the pseudocode blocks, they've been wrong/inaccurate before (eg LDDQU is shown as a single load but it's allowed to load twice even when the address is aligned)Meaningless
@harold: For gathers and scatters, they take the trouble to mention the memory model. Usefully interacting with other threads requires atomicity at a minimum, so I think that implies gather/scatter are per-element atomic, even if the Operation pseudocode shouldn't be taken that way.Benjaminbenji
I have a hunch that the answer is no. In the documentation for the very similar instruction vmaskmovps, there is a tiny addition to a similar sentence in vpmaskmovd: VMASKMOV should not be used to access memory mapped I/O and un-cached memory as the access and the ordering of the individual loads or stores it does is implementation specific. The bold is the difference. This seems to indicate that Intel reserves the right to implement the vector's individual accesses however it wants. That "the access and" is omitted from vpmaskmovd seems to be an omission.Silhouette
@IwillnotexistIdonotexist: Well spotted. I think that means you could get one wide MMIO write for adjacent unmasked elements, or two narrow ones. But doing one aligned 8-byte atomic store still does give atomicity for the 4-byte halves, even though it's different for MMIO. So I don't think it rules out per-element atomicity, because the implementation-specific part might only be coalescing of element stores into wider and still-atomic stores.Benjaminbenji
If you don't write assembly directly but C/C++ and don't explicitly use atomic operations, there is also the risk that the compiler might turn your nice load into something else, say a memcpy, which could be implemented as copying the bytes one by one... I am hitting a situation similar to what you describe, I have intervals (__m128d), when I update the interval it is fine to update one bound at a time, but each double should be a true value, never a teared mix of several values. Doing 2 atomic load (or store) each time would cost too much because compilers don't optimize atomics...Cabriole
@MarcGlisse: Yeah, that's always been a problem for C, which allows DeathStation 9000 implementations to compile in horrible ways. But even with hand-rolled atomics using volatile T*, this wasn't ever (AFAIK) a problem in practice for objects of the right sizes to be atomic. It does mean you're at the mercy of quality-of-implementation, not ISO C guaranteed behaviour. I think any real-world compiler would at worst split a __m128d in half (like gcc unaligned __m256d but for __m128 on very old CPUs), not other sizes or bad memcpy.Benjaminbenji

© 2022 - 2025 — McMap. All rights reserved.