Per-element atomicity of vector load/store and gather/scatter?

Consider an array like atomic<int32_t> shared_array[]. What if you want to SIMD vectorize for(...) sum += shared_array[i].load(memory_order_relaxed)? Or to search an array for the first non-zero element, or zero a range of it? It's probably rare, but consider any use-case where tearing within an element is not allowed, but reordering between elements is fine. (Perhaps a search to find a candidate for a CAS.)
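(For concreteness, here's a rough sketch of the vectorized sum I have in mind, using AVX2 intrinsics. It assumes std::atomic<int32_t> is lock-free with the same object representation as plain int32_t, that the array is 32-byte aligned with a multiple-of-8 count, and of course the per-element load atomicity this question is asking about; the function name is just for illustration.)

    #include <atomic>
    #include <cstddef>
    #include <cstdint>
    #include <immintrin.h>

    // Sum an array of atomic<int32_t> with relaxed per-element semantics:
    // each 4-byte lane must not tear, but lanes may be read in any order.
    int32_t relaxed_sum(std::atomic<int32_t>* shared_array, std::size_t n) {
        __m256i vsum = _mm256_setzero_si256();
        for (std::size_t i = 0; i < n; i += 8) {
            // Assumes atomic<int32_t> has the same size/layout as int32_t,
            // so a plain 32-byte vmovdqa load covers 8 elements.
            __m256i v = _mm256_load_si256(
                reinterpret_cast<const __m256i*>(shared_array + i));
            vsum = _mm256_add_epi32(vsum, v);
        }
        // Horizontal sum of the 8 lanes.
        __m128i s = _mm_add_epi32(_mm256_castsi256_si128(vsum),
                                  _mm256_extracti128_si256(vsum, 1));
        s = _mm_add_epi32(s, _mm_shuffle_epi32(s, _MM_SHUFFLE(1, 0, 3, 2)));
        s = _mm_add_epi32(s, _mm_shuffle_epi32(s, _MM_SHUFFLE(2, 3, 0, 1)));
        return _mm_cvtsi128_si32(s);
    }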

I think x86 aligned vector loads/stores would be safe in practice to use for SIMD with mo_relaxed operations, because any tearing will only happen at 8B boundaries at worst on current hardware (because that's what makes naturally-aligned 8B accesses atomic; see footnote 1). Unfortunately Intel's manuals only say:

"An x87 instruction or an SSE instructions that accesses data larger than a quadword may be implemented using multiple memory accesses."

There's no guarantee that those component accesses are naturally aligned, non-overlapping, or anything else. (Fun fact: x87 10-byte fld m80 loads are done with 2 load uops and 2 ALU uops on Haswell, according to Agner Fog, presumably qword + word.)

If you wanted to vectorize in a future-proof way that current x86 manuals say will work on all future x86 CPUs, you could load / store in 8B chunks with movq / movhps.
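(A sketch of what that looks like with intrinsics; the helper name is mine, and it assumes the pointer is at least 8-byte aligned so each half is a naturally-aligned qword access.)

    #include <cstdint>
    #include <immintrin.h>

    // Build a 16-byte vector from two naturally-aligned 8-byte loads
    // (movq + movhps), each of which the manuals do guarantee to be atomic.
    // The store direction would use _mm_storel_epi64 + _mm_storeh_pi.
    static inline __m128i load_16B_as_two_8B(const int32_t* p) {
        __m128i lo = _mm_loadl_epi64(reinterpret_cast<const __m128i*>(p));  // movq
        __m128  hi = _mm_loadh_pi(_mm_castsi128_ps(lo),
                                  reinterpret_cast<const __m64*>(p + 2));   // movhps
        return _mm_castps_si128(hi);
    }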

Or maybe you could use 256b vpmaskmovd with an all-true mask, because the Operation section of the manual defines it in terms of multiple separate 32-bit loads, like Load_32(mem + 4). Does that mean each element acts as a separate 32-bit access, guaranteeing atomicity within that element?

(On real hardware, it's 1 load and 2 port5 uops on Haswell, or on Ryzen just 1 or 2 load+ALU uops (128 / 256). I assume that's for the case where no exceptions need to be suppressed from elements that go into an unmapped page, since that can be slower (but IDK if it needs a microcode assist). Anyway, this tells us it's at least as atomic as a normal vmovdqa load on Haswell, but that tells us nothing about an x86 Deathstation 9000 where 16B / 32B vector accesses are broken into single-byte accesses so there can be tearing within each element.

I think in reality it's safe to assume that you won't see tearing within a 16-, 32- or 64-bit element for aligned vector loads/stores on any real x86 CPU, because that wouldn't make sense for an efficient implementation that already has to keep naturally-aligned 64-bit scalar stores atomic, but it's interesting to know how far the guarantees in the manuals actually go.)
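(In intrinsics terms, the all-true-mask version I'm describing would be something like the sketch below; the function name is mine. The mask condition is the sign bit of each dword element, so all-ones selects every element.)

    #include <cstdint>
    #include <immintrin.h>

    // 256-bit load via vpmaskmovd with an all-true mask, relying on the
    // manual's per-element Load_32() pseudocode (the interpretation in
    // question) rather than on plain vmovdqa.
    static inline __m256i maskload_all_true(const int32_t* p) {
        __m256i all_true = _mm256_set1_epi32(-1);
        return _mm256_maskload_epi32(p, all_true);   // vpmaskmovd ymm, ymm, m256
    }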


Gather (AVX2, AVX512) / Scatter (AVX512)

Instructions like vpgatherdd are more obviously composed of multiple separate 32b or 64b accesses. The AVX2 form is documented as doing multiple FETCH_32BITS(DATA_ADDR); so presumably this is covered by the usual atomicity guarantees, and each element will be gathered atomically if it doesn't cross a boundary.
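(For reference, the intrinsic form of that, as a sketch with an illustrative wrapper name:)

    #include <cstdint>
    #include <immintrin.h>

    // Gather 8 dwords with vpgatherdd: each element is documented as a
    // separate FETCH_32BITS access. scale = 4 because the indices are in
    // units of int32_t elements.
    static inline __m256i gather8(const int32_t* base, __m256i indices) {
        return _mm256_i32gather_epi32(base, indices, 4);
    }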

AVX512 gathers are documented in Intel's PDF insn ref manual as

    DEST[i+31:i] <- MEM[BASE_ADDR + SignExtend(VINDEX[i+31:i]) * SCALE + DISP]

for each element separately. (Ordering: Elements may be gathered in any order, but faults must be delivered in a right-to-left order. Memory ordering with other instructions follows the Intel-64 memory-ordering model.)

AVX512 scatters are documented (page 1802 of the prev link) the same way. Atomicity isn't mentioned, but they do cover some interesting corner cases:

  • If two or more destination indices completely overlap, the “earlier” write(s) may be skipped.

  • Elements may be scattered in any order, but faults must be delivered in a right-to-left order.

  • If this instruction overwrites itself and then takes a fault, only a subset of elements may be completed before the fault is delivered (as described above). If the fault handler completes and attempts to re-execute this instruction, the new instruction will be executed, and the scatter will not complete.

  • Only writes to overlapping vector indices are guaranteed to be ordered with respect to each other (from LSB to MSB of the source registers). Note that this also includes partially overlapping vector indices. Writes that are not overlapped may happen in any order. Memory ordering with other instructions follows the Intel-64 memory ordering model. Note that this does not account for non-overlapping indices that map into the same physical address locations.

(i.e. because the same physical page is mapped into virtual memory at two different virtual addresses. So overlap detection is allowed to happen before (or in parallel with) address translation without rechecking after.)

I included the last two because they're interesting corner cases that I hadn't even thought to wonder about. The self-modifying case is hilarious, though I think rep stosd would have the same issue (it's also interruptible, using rcx to track progress).

I think atomicity is part of the Intel-64 memory ordering model, so the fact that they mention it and don't say anything else seems to imply that the per-element accesses are atomic. (Gathering two adjacent 4B elements almost certainly does not count as a single 8B access.)
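(For concreteness, here's roughly what the overlapping-index rule from the bullet points above means in practice, as an AVX512F intrinsics sketch; the values and function name are just for illustration.)

    #include <cstdint>
    #include <immintrin.h>

    // vpscatterdd with two lanes (3 and 4) targeting index 3. Per the quoted
    // rules, overlapping writes are ordered from LSB to MSB, so the higher
    // lane's value (103) is what ends up in base[3]; the earlier write may be
    // skipped entirely. base must have at least 16 int32_t elements; index 4
    // is never written.
    static inline void scatter_demo(int32_t* base) {
        __m512i idx  = _mm512_setr_epi32(0, 1, 2, 3, 3, 5, 6, 7,
                                         8, 9, 10, 11, 12, 13, 14, 15);
        __m512i vals = _mm512_setr_epi32(100, 101, 102, 999, 103, 105, 106, 107,
                                         108, 109, 110, 111, 112, 113, 114, 115);
        _mm512_i32scatter_epi32(base, idx, vals, 4);
    }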


Which vector load/store instructions are guaranteed by x86 manuals to be atomic on a per-element basis?

Experimental testing on real hardware would almost certainly tell me that everything is atomic on my Skylake CPU, and that's not what this question is about. I'm asking if my interpretation of the manuals is correct for vmaskmov / vpmaskmov loads, and for gather/scatter.

(If there's any reason to doubt that real hardware will continue to be element-wise atomic for simple movdqa loads, that would be a useful answer, too.)


  1. Footnote: x86 atomicity basics:

In x86, naturally-aligned loads and stores of 8B or narrower are guaranteed to be atomic, according to the Intel and AMD manuals. In fact, for cached accesses, any access that doesn't cross an 8B boundary is also atomic. (Intel P6 and later give a stronger guarantee than AMD: anything unaligned that stays within a single cache line (e.g. 64B) is atomic for cached accesses.)
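(This guarantee is what lets compilers implement 8-byte relaxed atomics as plain loads/stores; a minimal illustration, assuming x86-64:)

    #include <atomic>
    #include <cstdint>

    // A relaxed load of a naturally-aligned 8-byte atomic compiles to a plain
    // 8-byte mov on x86-64; the hardware guarantee above is what makes that
    // lock-free implementation correct.
    uint64_t load8(const std::atomic<uint64_t>& x) {
        return x.load(std::memory_order_relaxed);   // e.g. mov rax, [rdi]
    }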

Vector loads/stores of 16B or wider are not guaranteed to be atomic. They are atomic on some CPUs (at least for cached accesses when the observers are other CPUs), but even a 16B-wide atomic access to L1D cache doesn't make the operation atomic system-wide. For example, the HyperTransport coherency protocol between sockets for AMD K10 Opterons introduces tearing between the halves of an aligned 16B vector, even though testing on threads in the same socket (physical CPU) shows no tearing.

(If you need a full 16B atomic load or store, you can hack one up with lock cmpxchg16b like gcc does for std::atomic<T>, but that's terrible for performance. See also Atomic double floating point or SSE/AVX vector load/store on x86_64.)
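(A minimal sketch of that cmpxchg16b hack in GNU C, assuming x86-64 and compiling with -mcx16 so the builtin inlines to lock cmpxchg16b; the helper name is mine.)

    // Emulate an atomic 16-byte load by compare-exchanging the location with a
    // dummy value: if the compare fails, cmpxchg16b returns the current contents
    // atomically; if it succeeds, it stores back the same value it just read.
    // Either way we get an atomic snapshot. Note this counts as a store for
    // coherency purposes (the line is taken in Modified state, and the page must
    // be writable), which is part of why it's terrible for performance.
    static inline unsigned __int128 atomic_load_16B(unsigned __int128 *p) {
        return __sync_val_compare_and_swap(p, (unsigned __int128)0,
                                              (unsigned __int128)0);
    }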

Benjaminbenji answered 2/9, 2017 at 9:56. Comments (7):
Benjaminbenji: Could also tag [avx2], [atomicity], and maybe [assembly]. Relevant for people using intrinsics, but not about it. The 5-tag limit is a problem sometimes.
Meaningless: I don't think we're meant to infer such details from the pseudocode blocks; they've been wrong/inaccurate before (e.g. LDDQU is shown as a single load but it's allowed to load twice even when the address is aligned).
Benjaminbenji: @harold: For gathers and scatters, they take the trouble to mention the memory model. Usefully interacting with other threads requires atomicity at a minimum, so I think that implies gather/scatter are per-element atomic, even if the Operation pseudocode shouldn't be taken that way.
Silhouette: I have a hunch that the answer is no. In the documentation for the very similar instruction vmaskmovps, there is a tiny addition to a similar sentence in vpmaskmovd: VMASKMOV should not be used to access memory mapped I/O and un-cached memory as the access and the ordering of the individual loads or stores it does is implementation specific. The difference is the phrase "the access and". This seems to indicate that Intel reserves the right to implement the vector's individual accesses however it wants. That "the access and" is omitted from vpmaskmovd seems to be just an oversight.
Benjaminbenji: @IwillnotexistIdonotexist: Well spotted. I think that means you could get one wide MMIO write for adjacent unmasked elements, or two narrow ones. But doing one aligned 8-byte atomic store still does give atomicity for the 4-byte halves, even though it's different for MMIO. So I don't think it rules out per-element atomicity, because the implementation-specific part might only be coalescing of element stores into wider and still-atomic stores.
Cabriole: If you don't write assembly directly but C/C++ and don't explicitly use atomic operations, there is also the risk that the compiler might turn your nice load into something else, say a memcpy, which could be implemented as copying the bytes one by one... I am hitting a situation similar to what you describe: I have intervals (__m128d); when I update the interval it is fine to update one bound at a time, but each double should be a true value, never a torn mix of several values. Doing 2 atomic loads (or stores) each time would cost too much because compilers don't optimize atomics...
Benjaminbenji: @MarcGlisse: Yeah, that's always been a problem for C, which allows DeathStation 9000 implementations to compile in horrible ways. But even with hand-rolled atomics using volatile T*, this wasn't ever (AFAIK) a problem in practice for objects of the right sizes to be atomic. It does mean you're at the mercy of quality-of-implementation, not ISO C guaranteed behaviour. I think any real-world compiler would at worst split a __m128d in half (like gcc does for unaligned __m256d, but for __m128 on very old CPUs), not other sizes or bad memcpy.
