SSE instructions: which CPUs can do atomic 16B memory operations?
D

7

35

Consider a single memory access (a single read or a single write, not read+write) SSE instruction on an x86 CPU. The instruction is accessing 16 bytes (128 bits) of memory and the accessed memory location is aligned to 16 bytes.

The document "Intel® 64 Architecture Memory Ordering White Paper" states that for "Instructions that read or write a quadword (8 bytes) whose address is aligned on an 8 byte boundary" the memory operation appears to execute as a single memory access regardless of memory type.

The question: Do there exist Intel/AMD/etc x86 CPUs which guarantee that reading or writing 16 bytes (128 bits) aligned to a 16 byte boundary executes as a single memory access? If so, which particular type of CPU is it (Core2/Atom/K8/Phenom/...)? If you provide an answer (yes/no) to this question, please also specify the method that was used to determine the answer - PDF document lookup, brute force testing, math proof, or whatever other method you used to determine the answer.

This question relates to problems such as http://research.swtch.com/2010/02/off-to-races.html


Update:

I created a simple test program in C that you can run on your computers. Please compile and run it on your Phenom, Athlon, Bobcat, Core2, Atom, Sandy Bridge or whatever SSE2-capable CPU you happen to have. Thanks.

// Compile with:
//   gcc -o a a.c -pthread -msse2 -std=c99 -Wall -O2
//
// Make sure you have at least two physical CPU cores or hyper-threading.

#include <pthread.h>
#include <emmintrin.h>
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>   // abort()
#include <string.h>

typedef int v4si __attribute__ ((vector_size (16)));
volatile v4si x;

unsigned n1[16] __attribute__((aligned(64)));
unsigned n2[16] __attribute__((aligned(64)));

void* thread1(void *arg) {
        for (int i=0; i<100*1000*1000; i++) {
                int mask = _mm_movemask_ps((__m128)x);
                n1[mask]++;

                x = (v4si){0,0,0,0};
        }
        return NULL;
}

void* thread2(void *arg) {
        for (int i=0; i<100*1000*1000; i++) {
                int mask = _mm_movemask_ps((__m128)x);
                n2[mask]++;

                x = (v4si){-1,-1,-1,-1};
        }
        return NULL;
}

int main() {
        // Check memory alignment
        if ( (((uintptr_t)&x) & 0x0f) != 0 )
                abort();

        memset(n1, 0, sizeof(n1));
        memset(n2, 0, sizeof(n2));

        pthread_t t1, t2;
        pthread_create(&t1, NULL, thread1, NULL);
        pthread_create(&t2, NULL, thread2, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);

        for (unsigned i=0; i<16; i++) {
                for (int j=3; j>=0; j--)
                        printf("%d", (i>>j)&1);

                printf("  %10u %10u", n1[i], n2[i]);
                if(i>0 && i<0x0f) {
                        if(n1[i] || n2[i])
                                printf("  Not a single memory access!");
                }

                printf("\n");
        }

        return 0;
}

The CPU I have in my notebook is a Core Duo (not Core2). This particular CPU fails the test: it implements 16-byte memory reads/writes with a granularity of 8 bytes. The output is:

0000    96905702      10512
0001           0          0
0010           0          0
0011          22      12924  Not a single memory access!
0100           0          0
0101           0          0
0110           0          0
0111           0          0
1000           0          0
1001           0          0
1010           0          0
1011           0          0
1100     3092557       1175  Not a single memory access!
1101           0          0
1110           0          0
1111        1719   99975389
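
To read the table: each row is the 4-bit MOVMSKPS mask of one observed value of x, where bit i is set when 32-bit lane i was all ones. Rows 0000 and 1111 are whole values; 0011 and 1100 mean the low and high 8-byte halves came from different stores. A minimal sketch of that classification follows (the enum and function names are made up for illustration and are not part of the test program):

```c
/* Hypothetical helper names, for illustration only. */
enum tear_kind {
    TEAR_NONE,    /* all four lanes agree: the value was seen whole */
    TEAR_8_BYTE,  /* low and high 8-byte halves disagree */
    TEAR_FINER    /* lanes inside one 8-byte half disagree */
};

/* mask is the MOVMSKPS result: bit i is the sign bit of 32-bit lane i,
   lane 0 being the lowest-addressed 4 bytes of x. */
static enum tear_kind classify_mask(unsigned mask)
{
    mask &= 0xFu;
    if (mask == 0x0u || mask == 0xFu)
        return TEAR_NONE;
    if (mask == 0x3u || mask == 0xCu)
        return TEAR_8_BYTE;
    return TEAR_FINER;
}
```

With the Core Duo output above, only the 0011 and 1100 rows are non-zero besides 0000 and 1111, which is what classify_mask reports as TEAR_8_BYTE.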
Deering answered 4/10, 2011 at 9:48 Comment(4)
Really? And what do you think will happen when you install only 1 memory module in a Core2 motherboard? If you happen to have a Core2 CPU (or other "modern" x86-64 CPU), try installing only 1 memory module in your machine, actually run the test program I provided, and then please post your results. Thanks.Deering
To clarify the text I wrote in the comment which starts with "Really? ...". The comment was a response to a previous message from a person who believed that my StackOverflow question is related to FSB or DRAM. In a way, it is somewhat related to FSB and DRAM, but the relationship of my question to FSB and DRAM is insignificant and does not play a major role. ... The person then deleted his/her own answer and comments, thus effectively erasing them from the historical record. But, if you delete yourself from history, who will be able to remember you? Nobody.Deering
That's a really interesting discussion, thanks! It raises a question though: if 8 bytes is the maximum size of atomic memory accesses, does that mean that 80-bit extended floats are non-atomic?Chapple
@Jens: Yes, 80bit FPU loads are probably not atomic. The ISA doesn't guarantee it (across all implementations), and in practice they're probably not atomic on most recent CPUs. e.g. FLD m80 on Intel Haswell is 4 total uops: 2 ALU and 2 load-port. So it might well be implemented internally as a 64b load and a 16b load. 80bit FP is not a performance priority. 80bit FP stores take 7 uops, with a throughput of one per 5 cycles.Nonparous
R
43

In the Intel® 64 and IA-32 Architectures Software Developer's Manual: Vol. 3A, which nowadays contains the specifications of the memory ordering white paper you mention, it is said in section 8.1.1 that:

The Intel486 processor (and newer processors since) guarantees that the following basic memory operations will always be carried out atomically:

  • Reading or writing a byte.

  • Reading or writing a word aligned on a 16-bit boundary.

  • Reading or writing a doubleword aligned on a 32-bit boundary.

The Pentium processor (and newer processors since) guarantees that the following additional memory operations will always be carried out atomically:

  • Reading or writing a quadword aligned on a 64-bit boundary.

  • 16-bit accesses to uncached memory locations that fit within a 32-bit data bus.

The P6 family processors (and newer processors since) guarantee that the following additional memory operation will always be carried out atomically:

  • Unaligned 16-, 32-, and 64-bit accesses to cached memory that fit within a cache line.

Processors that enumerate support for Intel® AVX (by setting the feature flag CPUID.01H:ECX.AVX[bit 28]) guarantee that the 16-byte memory operations performed by the following instructions will always be carried out atomically:

  • MOVAPD, MOVAPS, and MOVDQA.
  • VMOVAPD, VMOVAPS, and VMOVDQA when encoded with VEX.128.
  • VMOVAPD, VMOVAPS, VMOVDQA32, and VMOVDQA64 when encoded with EVEX.128 and k0 (masking disabled).

(Note that these instructions require the linear addresses of their memory operands to be 16-byte aligned.)

Each of the writes x = (v4si){0,0,0,0} and x = (v4si){-1,-1,-1,-1} is probably compiled into a single 16-byte MOVAPS. The address of x is 16-byte aligned. On an Intel processor that supports AVX, these writes are atomic. Otherwise, they are not guaranteed to be atomic.

On AMD processors, AMD64 Architecture Programmer's Manual, Section 7.3.2 Access Atomicity states that

Cacheable, naturally-aligned single loads or stores of up to a quadword are atomic on any processor model, as are misaligned loads or stores of less than a quadword that are contained entirely within a naturally-aligned quadword. Misaligned load or store accesses typically incur a small latency penalty. Model-specific relaxations of this quadword atomicity boundary, with respect to this latency penalty, may be found in a given processor's Software Optimization Guide. Misaligned accesses can be subject to interleaved accesses from other processors or cache-coherent devices which can result in unintended behavior. Atomicity for misaligned accesses can be achieved where necessary by using the XCHG instruction or any suitable LOCK-prefixed instruction. Processors that report CPUID Fn0000_0001_ECX[AVX](bit 28) = 1 extend the atomicity for cacheable, naturally-aligned single loads or stores from a quadword to a double quadword.

That is, AMD, like Intel, guarantees that on processors supporting AVX, cacheable, naturally-aligned 16-byte loads and stores are atomic.

On Intel and AMD processors that don't support AVX, the CMPXCHG16B instruction with the LOCK prefix can be used. You can use the CPUID instruction to figure out if your processor supports CMPXCHG16B (the "CX16" feature bit).
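
A minimal sketch of checking both feature bits at run time, assuming GCC or Clang on x86 (it uses their <cpuid.h> helper __get_cpuid; the wrapper function names are made up for illustration). AVX is CPUID.01H:ECX bit 28, as quoted above, and CX16 is CPUID.01H:ECX bit 13:

```c
#include <cpuid.h>   /* GCC/Clang helper header for the CPUID instruction */

/* Returns 1 if CPUID.01H:ECX bit `bit` is set, 0 otherwise
   (including when CPUID leaf 1 is not available). */
static int cpuid1_ecx_bit(unsigned bit)
{
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;
    return (int)((ecx >> bit) & 1u);
}

/* Hypothetical wrapper names. */
static int cpu_has_avx(void)  { return cpuid1_ecx_bit(28); }  /* AVX feature flag */
static int cpu_has_cx16(void) { return cpuid1_ecx_bit(13); }  /* CMPXCHG16B available */
```

A program could use cpu_has_avx() to decide whether plain aligned 16-byte loads/stores are documented as atomic, and fall back to LOCK CMPXCHG16B when only cpu_has_cx16() is true.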

EDIT: Test program results

(Test program modified to increase #iterations by a factor of 10)

On a Xeon X3450 (x86-64):

0000   999998139       1572
0001           0          0
0010           0          0
0011           0          0
0100           0          0
0101           0          0
0110           0          0
0111           0          0
1000           0          0
1001           0          0
1010           0          0
1011           0          0
1100           0          0
1101           0          0
1110           0          0
1111        1861  999998428

On a Xeon 5150 (32-bit):

0000   999243100     283087
0001           0          0
0010           0          0
0011           0          0
0100           0          0
0101           0          0
0110           0          0
0111           0          0
1000           0          0
1001           0          0
1010           0          0
1011           0          0
1100           0          0
1101           0          0
1110           0          0
1111      756900  999716913

On an Opteron 2435 (x86-64):

0000   999995893       1901
0001           0          0
0010           0          0
0011           0          0
0100           0          0
0101           0          0
0110           0          0
0111           0          0
1000           0          0
1001           0          0
1010           0          0
1011           0          0
1100           0          0
1101           0          0
1110           0          0
1111        4107  999998099

Note that the Intel Xeon X3450 and Xeon 5150 don't support AVX. The Opteron 2435 is an AMD processor (K10 "Istanbul") that also does not support AVX.

Does this mean that Intel and/or AMD guarantee that 16-byte memory accesses are atomic on these machines? IMHO, it does not. It's not in the documentation as guaranteed architectural behavior, and thus one cannot know whether 16-byte memory accesses really are atomic on these particular processors or whether the test program merely fails to trigger a torn access for one reason or another. Relying on it is therefore dangerous.

EDIT 2: How to make the test program fail

Ha! I managed to make the test program fail. On the same Opteron 2435 as above, with the same binary, but now running it via the "numactl" tool specifying that each thread runs on a separate socket, I got:

0000   999998634       5990
0001           0          0
0010           0          0
0011           0          0
0100           0          0
0101           0          0
0110           0          0
0111           0          0
1000           0          0
1001           0          0
1010           0          0
1011           0          0
1100           0          1  Not a single memory access!
1101           0          0
1110           0          0
1111        1366  999994009

So what does this imply? Well, the Opteron 2435 may, or may not, guarantee that 16-byte memory accesses are atomic for intra-socket accesses, but at least the cache coherency protocol running on the HyperTransport interconnect between the two sockets does not provide such a guarantee.

EDIT 3: ASM for the thread functions, on request of "GJ."

Here's the generated asm for the thread functions for the GCC 4.4 x86-64 version used on the Opteron 2435 system:


.globl thread2
        .type   thread2, @function
thread2:
.LFB537:
        .cfi_startproc
        movdqa  .LC3(%rip), %xmm1
        xorl    %eax, %eax
        .p2align 5,,24
        .p2align 3
.L11:
        movaps  x(%rip), %xmm0
        incl    %eax
        movaps  %xmm1, x(%rip)
        movmskps        %xmm0, %edx
        movslq  %edx, %rdx
        incl    n2(,%rdx,4)
        cmpl    $1000000000, %eax
        jne     .L11
        xorl    %eax, %eax
        ret
        .cfi_endproc
.LFE537:
        .size   thread2, .-thread2
        .p2align 5,,31
.globl thread1
        .type   thread1, @function
thread1:
.LFB536:
        .cfi_startproc
        pxor    %xmm1, %xmm1
        xorl    %eax, %eax
        .p2align 5,,24
        .p2align 3
.L15:
        movaps  x(%rip), %xmm0
        incl    %eax
        movaps  %xmm1, x(%rip)
        movmskps        %xmm0, %edx
        movslq  %edx, %rdx
        incl    n1(,%rdx,4)
        cmpl    $1000000000, %eax
        jne     .L15
        xorl    %eax, %eax
        ret
        .cfi_endproc

and for completeness, .LC3 which is the static data containing the (-1, -1, -1, -1) vector used by thread2:


.LC3:
        .long   -1
        .long   -1
        .long   -1
        .long   -1
        .ident  "GCC: (GNU) 4.4.4 20100726 (Red Hat 4.4.4-13)"
        .section        .note.GNU-stack,"",@progbits

Also note that this is AT&T ASM syntax, not the Intel syntax Windows programmers might be more familiar with. Finally, this is with -march=native, which makes GCC prefer MOVAPS, but it doesn't matter: if I use -march=core2 it uses MOVDQA for storing to x, and I can still reproduce the failures.

Rapp answered 4/10, 2011 at 12:23 Comment(20)
No! Reading or writing should be performed atomically granted by hardware up to 32 bytes if translated memory in under one cache line. Check my answer.Newman
@GJ.No, you're wrong. The cache line size (64 bytes on most, if not all, x86 machines, FWIW) does not determine the atomicity of memory accesses.Rapp
True for RMW, but not for single read xor write! If cache bust read/write is interrupted the exception fault is rasing.Newman
Check again the Intel manual from your link section 8.1.1: Accesses to cacheable memory that are split across cache lines and page boundaries are not guaranteed to be atomic by the Intel Core 2 Duo, Intel® Atom™, Intel Core Duo, Pentium M, Pentium 4, Intel Xeon, P6 family, Pentium, and Intel486 processors. So, in any other case is atomic.Newman
@GJ.: I have no idea what you're trying to say, but anyway, if the processor implements a 16-byte store instruction internally as 2 8-byte stores in the store pipeline (as it's allowed to do per the architectural guarantees provided in the programming manual), it's perfectly possible for another processor to "steal" the cache line in between the two stores. Unlikely yes, but not impossible, as can be seen from the failed test I show in my answer.Rapp
@janneb: About the guarantee of atomicity of the memory access: Well, in my opinion, based on complexity argument, given the results you provided it is possible to derive that 16-byte aligned memory accesses via MOVAPS and MOVDQA are always atomic. If this isn't enough proof, you might want to replace the for-loop in the test program with a loop that will take 10+ minutes to execute, and let it run while you are doing other tasks on the machine. If you will do this, please append the obtained results to your answer - I would like to know whether the complexity argument actually works.Deering
@Atom: Actually, as my last test shows, at least on the Opteron 2435, MOVDQA is NOT atomic.Rapp
@janneb: Are you running the memory controller on the Opteron 2435 setup in ganged mode or in unganged mode?Deering
@Atom: Unganged, I believe; It's usually the default these days. That being said, I think it's irrelevant for this question since the cache coherency hw will transfer the cache line back and forth without it hitting the memory controllers.Rapp
@janneb: Huh, I didn't see your second edit... But I still doubt it, because I have never experienced such a problem. Can you show how the debugger sees the thread procedure's assembly?Newman
@janneb: Thanx, I have check code and looks OK. I still can't reproduce the error on my PCs on any way.Newman
@janneb: I marked your answer as the correct one. I gave the bounty to GJ.Deering
@janneb, could you please tell me how to do "via the "numactl" tool specifying that each thread runs on a separate socket"Blowing
@DerekZhang: This was done ~8 years ago, so I don't remember exactly. Basically, checking the NUMA topology of the system (see /proc/cpuinfo), then use numactl to allow use of 2 cores on different sockets. E.g. something like "numactl --physcpubind=N,M ./a.out" where N and M are indices of two cpu cores on different sockets.Rapp
rigtorp.se/isatomic has test results for more modern CPUs. Still a total lack of documented guarantees that anything beyond 8 bytes (or lock cmpxchg16b) is atomic, though.Nonparous
@PeterCordes: I just found out that in the current version (Nov 2021, though it may have been introduced earlier but I haven't followed it) of the Intel manual volume 3A (linked from the answer), certain 16-byte memory operations are guaranteed to be atomic provided the CPU supports AVX.Rapp
@janneb: Yeah, I saw Hadi's edit! That's a game changer for compilers, and also about freaking time. (Also excellent that they tied it to an existing feature bit like AVX, so new builds can take advantage of the HW capability that's been there the whole time. Except on Pentium/Celeron budget CPUs; having the upper halves of their SIMD execution units fused off obviously has no impact on atomicity of 16-byte stuff, but they lose out anyway.)Nonparous
That EDIT2 is horrific. Literally one-in-a-billion torn accesses, in a test designed specifically to provoke it. Have fun reproducing any bugs based on that one...Maros
@PeterCordes: FYI the AMD manual now also states that processors supporting AVX provide 16-byte atomicity; I updated the URL and quote from the manual to reflect that.Rapp
Yeah, I saw your edit bump this Q&A; thanks for maintaining it :) whatishappened's answer from 2022 indicated that AMD were planning to do that, and that GNU libatomic had already updated on the assumption that the guarantee holds for AMD, but it's nice to see it got officially published.Nonparous
N
6

Update: in 2022, Intel retroactively documented that the AVX feature bit implies that aligned 128-bit loads/stores are atomic, at least for Intel CPUs. AMD has since documented the same guarantee; in practice their CPUs with AVX support had, I think, already avoided tearing on 8-byte boundaries. See @whatishappened's answer, and janneb's updated answer.

Pentium and Celeron versions of CPUs with AVX will also have the same atomicity in practice, but no documented way for software to detect it. Also presumably Core 2 and Nehalem, and probably some low-power Silvermont-family chips, which haven't had AVX until Alder Lake E-cores.

So finally we can have cheapish atomic __int128 loads/stores on AVX CPUs in a well-documented way. (So C++ std::atomic is_lock_free() could return true on some machines. But not is_always_lock_free as a compile-time constant unless arch options make a binary that requires AVX. GCC previously used lock cmpxchg16b to implement load/store, but changed in GCC7 IIRC to not advertise that as "lock free" since it didn't have the read-side scaling you'd expect with proper support.)
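
As a rough sketch of what such a cheap 16-byte load/store could look like, assuming GCC or Clang on x86 with the default AT&T inline-asm syntax (the helper names are made up for illustration). Inline asm is used so that exactly one MOVDQA is emitted per access. The atomicity is only documented when the CPU reports AVX, and the pointer must be 16-byte aligned or MOVDQA faults:

```c
#include <emmintrin.h>   /* SSE2: __m128i and the _mm_* helpers */

/* Hypothetical helper: one aligned 16-byte load via MOVDQA.
   p must be 16-byte aligned. */
static inline __m128i load16_movdqa(const void *p)
{
    __m128i v;
    __asm__ __volatile__("movdqa %1, %0"
                         : "=x"(v)
                         : "m"(*(const __m128i *)p)
                         : "memory");
    return v;
}

/* Hypothetical helper: one aligned 16-byte store via MOVDQA.
   p must be 16-byte aligned. */
static inline void store16_movdqa(void *p, __m128i v)
{
    __asm__ __volatile__("movdqa %1, %0"
                         : "=m"(*(__m128i *)p)
                         : "x"(v)
                         : "memory");
}
```

This only covers the atomicity of the single access. Anything stronger than plain x86 load/store ordering (for example a sequentially consistent store) would still need an extra fence or a locked instruction.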


Old partly-updated answer below

Erik Rigtorp has done some experimental testing on recent Intel and AMD CPUs to look for tearing. Results at https://rigtorp.se/isatomic/. Keep in mind there's no documentation or guarantee about this behaviour (beyond 128-bit or on non-AVX CPUs), and IDK if it's possible for a custom many-socket machine using such CPUs to have less atomicity than the machines he tested on. But on current x86 CPUs (not K10), SIMD atomicity for aligned loads/stores simply scales with data-path width between the execution units and L1d cache.



The x86 ISA only guarantees atomicity for things up to 8B, so implementations are free to implement SSE / AVX support the way Pentium III / Pentium M / Core Duo do: internally, data is handled in 64-bit halves. A 128-bit store is done as two 64-bit stores. The data path to/from cache is only 64 bits wide in the Yonah microarchitecture (Core Duo). (Source: Agner Fog's microarch doc.)

More recent implementations do have wider data paths internally, and handle 128b instructions as a single op. Core 2 Duo (conroe/merom) was the first Intel P6-descended microarch with 128b data paths. (IDK about P4, but fortunately it's old enough to be totally irrelevant.)

This is why the OP finds that 128b ops are not atomic on Intel Core Duo (Yonah), but other posters find that they are atomic on later Intel designs, starting with Core 2 (Merom).

The diagrams on this Realworldtech writeup about Merom vs. Yonah show the 128bit path between ALU and L1 data-cache in Merom (and P4), while the low-power Yonah has a 64bit data path. The data path between L1 and L2 cache is 256b in all 3 designs.

The next jump in data path width came with Intel's Haswell, featuring 256b (32B) AVX/AVX2 loads/stores, and a 64Byte path between L1 and L2 cache. I expect that 256b loads/stores are atomic in Haswell, Broadwell, and Skylake, but I don't have one to test.

Skylake-AVX512 has 512-bit data paths, so they're also naturally atomic at least in reading/writing L1d cache. The ring bus (client chips) transfers in 32-byte chunks, but Intel guarantees no tearing between 32B halves, since they guarantee atomicity for 8-byte load/store at any misalignment as long as it doesn't cross a cache-line boundary.

Zen 4 handles 512-bit ops as two halves, so probably not 512-bit atomicity.


As janneb points out in his excellent experimental answer, the cache-coherency protocol between sockets in a multi-socket system might be narrower than what you get within a shared-last-level-cache CPU. There is no architectural requirement on atomicity for wide loads/stores, so designers are free to make them atomic within a socket but non-atomic across sockets if that's convenient. IDK how wide the inter-socket logical data path is for AMD's Bulldozer-family, or for Intel. (I say "logical", because even if the data is transferred in smaller chunks, it might not modify a cache line until it's fully received.)


Finding similar articles about AMD CPUs should allow drawing reasonable conclusions about whether 128b ops are atomic or not. Just checking instruction tables is some help:

K8 decodes movaps reg, [mem] to 2 m-ops, while K10 and bulldozer-family decode it to 1 m-op. AMD's low-power bobcat decodes it to 2 ops, while jaguar decodes 128b movaps to 1 m-op. (It supports AVX1 similar to bulldozer-family CPUs: 256b insns (even ALU ops) are split into two 128b ops. Intel SnB only splits 256b loads/stores, while having full-width ALUs.)

janneb's Opteron 2435 is a 6-core Istanbul CPU, which is part of the K10 family, so this single-m-op -> atomic conclusion appears accurate within a single socket.

Intel Silvermont does 128b loads/stores with a single uop, and a throughput of one per clock. This is the same as for integer loads/stores, so it's quite probably atomic.

Nonparous answered 10/10, 2015 at 4:2 Comment(2)
I've seen your comment stating some guarantees are retroactively documented. Care to update the answer?Kilgore
@AlexGuteniev: Janneb's answer on this question is already updated with that info. But I guess I should correct this answer's statement about an explicit lack of vendor documentation; thanks for pointing that out. Updated.Nonparous
E
5

The "AMD64 Architecture Programmer's Manual Volume 1: Application Programming" says in section 3.9.1: "CMPXCHG16B can be used to perform 16-byte atomic accesses in 64-bit mode (with certain alignment restrictions)."

However, there is no such comment about SSE instructions. In fact, there is a comment in 4.8.3 that the LOCK prefix "causes an invalid-opcode exception when used with 128-bit media instructions". It therefore seems pretty conclusive to me that the AMD processors do NOT guarantee atomic 128-bit accesses for SSE instructions, and the only way to do an atomic 128-bit access is to use CMPXCHG16B.

The "Intel 64 and IA-32 Architectures Software Developer’s Manual Volume 3A: System Programming Guide, Part 1" says in 8.1.1 "An x87 instruction or an SSE instructions that accesses data larger than a quadword may be implemented using multiple memory accesses." This is pretty conclusive that 128-bit SSE instructions are not guaranteed atomic by the ISA. Volume 2A of the Intel docs says of CMPXCHG16B: "This instruction can be used with a LOCK prefix to allow the instruction to be executed atomically."

Further, CPU manufacturers haven't published written guarantees of atomic 128b SSE operations for specific CPU models where that is the case.
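
For reference, a 16-byte atomic load can be built from LOCK CMPXCHG16B by comparing against an arbitrary guess and "replacing" with that same guess: memory is never changed to a different value, and RDX:RAX ends up holding the current contents either way. A minimal sketch, assuming x86-64 with GCC or Clang inline asm (the u128 type and function name are made up for illustration). The target must be 16-byte aligned and writable, since the instruction always performs a write cycle:

```c
#include <stdint.h>

/* Hypothetical 16-byte type; CMPXCHG16B needs 16-byte alignment. */
typedef struct {
    uint64_t lo;
    uint64_t hi;
} __attribute__((aligned(16))) u128;

/* Atomically reads *p using LOCK CMPXCHG16B (x86-64 only). */
static inline u128 load16_cmpxchg16b(u128 *p)
{
    uint64_t lo = 0, hi = 0;          /* RDX:RAX = expected value (any guess works) */
    __asm__ __volatile__("lock cmpxchg16b %0"
                         : "+m"(*p), "+a"(lo), "+d"(hi)
                         : "b"((uint64_t)0),   /* RCX:RBX = new value, same as the guess */
                           "c"((uint64_t)0)
                         : "cc", "memory");
    /* If *p matched the guess, the guess was stored back unchanged;
       otherwise RDX:RAX was loaded with the current contents of *p. */
    u128 r = { lo, hi };
    return r;
}
```

Because it is a locked read-modify-write, this is much more expensive than a plain load and does not scale for concurrent readers of the same cache line.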

Ecesis answered 7/10, 2011 at 15:48 Comment(6)
CMPXCHG16B instruction means: IF MEM[ADDR]==X THEN MEM[ADDR]:=Y. This implies that you already need to know the value X that is stored in memory at ADDR before writing Y. My question assumes that the 128-bit data type at ADDR uses all of the 128 bits - thus any of the (1<<128) possible values can be stored there. So, it is impossible to know whether MEM[ADDR] equals to X or not. (Sidenote: there is a universal method of how to encode a 16-byte value using more than 16 bytes (for example: 24 bytes) so that the 16 bytes can be read/written safely. But that is a different question.)Deering
You can use CMPXCHG16B to read a value. First load any value into RDX:RAX, and load the same values into RCX:RBX. All registers zero would do. Then do CMPXCHG16B [addr]. If the values match then the same value is stored back. If they don't match then RDX:RAX is updated to the actual value. Either way RDX:RAX holds the original stored value, and the memory is unchanged.Ecesis
About your sentence "This is pretty conclusive that 128-bit SSE instructions are never atomic": There is no evidence to support this sentence on particular CPUs. The answers to my question so far support the claim that on Core2 Quad Q6600, Core2 Duo P8400, Pentium4 hyper-threading, Xeon X3450, Xeon 5150, and 1-socket Opteron 2435, the memory accesses are always atomic. Maybe the test program should be modified so that the data goes through the memory chip more often, or the test program should run for a longer period of time (1 hour) while running/doing other tasks on the machine.Deering
Sorry. I meant never guaranteed to be atomic. Particular CPUs may happen to make them atomic, and they may appear atomic under particular test conditions, but there is no guarantee.Ecesis
I think it's quite likely that some CPU designs in practice do have atomic 128b loads/stores, because their wide data paths never split the operation separate parts. The docs never say anything about this, and there isn't a CPUID bit for it, because they don't want people writing code that depends on it. (future low-power CPUs can always implement SIMD with narrower data paths.) But just because no Intel or AMD manual has the guarantee written down doesn't mean there isn't one, e.g. for Sandybridge. And probably 256b is atomic on Haswell. See my answer. Made the edit so I don't have to -1Nonparous
@AnthonyWilliams, the problem of using CMPXCHG16B to read a value is that CMPXCHG16B on read-only memory results in segfault. See gcc.gnu.org/bugzilla/show_bug.cgi?id=80878 and gcc.gnu.org/bugzilla/show_bug.cgi?id=94649Aerophone
B
3

There is actually a warning in the Intel Architecture Manual Vol 3A. Section 8.1.1 (May 2011), under the section of guaranteed atomic operations:

An x87 instruction or an SSE instructions that accesses data larger than a quadword may be implemented using multiple memory accesses. If such an instruction stores to memory, some of the accesses may complete (writing to memory) while another causes the operation to fault for architectural reasons (e.g. due an page-table entry that is marked “not present”). In this case, the effects of the completed accesses may be visible to software even though the overall instruction caused a fault. If TLB invalidation has been delayed (see Section 4.10.4.4), such page faults may occur even if all accesses are to the same page.

thus SSE instructions are not guaranteed to be atomic, even if the underlying architecture does use a single memory access (this is one reason why the memory fencing was introduced).

Combine that with this statement from the Intel Optimization Manual, Section 13.3 (April 2011)

AVX and FMA instructions do not introduce any new guaranteed atomic memory operations.

and the fact that none of the SIMD load or store operations guarantees atomicity, we can come to the conclusion that Intel doesn't support any form of atomic SIMD (yet).

As an extra bit, if the memory is split along cache lines or page boundaries (when using things like movdqu which permit unaligned access), the following processors will not perform atomic accesses, regardless of alignment, but later processors will (again from the Intel Architecture Manual):

Intel Core 2 Duo, Intel® Atom™, Intel Core Duo, Pentium M, Pentium 4, Intel Xeon, P6 family, Pentium, and Intel486 processors. The Intel Core 2 Duo, Intel Atom, Intel Core Duo, Pentium M, Pentium 4, Intel Xeon, and P6 family processors

Broeker answered 7/10, 2011 at 8:42 Comment(6)
The question explicitly states that all the 16-byte memory acceses are aligned to 16 bytes. Thus none of them, by construction, crosses cache lines or page boundaries.Deering
About the wording "may be implemented using multiple memory accesses" in the Intel Architecture Manual: My question is whether there exist concrete physical CPUs which happen to be implemented so that the memory accesses are always atomic.Deering
@Necrolis: Yes, and movdqa doesn't perform any SIMD operation it is only memory to register or register to memory move.Newman
@Atom: That wording basically means: "there are none following this in our current product range, but we are free to add it in future", ie: there is no current processor that officially does this (from Intel), and tests at this level are unreliable as unofficial proof. The cache split/page boundary stuff was just some extra misc info, mainly for things like movdqu.Broeker
@GJ: its still an SSE2 instruction, so it falls under the umbrella of Streaming SIMD (especially since the loads and stores are of multiple packed values). and btw, it also works on register to register moves :PBroeker
@Necrolis: Yes this it is true for loads and stores of multiple packed values. But not for single 16 byte read or 16 byte write.Newman
P
2

It looks like AMD will also specify in the next revision of their manual that aligned 16-byte loads and stores are atomic on their x86 processors which support AVX. (Source)

Apologies for late response!

We would update the AMD APM manuals in the next revision.

For all AMD architectures,

Processors that support AVX extend the atomicity for cacheable, naturally-aligned single loads or stores from a quadword to a double quadword.

which means all 128b instructions, even the *MOVDQU instructions, are atomic if they end up being naturally aligned.

Can we extend this patch to AMD processors as well. If not, I will plan to submit the patch for stage-1!

With this, the patch making libatomic use vmovdqa in their implementation of __atomic_load_16 and __atomic_store_16 not only on Intel processors with AVX but also on AMD processors with AVX has landed on the master branch.

Pincus answered 28/11, 2022 at 13:11 Comment(0)
N
-1

EDIT: In the last two days I have run several tests on my three PCs and could not reproduce any memory error, so I can't say anything more precise. Maybe this memory error also depends on the OS.

EDIT: I program in Delphi, not C, but I can read C. So I have translated the code; here are the thread procedures, with the main part written in assembler:

procedure TThread1.Execute;
var
  n             :cardinal;
const
  ConstAll0     :array[0..3] of integer =(0,0,0,0);
begin
  for n := 0 to 100000000 do
    asm
      movdqa    xmm0, dqword [x]
      movmskps  eax, xmm0
      inc       dword ptr[n1 + eax *4]
      movdqu    xmm0, dqword [ConstAll0]
      movdqa    dqword [x], xmm0
    end;
end;

{ TThread2 }

procedure TThread2.Execute;
var
  n             :cardinal;
const
  ConstAll1     :array[0..3] of integer =(-1,-1,-1,-1);
begin
  for n := 0 to 100000000 do
    asm
      movdqa    xmm0, dqword [x]
      movmskps  eax, xmm0
      inc       dword ptr[n2 + eax *4]
      movdqu    xmm0, dqword [ConstAll1]
      movdqa    dqword [x], xmm0
    end;
end;

Result: no mistake on my quad core PC and no mistake on my dual core PC as expected!

  1. PC with Intel Pentium4 CPU
  2. PC with Intel Core2 Quad CPU Q6600
  3. PC with Intel Core2 Duo CPU P8400

Can you show how the debugger sees your thread procedure code? Please...

Newman answered 7/10, 2011 at 1:52 Comment(7)
FWIW, when I ran the tests for my answer, I checked that GCC 4.1 and 4.4 (x86 and x86-64) use movdqa when storing to x.Rapp
@janneb: ok I have translated main thread part of code to memonic and test it on two PCs. Result: no mistake!Newman
@GJ: Thanks. Could you please update your answer to include the kind(s) of CPU(s) you tested - just like "janneb" did.Deering
@GJ: About the Pentium4 CPU you tested: Does it have multiple cores and/or hyper-threading?Deering
@Atom: First one has hyper-threading. Other two are woking under 4 and 2 cores. OS first two win XP and last one win Vista.Newman
@GJ: I marked janneb's answer as the correct one. The bounty goes to you.Deering
rigtorp.se/isatomic has test results for more modern CPUs.Nonparous
R
-1

A lot of answers have been posted so far, so a lot of information is already available (and, as a side effect, a lot of confusion too). I would like to cite facts from the Intel manual regarding hardware-guaranteed atomic operations ...

In Intel's latest processors of the Nehalem and Sandy Bridge families, reading or writing a quadword aligned on a 64-bit boundary is guaranteed to be atomic.

Even unaligned 2, 4 or 8 byte reads or writes are guaranteed to be atomic provided they are cached memory and fit in a cache line.
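
The "fit in a cache line" condition is simple address arithmetic. A small sketch, assuming 64-byte cache lines (typical for x86, but an assumption here; the macro and function names are made up for illustration):

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_SIZE 64u   /* assumption: 64-byte cache lines */

/* Returns 1 if the access [addr, addr + size) lies within one cache line. */
static int fits_in_cache_line(uintptr_t addr, size_t size)
{
    return size <= CACHE_LINE_SIZE &&
           (addr % CACHE_LINE_SIZE) + size <= CACHE_LINE_SIZE;
}
```

For example, an 8-byte access at offset 56 of a line fits, while the same access at offset 57 straddles two lines and falls outside the guarantee. The check says nothing about 16-byte accesses, which this guarantee does not cover.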

Having said that, the test posted in this question passes on a Sandy Bridge based Intel i5 processor.

Rajput answered 10/10, 2011 at 10:46 Comment(1)
This question is specifically about 16 byte reads/writes, quadword is 8 bytes. In any case, thanks for running the test program.Deering
