Atomic 16 byte read on x64 CPUs

I need to read and write 16 bytes atomically. All writes use cmpxchg16b, which is available on all x64 processors except, I think, one obscure AMD one.

Now the question: for aligned 16-byte values that are only ever modified using cmpxchg16b (which acts like a full memory barrier), is it ever possible to read a 16-byte location and see half old data and half new data?

As long as I read with an SSE instruction (so the thread cannot be interrupted in the middle of the read), I think it's impossible (even on multiprocessor NUMA systems) for the read to see inconsistent data. I think it must be atomic.

I am assuming that when cmpxchg16b executes, it modifies the 16 bytes atomically, rather than writing two 8-byte blocks with the potential for other threads to read in between (honestly, I don't see how it could work if it weren't atomic).

Am I right? If I'm wrong, is there a way to do an atomic 16 byte read without resorting to locking?

Note: There are a couple of similar questions here, but they don't deal with the case where the writes are done only with cmpxchg16b, so I feel this is a separate, unanswered question.

Edit: Actually, I think my reasoning was faulty. An SSE load instruction may be executed as two 64-bit reads, and it may be possible for another processor to execute the cmpxchg16b in between the two reads.
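On the availability point raised in the question, support for the instruction can be checked at run time rather than assumed. The sketch below assumes GCC or Clang on x86-64 and their `<cpuid.h>` helper header; `has_cmpxchg16b` is a hypothetical name, not something from the original post. It tests the CMPXCHG16B feature flag, which CPUID reports in bit 13 of ECX for leaf 1.

```c
#include <cpuid.h>   /* GCC/Clang helper header wrapping the CPUID instruction */

/* Hypothetical helper: returns 1 if the CPU reports CMPXCHG16B support
   (CPUID leaf 1, ECX bit 13), otherwise 0. */
int has_cmpxchg16b(void)
{
    unsigned int eax, ebx, ecx, edx;

    /* __get_cpuid returns 0 if the requested leaf is not supported. */
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;

    return (int)((ecx >> 13) & 1u);
}
```

A program could call this once at startup and fall back to a lock-based path when it returns 0.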

Alisonalissa answered 15/3, 2012 at 19:15 Comment(3)
It was already answered in the linked question that 16-byte SSE reads can be implemented with multiple memory accesses, i.e. they are not atomic. It makes no difference that your writes are done atomically with CMPXCHG16B: reads also have to be atomic, or you may see inconsistent data. AFAIK your only choice is to read with CMPXCHG16B. - Yamen
Yes, I made the mistake of thinking I only had to stop the thread from being interrupted between the reads, but the actual bus operations themselves could still be interleaved. - Alisonalissa
Using cmpxchg16b on the reads would slow them down unacceptably. But by using 25% more memory I can do a seqlock-style approach like Dmitry Vyukov's hashmap: 1024cores.net/home/downloads - Alisonalissa
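For readers unfamiliar with the seqlock idea mentioned in the last comment, here is a generic sketch in C11. It is not Dmitry Vyukov's code: `seq_pair`, `seq_pair_write` and `seq_pair_read` are hypothetical names, the 64-bit counter width is an arbitrary choice (a narrower counter would cost less memory), and it assumes writers are serialized by some other means. The writer makes the counter odd while it updates the payload; the reader retries whenever it saw an odd counter or the counter changed during the read.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical layout: a sequence counter guarding a 16-byte payload.
   Even counter = payload stable, odd counter = write in progress. */
typedef struct {
    _Atomic uint64_t seq;
    _Atomic uint64_t lo;
    _Atomic uint64_t hi;
} seq_pair;

/* Assumes only one writer runs at a time. */
void seq_pair_write(seq_pair *p, uint64_t lo, uint64_t hi)
{
    uint64_t s = atomic_load_explicit(&p->seq, memory_order_relaxed);

    atomic_store_explicit(&p->seq, s + 1, memory_order_relaxed); /* mark busy */
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&p->lo, lo, memory_order_relaxed);
    atomic_store_explicit(&p->hi, hi, memory_order_relaxed);
    atomic_store_explicit(&p->seq, s + 2, memory_order_release); /* publish */
}

/* Lock-free read: never writes to shared memory, retries on a torn snapshot. */
void seq_pair_read(seq_pair *p, uint64_t *lo, uint64_t *hi)
{
    uint64_t s0, s1;

    do {
        s0  = atomic_load_explicit(&p->seq, memory_order_acquire);
        *lo = atomic_load_explicit(&p->lo, memory_order_relaxed);
        *hi = atomic_load_explicit(&p->hi, memory_order_relaxed);
        atomic_thread_fence(memory_order_acquire);
        s1  = atomic_load_explicit(&p->seq, memory_order_relaxed);
    } while ((s0 & 1) || s0 != s1);
}
```

The trade-off is that readers only issue plain loads, so they do not contend for the cache line the way a locked cmpxchg16b does, at the cost of extra space per entry and a possible retry.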
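typedef struct
{
  unsigned __int128 value;
} __attribute__ ((aligned (16))) atomic_uint128;

unsigned __int128
atomic_read_uint128 (atomic_uint128 *src)
{
  /* Expected value in RDX:RAX and replacement value in RCX:RBX are both 0,
     so a successful compare stores what was already there.  */
  unsigned long long lo = 0, hi = 0;
  asm volatile ("lock cmpxchg16b %2"
                : "+a"(lo), "+d"(hi),  /* RDX:RAX: expected in, current out */
                  "+m"(*src)           /* the instruction may write *src */
                : "b"(0ULL), "c"(0ULL) /* RCX:RBX: value stored on a match */
                : "cc", "memory");
  return ((unsigned __int128) hi << 64) | lo;
}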

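That should do the trick. The typedef ensures correct alignment: cmpxchg16b needs its memory operand to be aligned on a 16-byte boundary.

The cmpxchg16b compares *src with zero and, if they are equal, writes zero back, which changes nothing. Otherwise it loads the current contents of *src into RDX:RAX. Either way, RDX:RAX holds the value of *src afterwards. Note that the locked instruction performs a write cycle even when the comparison fails, so *src must be in writable memory.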

The code above compiles to something as simple as

push   %rbx
xor    %rax,%rax
xor    %rbx,%rbx
xor    %rcx,%rcx
xor    %rdx,%rdx
lock cmpxchg16b (%rdi)
pop    %rbx
retq
Brandonbrandt answered 15/3, 2012 at 19:32 Comment(1)
Yes, I think this must be the way to do it. It occurs to me now that a simple SSE load can be split into two 64-bit reads and that the cmpxchg16b could potentially occur between the reads. - Alisonalissa

According to the reference at http://siyobik.info/main/reference/instruction/CMPXCHG8B%2FCMPXCHG16B, CMPXCHG16B is not atomic by default but can be made atomic with the LOCK prefix: http://siyobik.info/main/reference/instruction/LOCK

That means that by default, the data can be changed by another processor between the read and write phases. With LOCK, the whole read-compare-write sequence is atomic.

Underpinning answered 15/3, 2012 at 19:23 Comment(2)
"Note that CMPXCHG16B requires that the destination (memory) operand be 16-byte aligned." - Brandonbrandt
Sorry, yes, I meant cmpxchg16b with the lock prefix. But lock cannot be used with SSE instructions. - Alisonalissa
