C11 Atomic Acquire/Release and x86_64 lack of load/store coherence?

I am struggling with Section 5.1.2.4 of the C11 Standard, in particular the semantics of Release/Acquire. I note that https://preshing.com/20120913/acquire-and-release-semantics/ (amongst others) states that:

... Release semantics prevent memory reordering of the write-release with any read or write operation that precedes it in program order.

So, for the following:

typedef struct test_struct
{
  _Atomic(bool) ready ;
  int  v1 ;
  int  v2 ;
} test_struct_t ;

extern void
test_init(test_struct_t* ts, int v1, int v2)
{
  ts->v1 = v1 ;
  ts->v2 = v2 ;
  atomic_store_explicit(&ts->ready, false, memory_order_release) ;
}

extern int
test_thread_1(test_struct_t* ts, int v2)
{
  int v1 ;
  while (atomic_load_explicit(&ts->ready, memory_order_acquire)) ;
  ts->v2 = v2 ;       // expect read to happen before store/release 
  v1     = ts->v1 ;   // expect write to happen before store/release 
  atomic_store_explicit(&ts->ready, true, memory_order_release) ;
  return v1 ;
}

extern int
test_thread_2(test_struct_t* ts, int v1)
{
  int v2 ;
  while (!atomic_load_explicit(&ts->ready, memory_order_acquire)) ;
  ts->v1 = v1 ;
  v2     = ts->v2 ;   // expect write to happen after store/release in thread "1"
  atomic_store_explicit(&ts->ready, false, memory_order_release) ;
  return v2 ;
}

where those are executed:

>   in the "main" thread:  test_struct_t ts ;
>                          test_init(&ts, 1, 2) ;
>                          start thread "2" which does: r2 = test_thread_2(&ts, 3) ;
>                          start thread "1" which does: r1 = test_thread_1(&ts, 4) ;

I would, therefore, expect thread "1" to have r1 == 1 and thread "2" to have r2 = 4.

I would expect that because (following paras 16 and 18 of sect 5.1.2.4):

all the (not atomic) reads and writes are "sequenced before" and hence "happen before" the atomic write/release in thread "1",
which "inter-thread-happens-before" the atomic read/acquire in thread "2" (when it reads 'true'),
which in turn is "sequenced before" and hence "happens before" the (not atomic) reads and writes (in thread "2").

However, it is entirely possible that I have failed to understand the standard.

I observe that the code generated for x86_64 includes:

test_thread_1:
  movzbl (%rdi),%eax      -- atomic_load_explicit(&ts->ready, memory_order_acquire)
  test   $0x1,%al
  jne    <test_thread_1>  -- while is true
  mov    %esi,0x8(%rdi)   -- (W1) ts->v2 = v2
  mov    0x4(%rdi),%eax   -- (R1) v1     = ts->v1
  movb   $0x1,(%rdi)      -- (X1) atomic_store_explicit(&ts->ready, true, memory_order_release)
  retq   

test_thread_2:
  movzbl (%rdi),%eax      -- atomic_load_explicit(&ts->ready, memory_order_acquire)
  test   $0x1,%al
  je     <test_thread_2>  -- while is false
  mov    %esi,0x4(%rdi)   -- (W2) ts->v1 = v1
  mov    0x8(%rdi),%eax   -- (R2) v2     = ts->v2   
  movb   $0x0,(%rdi)      -- (X2) atomic_store_explicit(&ts->ready, false, memory_order_release)
  retq

And provided that R1 and X1 happen in that order, this gives the result I expect.

But my understanding of x86_64 is that reads happen in order with other reads and writes happen in order with other writes, but reads and writes may not happen in order with each other. Which implies it is possible for X1 to happen before R1, and even for X1, X2, W2, R1 to happen in that order -- I believe. [This seems desperately unlikely, but if R1 were held up by some cache issues ?]

Please: what am I not understanding?

I note that if I change the loads/stores of ts->ready to memory_order_seq_cst, the code generated for the stores is:

  xchg   %cl,(%rdi)

which is consistent with my understanding of x86_64 and will give the result I expect.

x86's memory model is basically sequential-consistency plus a store buffer (with store forwarding). So every store is a release-store¹. This is why only seq-cst stores need any special instructions. (C/C++11 atomics mappings to asm). Also, https://stackoverflow.com/tags/x86/info has some links to x86 docs, including a formal description of the x86-TSO memory model (basically unreadable for most humans; requires wading through a lot of definitions).

Since you're already reading Jeff Preshing's excellent series of articles, I'll point you at another one that goes into more detail: https://preshing.com/20120930/weak-vs-strong-memory-models/

The only reordering that's allowed on x86 is StoreLoad, not LoadStore, if we're talking in those terms. (Store forwarding can do extra fun stuff if a load only partially overlaps a store; Globally Invisible load instructions, although you'll never get that in compiler-generated code for stdatomic.)

@EOF commented with the right quote from Intel's manual:

Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 3 (3A, 3B, 3C & 3D): System Programming Guide, 8.2.3.3 Stores Are Not Reordered With Earlier Loads.

Footnote 1: ignoring weakly-ordered NT stores; this is why you normally sfence after doing NT stores. C11 / C++11 implementations assume you aren't using NT stores. If you are, use _mm_sfence before a release operation to make sure it respects your NT stores. (In general don't use _mm_mfence / _mm_sfence in other cases; usually you only need to block compile-time reordering. Or of course just use stdatomic.)

Recommended topics

Hot tags