C11 Atomic Acquire/Release and x86_64 lack of load/store coherence?

I am struggling with Section 5.1.2.4 of the C11 Standard, in particular the semantics of Release/Acquire. I note that https://preshing.com/20120913/acquire-and-release-semantics/ (amongst others) states that:

... Release semantics prevent memory reordering of the write-release with any read or write operation that precedes it in program order.

So, for the following:

#include <stdatomic.h>
#include <stdbool.h>

typedef struct test_struct
{
  _Atomic(bool) ready ;
  int  v1 ;
  int  v2 ;
} test_struct_t ;

extern void
test_init(test_struct_t* ts, int v1, int v2)
{
  ts->v1 = v1 ;
  ts->v2 = v2 ;
  atomic_store_explicit(&ts->ready, false, memory_order_release) ;
}

extern int
test_thread_1(test_struct_t* ts, int v2)
{
  int v1 ;
  while (atomic_load_explicit(&ts->ready, memory_order_acquire)) ;
  ts->v2 = v2 ;       // expect write to happen before store/release 
  v1     = ts->v1 ;   // expect read to happen before store/release 
  atomic_store_explicit(&ts->ready, true, memory_order_release) ;
  return v1 ;
}

extern int
test_thread_2(test_struct_t* ts, int v1)
{
  int v2 ;
  while (!atomic_load_explicit(&ts->ready, memory_order_acquire)) ;
  ts->v1 = v1 ;
  v2     = ts->v2 ;   // expect read to happen after store/release in thread "1"
  atomic_store_explicit(&ts->ready, false, memory_order_release) ;
  return v2 ;
}

where those are executed:

>   in the "main" thread:  test_struct_t ts ;
>                          test_init(&ts, 1, 2) ;
>                          start thread "2" which does: r2 = test_thread_2(&ts, 3) ;
>                          start thread "1" which does: r1 = test_thread_1(&ts, 4) ;

I would, therefore, expect thread "1" to have r1 == 1 and thread "2" to have r2 == 4.

I would expect that because (following paras 16 and 18 of sect 5.1.2.4):

  • all the (not atomic) reads and writes are "sequenced before" and hence "happen before" the atomic write/release in thread "1",
  • which "inter-thread-happens-before" the atomic read/acquire in thread "2" (when it reads 'true'),
  • which in turn is "sequenced before" and hence "happens before" the (not atomic) reads and writes (in thread "2").
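
The three-step chain above can be sketched as a minimal one-way hand-off. This is only an illustration under assumptions: it uses POSIX threads (the question does not say which thread API is in use), and the names payload, payload_ready, producer, consumer and run_handoff are hypothetical, not taken from the code in the question.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical names: a plain payload guarded by an atomic flag. */
static int payload;                /* ordinary, non-atomic data */
static atomic_bool payload_ready;  /* the synchronisation flag  */

static void *producer(void *arg)
{
  (void)arg;
  payload = 42;  /* sequenced before the release store below */
  atomic_store_explicit(&payload_ready, true, memory_order_release);
  return NULL;
}

static void *consumer(void *arg)
{
  int *out = arg;
  /* Spin until the acquire load reads the value written by the release store. */
  while (!atomic_load_explicit(&payload_ready, memory_order_acquire))
    ;
  *out = payload;  /* sequenced after the acquire load, so it sees 42 */
  return NULL;
}

/* Runs one hand-off and returns the value the consumer read. */
int run_handoff(void)
{
  pthread_t p, c;
  int seen = 0;

  payload = 0;
  atomic_store_explicit(&payload_ready, false, memory_order_relaxed);

  pthread_create(&c, NULL, consumer, &seen);
  pthread_create(&p, NULL, producer, NULL);
  pthread_join(p, NULL);
  pthread_join(c, NULL);
  return seen;
}
```

Calling run_handoff() from main should return 42 on any conforming implementation, because the consumer's read of payload happens after the producer's write to it.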

However, it is entirely possible that I have failed to understand the standard.

I observe that the code generated for x86_64 includes:

test_thread_1:
  movzbl (%rdi),%eax      -- atomic_load_explicit(&ts->ready, memory_order_acquire)
  test   $0x1,%al
  jne    <test_thread_1>  -- while is true
  mov    %esi,0x8(%rdi)   -- (W1) ts->v2 = v2
  mov    0x4(%rdi),%eax   -- (R1) v1     = ts->v1
  movb   $0x1,(%rdi)      -- (X1) atomic_store_explicit(&ts->ready, true, memory_order_release)
  retq   

test_thread_2:
  movzbl (%rdi),%eax      -- atomic_load_explicit(&ts->ready, memory_order_acquire)
  test   $0x1,%al
  je     <test_thread_2>  -- while is false
  mov    %esi,0x4(%rdi)   -- (W2) ts->v1 = v1
  mov    0x8(%rdi),%eax   -- (R2) v2     = ts->v2   
  movb   $0x0,(%rdi)      -- (X2) atomic_store_explicit(&ts->ready, false, memory_order_release)
  retq   

And provided that R1 and X1 happen in that order, this gives the result I expect.

But my understanding of x86_64 is that reads happen in order with other reads and writes happen in order with other writes, but reads and writes may not happen in order with each other. Which implies it is possible for X1 to happen before R1, and even for X1, X2, W2, R1 to happen in that order -- I believe. [This seems desperately unlikely, but if R1 were held up by some cache issues ?]

Please: what am I not understanding?

I note that if I change the loads/stores of ts->ready to memory_order_seq_cst, the code generated for the stores is:

  xchg   %cl,(%rdi)

which is consistent with my understanding of x86_64 and will give the result I expect.
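
To compare the two orderings in isolation, a pair of small wrappers like the following can be compiled to assembly (for example with gcc -O2 -S). The function names store_release and store_seq_cst are hypothetical, and the instructions named in the comments are what current x86_64 compilers typically emit, not something the C standard requires.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical helpers for comparing the code generated for each ordering. */

/* On x86_64 this is typically compiled to a plain mov. */
void store_release(atomic_bool *flag, bool value)
{
  atomic_store_explicit(flag, value, memory_order_release);
}

/* On x86_64 this is typically compiled to xchg (or to mov followed by mfence). */
void store_seq_cst(atomic_bool *flag, bool value)
{
  atomic_store_explicit(flag, value, memory_order_seq_cst);
}
```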

Astrodynamics answered 9/2, 2020 at 16:0 Comment(2)
On x86, all ordinary (not non-temporal) stores have release semantics. Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 3 (3A, 3B, 3C & 3D): System Programming Guide, 8.2.3.3 Stores Are Not Reordered With Earlier Loads. So your compiler is correctly translating your code (how surprising), such that your code is effectively completely sequential and nothing interesting happens concurrently. - Agriculturist
Thank you! (I was going quietly bonkers.) FWIW I recommend link -- particularly section 3, the "Programmer's Model". But to avoid the mistake I fell into, note that in "3.1 The Abstract Machine" there are "hardware threads" each of which is "a single in-order stream of instruction execution" (my emphasis added). I can now return to trying to understand the C11 Standard... with less cognitive dissonance :-) - Astrodynamics

x86's memory model is basically sequential consistency plus a store buffer (with store forwarding). So every store is a release-store (see Footnote 1). This is why only seq-cst stores need any special instructions. (C/C++11 atomics mappings to asm). Also, https://stackoverflow.com/tags/x86/info has some links to x86 docs, including a formal description of the x86-TSO memory model (basically unreadable for most humans; requires wading through a lot of definitions).

Since you're already reading Jeff Preshing's excellent series of articles, I'll point you at another one that goes into more detail: https://preshing.com/20120930/weak-vs-strong-memory-models/

The only reordering that's allowed on x86 is StoreLoad, not LoadStore, if we're talking in those terms. (Store forwarding can do extra fun stuff if a load only partially overlaps a store; Globally Invisible load instructions, although you'll never get that in compiler-generated code for stdatomic.)
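
StoreLoad reordering is usually illustrated with the "store buffer" litmus test, sketched below under assumptions: POSIX threads are available, and all names (x, y, r1, r2, thread_a, thread_b, count_both_zero, STORE_ORDER, LOAD_ORDER) are hypothetical. With the default seq_cst orderings the outcome r1 == 0 && r2 == 0 is forbidden. Rebuilding with -DSTORE_ORDER=memory_order_release -DLOAD_ORDER=memory_order_acquire permits that outcome, and x86 hardware may produce it, though a harness this simple is not guaranteed to catch it.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

/* Orderings under test; override on the compiler command line to experiment. */
#ifndef STORE_ORDER
#define STORE_ORDER memory_order_seq_cst
#endif
#ifndef LOAD_ORDER
#define LOAD_ORDER memory_order_seq_cst
#endif

/* Hypothetical litmus-test state. */
static atomic_int x, y;
static int r1, r2;

static void *thread_a(void *arg)
{
  (void)arg;
  atomic_store_explicit(&x, 1, STORE_ORDER);   /* store x ...     */
  r1 = atomic_load_explicit(&y, LOAD_ORDER);   /* ... then load y */
  return NULL;
}

static void *thread_b(void *arg)
{
  (void)arg;
  atomic_store_explicit(&y, 1, STORE_ORDER);   /* store y ...     */
  r2 = atomic_load_explicit(&x, LOAD_ORDER);   /* ... then load x */
  return NULL;
}

/* Returns how many of the runs ended with both loads reading 0,
   i.e. each load appeared to run ahead of the other thread's store. */
int count_both_zero(int iterations)
{
  int hits = 0;
  for (int i = 0; i < iterations; i++) {
    pthread_t a, b;
    atomic_store_explicit(&x, 0, memory_order_relaxed);
    atomic_store_explicit(&y, 0, memory_order_relaxed);
    pthread_create(&a, NULL, thread_a, NULL);
    pthread_create(&b, NULL, thread_b, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    if (r1 == 0 && r2 == 0)
      hits++;
  }
  return hits;
}
```

Creating two threads per iteration keeps the sketch short but gives the threads little overlap; a more aggressive harness would reuse the threads and release them together on each round.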

@EOF commented with the right quote from Intel's manual:

Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 3 (3A, 3B, 3C & 3D): System Programming Guide, 8.2.3.3 Stores Are Not Reordered With Earlier Loads.


Footnote 1: ignoring weakly-ordered NT stores; this is why you normally sfence after doing NT stores. C11 / C++11 implementations assume you aren't using NT stores. If you are, use _mm_sfence before a release operation to make sure it respects your NT stores. (In general don't use _mm_mfence / _mm_sfence in other cases; usually you only need to block compile-time reordering. Or of course just use stdatomic.)
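
A minimal sketch of the pattern in Footnote 1, assuming an x86_64 target with <immintrin.h>; on any other target the sketch falls back to ordinary stores. The names nt_buffer, nt_published, publish_with_nt_stores and read_published are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>
#if defined(__x86_64__) || defined(_M_X64)
#include <immintrin.h>
#endif

#define NT_COUNT 4

/* Hypothetical shared state: a buffer filled with NT stores, then published. */
static int nt_buffer[NT_COUNT];
static atomic_bool nt_published;

void publish_with_nt_stores(int value)
{
#if defined(__x86_64__) || defined(_M_X64)
  for (int i = 0; i < NT_COUNT; i++)
    _mm_stream_si32(&nt_buffer[i], value + i);  /* weakly-ordered NT stores */
  _mm_sfence();  /* make the NT stores ordered before the flag store below */
#else
  for (int i = 0; i < NT_COUNT; i++)
    nt_buffer[i] = value + i;                   /* ordinary stores elsewhere */
#endif
  atomic_store_explicit(&nt_published, true, memory_order_release);
}

/* Returns nt_buffer[index] once published, or -1 if not yet published. */
int read_published(int index)
{
  if (!atomic_load_explicit(&nt_published, memory_order_acquire))
    return -1;
  return nt_buffer[index];
}
```

Without the _mm_sfence() call, the release store by itself compiles to a plain mov on x86_64 and would not be ordered after the NT stores.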

Chaffer answered 11/2, 2020 at 11:33 Comment(3)
I find the x86-TSO: A Rigorous and Usable Programmer’s Model for x86 Multiprocessors more readable than the (related) Formal Description you referenced. But my real ambition is to fully understand sections 5.1.2.4 and 7.17.3 of the C11/C18 Standard. In particular, I think I get Release/Acquire/Acquire+Release, but memory_order_seq_cst is defined separately and I am struggling to see how they all fit together :-( - Astrodynamics
@ChrisHall: I found it helped to realize just exactly how weak acq/rel can be, and for that you need to look at machines like POWER that can do IRIW reordering (which seq-cst forbids but acq/rel doesn't). Will two atomic writes to different locations in different threads always be seen in the same order by other threads?. Also How to achieve a StoreLoad barrier in C++11? has some discussion about just how little the standard formally guarantees about ordering outside of synchronizes-with or everything-seq-cst cases. - Chaffer
@ChrisHall: The main thing seq-cst does is block StoreLoad reordering. (On x86 that's the only thing it does beyond acq/rel). preshing.com/20120515/memory-reordering-caught-in-the-act uses asm, but it's equivalent to seq-cst vs. acq/rel - Chaffer
