Memory ordering behavior of std::atomic::load

Am I wrong to assume that atomic::load should also act as a memory barrier, ensuring that all previous non-atomic writes will become visible to other threads?

To illustrate:

volatile bool arm1 = false;
std::atomic_bool arm2{false};
bool triggered = false;

Thread1:

arm1 = true;
//std::atomic_thread_fence(std::memory_order_seq_cst); // this would do the trick
if (arm2.load())
    triggered = true;

Thread2:

arm2.store(true);
if (arm1)
    triggered = true;

I expected that after executing both threads 'triggered' would be true. Please don't suggest making arm1 atomic; the point is to explore the behavior of atomic::load.

While I have to admit I don't fully understand the formal definitions of the different relaxed semantics of memory order, I thought that sequentially consistent ordering was pretty straightforward in that it guarantees that "a single total order exists in which all threads observe all modifications in the same order." To me this implies that std::atomic::load with the default memory order of std::memory_order_seq_cst will also act as a memory fence. This is further corroborated by the following statement under "Sequentially-consistent ordering":

Total sequential ordering requires a full memory fence CPU instruction on all multi-core systems.

Yet my simple example below demonstrates that this is not the case with MSVC 2013, gcc 4.9 (x86), and clang 3.5.1 (x86), where the atomic load simply translates to a plain load instruction.

#include <atomic>

std::atomic_long al;

#ifdef _WIN32
__declspec(noinline)
#else
__attribute__((noinline))
#endif
long load() {
    return al.load(std::memory_order_seq_cst);
}

int main(int argc, char* argv[]) {
    long r = load();
}

With gcc this looks like:

load():
   mov  rax, QWORD PTR al[rip]   ; <--- plain load here, no fence or xchg
   ret
main:
   call load()
   xor  eax, eax
   ret

I'll omit the MSVC and clang output, which is essentially identical. Now, with gcc for ARM, we get what I expected:

load():
     dmb    sy                         ; <---- data memory barrier here
     movw   r3, #:lower16:.LANCHOR0
     movt   r3, #:upper16:.LANCHOR0
     ldr    r0, [r3]                   
     dmb    sy                         ; <----- and here
     bx lr
main:
    push    {r3, lr}
    bl  load()
    movs    r0, #0
    pop {r3, pc}
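
For comparison, on x86 it is the seq_cst store that pays for the barrier. A hypothetical store() next to the load() above, e.g.

void store(long v) {
    al.store(v, std::memory_order_seq_cst);
}

compiles with the same gcc to roughly this (newer compilers may emit a single xchg instead, which has the same fencing effect):

store(long):
   mov  QWORD PTR al[rip], rdi
   mfence                        ; <--- the full barrier lands on the store side
   ret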

This is not an academic question, it results in a subtle race condition in our code which called into question my understanding of the behavior of std::atomic.

Mantellone asked 28/2, 2015 at 15:36 Comment(6)
(The store would require a fence.) – Heidi
@tc It is wrong to say that loads have seq_cst semantics on x86. But you are right that they are strong enough to have acquire semantics. – Easton
The issue is not with the various processors' memory models but rather with the C++ standard's guarantees for atomic::load. Please take a look at the edit, which now contains an example of my (possibly incorrect) expectations. – Mantellone
@Heidi This is starting to make sense, but I find it confusing that the behavior of atomic_thread_fence(std::memory_order_seq_cst) is completely different from atomic::load(std::memory_order_seq_cst). While the atomic::load provides guarantees with respect to the object only, the fence seems to have a global effect. – Mantellone
There's a data race on arm1, which is accessed by both threads unconditionally and without any chance of synchronization. Note that the fact that it's volatile doesn't help; volatile has no effect on data races and so it might as well be dropped. – Eucken
seq_cst guarantees that a single total order exists in which all threads doing seq_cst loads observe all seq_cst modifications in the same order. Operations with weaker ordering do not participate in that total order and retain the right to be screwy. – Eucken

Sigh, this was too long for a comment:

Isn't the meaning of atomic "to appear to occur instantaneously to the rest of the system"?

I'd say yes and no to that one, depending on how you think of it. For writes with SEQ_CST, yes. But as far as how atomic loads are handled, check out 29.3 of the C++11 standard. Specifically, 29.3.3 is really good reading, and 29.3.4 might be exactly what you're looking for:

For an atomic operation B that reads the value of an atomic object M, if there is a memory_order_seq_cst fence X sequenced before B, then B observes either the last memory_order_seq_cst modification of M preceding X in the total order S or a later modification of M in its modification order.

Basically, SEQ_CST forces a global order just like the standard says, but reads can return an old value without violating the 'atomic' constraint.
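
In code, the situation 29.3.4 describes looks roughly like this (a sketch with made-up names, not code from the question):

#include <atomic>

std::atomic<int> m{0};   // the atomic object M

int reader() {
    // X: a seq_cst fence sequenced before the load B
    std::atomic_thread_fence(std::memory_order_seq_cst);
    // B: observes the last seq_cst modification of m preceding X in
    // the total order S, or a later value in m's modification order;
    // without the fence, an older value could legally be returned
    return m.load(std::memory_order_relaxed);
}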

To accomplish 'getting the absolute latest value' you'll need to perform an operation that forces the hardware coherency protocol to lock (the lock instruction on x86_64). This is what the atomic compare-and-exchange operations do, if you look at the assembly output.
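
For instance (a sketch reusing the al from the question's snippet): an atomic read-modify-write is required by the standard to read the last value in the object's modification order, and on x86_64 it compiles to a lock-prefixed instruction:

#include <atomic>

std::atomic_long al;

long load_latest() {
    // A read-modify-write must observe the last value in al's
    // modification order; on x86_64 this typically becomes lock xadd.
    return al.fetch_add(0);
}

fetch_add(0) is the usual trick for a 'load' with read-modify-write strength; a compare_exchange loop achieves the same.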

Farsighted answered 4/3, 2015 at 2:39 Comment(3)
This was a really helpful clarification! So basically my example is wrong on 2 counts: a) there is no guarantee that the arm2.load will read the most up-to-date value without a preceding atomic_thread_fence(seq_cst); and b) the arm2.load(seq_cst) has no effect whatsoever on the preceding arm1 store. If I understood you correctly, to fix the hypothetical example I would need to not only make arm1 also atomic but also introduce memory fences in both threads between the store and the load. – Mantellone
Well... a) the call arm2.load() actually DOES have a SEQ_CST barrier if you don't specify any arguments; that's the memory order it defaults to. Also, no type of memory fence will allow it to capture the 'guaranteed most recent value', because that's not really what memory fences are for. I prefer to think of it as: reads are always relativistic. You get a delayed picture of what is happening in memory. This is due to the speed of light (signal propagation speed limit). b) Correct. You only need the fence with the store. And to fix, you'll only need to make arm1 atomic. – Farsighted
To clarify, you also only need a store-store fence there, which is the same as release semantics. The SEQ_CST is unnecessary. – Farsighted

Am I wrong to assume that atomic::load should also act as a memory barrier, ensuring that all previous non-atomic writes will become visible to other threads?

Yes. atomic::load(SEQ_CST) just enforces that the read cannot load an 'invalid' value, and that neither writes nor loads may be reordered by the compiler or the CPU around that statement. It does not mean you'll always get the most up-to-date value.
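
If the goal is "wait until the write shows up", the idiomatic pattern is to keep loading rather than to expect any single load to be current. A minimal sketch (my names, not from the question):

#include <atomic>

std::atomic<bool> ready{false};
int data = 0;

void producer() {
    data = 42;                                     // plain write
    ready.store(true, std::memory_order_release);  // publish it
}

void consumer() {
    while (!ready.load(std::memory_order_acquire)) {
        // spin until the store becomes visible
    }
    // the acquire load synchronized with the release store,
    // so data == 42 is guaranteed here
}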

I would expect your code to have a data race because, again, barriers do not ensure that the most up-to-date value is seen at a given time; they just prevent reordering.

It's perfectly valid for Thread1 to not see the write by Thread2 and therefore not set triggered, and for Thread2 to not see the write by Thread1 (again, not setting triggered), because you only write 'atomically' from one thread.

With two threads writing and reading shared values, you'll need a barrier in each thread to maintain consistency. It looks like you knew this already based on your code comments, so I'll just leave it at "the C++ standard is somewhat misleading when it comes to accurately describing the meaning of atomic / multithreaded operations".
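
Concretely, a sketch of what that could look like for the question's example (arm1 and triggered also have to become atomic to remove the data races, even though the per-variable operations can stay relaxed):

#include <atomic>

std::atomic<bool> arm1{false};
std::atomic<bool> arm2{false};
std::atomic<bool> triggered{false};

void thread1() {
    arm1.store(true, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst);
    if (arm2.load(std::memory_order_relaxed))
        triggered.store(true, std::memory_order_relaxed);
}

void thread2() {
    arm2.store(true, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst);
    if (arm1.load(std::memory_order_relaxed))
        triggered.store(true, std::memory_order_relaxed);
}

The two seq_cst fences are totally ordered, and whichever one comes second guarantees that its thread sees the other thread's store, so after both threads finish, triggered is true.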

Even though you're writing C++, it's still best, in my opinion, to think about what you're doing on the underlying architecture.

Not sure I explained that well, but I'd be happy to go into more detail if you'd like.

Farsighted answered 1/3, 2015 at 17:0 Comment(2)
Isn't the meaning of atomic "to appear to occur instantaneously to the rest of the system"? en.wikipedia.org/wiki/Atomicity_(programming). To me it means that atomic::load is indeed guaranteed to get the most up-to-date value written by the last atomic::store. I know this to be true at least for x86 architectures. – Mantellone
The problem is, which of the stores in the program was "the last"? You don't know which one it will be until after you see the value that was loaded. Catch-22. – Eucken
