Loads and stores reordering on ARM

I'm not an ARM expert, but won't those stores and loads be subject to reordering, at least on some ARM architectures?

  #include <atomic>
  using namespace std;

  atomic<int> atomic_var;
  int nonAtomic_var;
  int nonAtomic_var2;

  void foo()
  {
          atomic_var.store(111, memory_order_relaxed);
          atomic_var.store(222, memory_order_relaxed);
  }

  void bar()
  {
          nonAtomic_var = atomic_var.load(memory_order_relaxed);
          nonAtomic_var2 = atomic_var.load(memory_order_relaxed);
  }

I've had no success in making the compiler put memory barriers between them.

I've tried something like the following (cross-compiling on an x64 host):

$ arm-linux-gnueabi-g++ -mcpu=cortex-a9 -std=c++11 -S -O1 test.cpp

And I've got:

_Z3foov:
          .fnstart
  .LFB331:
          @ args = 0, pretend = 0, frame = 0
          @ frame_needed = 0, uses_anonymous_args = 0
          @ link register save eliminated.
          movw    r3, #:lower16:.LANCHOR0
          movt    r3, #:upper16:.LANCHOR0
          mov     r2, #111
          str     r2, [r3]
          mov     r2, #222
          str     r2, [r3]
          bx      lr
          ;...
  _Z3barv:
          .fnstart
  .LFB332:
          @ args = 0, pretend = 0, frame = 0
          @ frame_needed = 0, uses_anonymous_args = 0
          @ link register save eliminated.
          movw    r3, #:lower16:.LANCHOR0
          movt    r3, #:upper16:.LANCHOR0
          ldr     r2, [r3]
          str     r2, [r3, #4]
          ldr     r2, [r3]
          str     r2, [r3, #8]
          bx      lr

Are loads and stores to the same location never reordered on ARM? I couldn't find such restriction in the ARM docs.

I'm asking in regard to the C++11 standard, which states that:

All modifications to any particular atomic variable occur in a total order that is specific to this one atomic variable.

Progressive answered 28/11, 2019 at 12:36 Comment(0)

The total order for a single variable exists because of cache coherency (MESI): a store can't commit from the store buffer into L1d cache and become globally visible to other threads unless the core owns exclusive access to that cache line. (MESI Exclusive or Modified state.)

That C++ guarantee doesn't require any barriers to implement on any normal CPU architecture because all normal ISAs have coherent caches, normally using a variant of MESI. This is why volatile happens to work as a legacy / UB version of mo_relaxed atomic on mainstream C++ implementations (but generally don't do it). See also When to use volatile with multi threading? for more details.
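
For example, here's a minimal sketch (names are mine, not from the question) of what that looks like in practice: on mainstream compilers targeting ARM, both functions typically compile to a single plain str with no barrier, but only the std::atomic version has defined behaviour under ISO C++ when other threads access the variable concurrently:

    #include <atomic>

    std::atomic<int> relaxed_var;
    volatile int legacy_var;   // legacy / UB approach - don't do this in new code

    void store_relaxed()
    {
        // Atomicity and the per-variable total order come from the coherent
        // cache; no barrier instruction is needed for relaxed.
        relaxed_var.store(1, std::memory_order_relaxed);
    }

    void store_legacy()
    {
        // Typically compiles to the same plain store, but concurrent access
        // is a data race (UB) as far as ISO C++ is concerned.
        legacy_var = 1;
    }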

(Some systems exist with two different kinds of CPU that share memory, e.g. microcontroller + DSP, but C++ std::thread won't start threads across cores that don't share a coherent view of that memory. So compilers only have to do code-gen for ARM cores in the same inner-shareable coherency domain.)


For any given atomic object, a total order of modification by all threads will always exist (as guaranteed by the ISO C++ standard you quoted), but you don't know ahead of time what it's going to be unless you establish synchronization between threads.

e.g. different runs of this program could have both loads go first, or one load then both stores then the other load.

This total order (for a single variable) will be compatible with program order for each thread, but is an arbitrary interleaving of program orders.
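
For instance, here are two of the many interleavings that are legal if foo() and bar() run concurrently (a sketch of my own, with both variables starting at 0):

    Run A (writer finishes first):        Run B (one load, both stores, other load):
        foo: store 111                        bar: load  -> sees 0
        foo: store 222                        foo: store 111
        bar: load  -> sees 222                foo: store 222
        bar: load  -> sees 222                bar: load  -> sees 222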

memory_order_relaxed only guarantees atomicity of the operation on that variable; it doesn't order it wrt. anything else. The only ordering that's fixed at compile time is wrt. other accesses to the same atomic variable by this thread.

Different threads will agree on the modification order for this variable, but might disagree on the global modification order for all objects. (ARMv8 made the ARM memory model multi-copy-atomic so this is impossible (and probably no real earlier ARM violated that), but POWER does in real life allow two independent reader threads to disagree on the order of stores by 2 other independent writer threads. This is called IRIW reordering. Will two atomic writes to different locations in different threads always be seen in the same order by other threads?)
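
The classic IRIW litmus test looks something like the sketch below (variable and function names are mine, not from the question). With anything weaker than seq_cst, ISO C++ allows the two readers to disagree about which store happened first, and POWER hardware can actually produce that result:

    #include <atomic>

    std::atomic<int> x{0}, y{0};

    void writer1() { x.store(1, std::memory_order_release); }
    void writer2() { y.store(1, std::memory_order_release); }

    // The two readers load the same variables in opposite order.
    // The outcome r1==1, r2==0 together with r3==1, r4==0 means the readers
    // disagreed about which independent store happened first (IRIW reordering).
    void reader1(int& r1, int& r2)
    {
        r1 = x.load(std::memory_order_acquire);
        r2 = y.load(std::memory_order_acquire);
    }

    void reader2(int& r3, int& r4)
    {
        r3 = y.load(std::memory_order_acquire);
        r4 = x.load(std::memory_order_acquire);
    }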

The fact that IRIW reordering is a possibility when multiple variables are involved is (among other things) why it even needs to be said that a total modification order does always exist for each individual variable separately.

For an all-thread total order to exist, you need all your atomic accesses to use seq_cst, which would involve barriers. But even that of course wouldn't fully determine at compile time what the order will be; different timings on different runs will lead to a load seeing a certain store or not.
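
To actually see the barriers the question was trying to provoke, a simple experiment (my own sketch, not from the question) is to switch the stores in foo() to seq_cst, which is the default memory order; on an ARMv7 target like the cortex-a9 above, compilers typically bracket each store with dmb ish barrier instructions:

    #include <atomic>

    std::atomic<int> atomic_var;

    void foo_seq_cst()
    {
        // seq_cst (the default) is what makes the compiler emit barriers
        // (e.g. dmb ish on ARMv7) around each store; relaxed never does.
        atomic_var.store(111);   // equivalent to memory_order_seq_cst
        atomic_var.store(222);
    }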

Are loads and stores to the same location never reordered on ARM?

From within a single thread, no. If you do multiple stores to a memory location, the last one in program order will always appear to other threads as the last one. i.e. once the dust settles, the memory location will have the value stored by the last store. Anything else would break the illusion of program order for threads reloading their own stores.


Some of the ordering guarantees in the C++ standard are even called "write-write coherence" and other kinds of coherence. ISO C++ doesn't explicitly require coherent caches (an implementation on an ISA that needs explicit flushing would be possible), but it would not be efficient.

http://eel.is/c++draft/intro.races#19

[ Note: The four preceding coherence requirements effectively disallow compiler reordering of atomic operations to a single object, even if both operations are relaxed loads. This effectively makes the cache coherence guarantee provided by most hardware available to C++ atomic operations. — end note ]


Most of the above is about modification order, not LoadLoad reordering.

That is a separate thing. C++ guarantees read-read coherence, i.e. that 2 reads of the same atomic object by the same thread happen in program order relative to each other.

http://eel.is/c++draft/intro.races#16

If a value computation A of an atomic object M happens before a value computation B of M, and A takes its value from a side effect X on M, then the value computed by B shall either be the value stored by X or the value stored by a side effect Y on M, where Y follows X in the modification order of M. [ Note: This requirement is known as read-read coherence. — end note ]

A "value computation" is a read, aka a load, of a variable. The phrase "where Y follows X in the modification order of M" is the part that guarantees that later reads in the same thread can't observe writes from other threads that are earlier (in the modification order) than a write they already saw.
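
Applied to the question's bar() running concurrently with one execution of foo() (my own enumeration, assuming everything starts at 0), read-read coherence plus the writer's program order constrain the possible pairs of loaded values like this:

    Modification order of atomic_var (fixed by foo()'s program order): 0 -> 111 -> 222

    first load | second load    | allowed?
    -----------+----------------+----------------------------------------------
        0      | 0, 111, or 222 | yes (the second load may move forward in the
               |                |      modification order, or stay put)
       111     | 111 or 222     | yes
       222     | 222            | yes
       111     | 0              | no  (read-read coherence: a later read can't
               |                |      go backwards in the modification order)
       222     | 0 or 111       | no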

That's one of the 4 conditions that the previous quote I linked was talking about.

The fact that compilers compile it to two plain ARM loads is proof enough that the ARM ISA also guarantees this. (Because we know for sure that ISO C++ requires it.)

I'm not familiar with ARM manuals but presumably it's in there somewhere.

See also A Tutorial Introduction to the ARM and POWER Relaxed Memory Models - a paper that goes into significant detail about what reorderings are/aren't allowed for various test cases.

Mondrian answered 28/11, 2019 at 13:15 Comment(15)
"memory_order_relaxed only assures atomic execution of the atomic variable itself. It's ordered only wrt. other accesses to the same atomic variable by this thread." - To be sure are you saying that two different threads can see different modification order? If not then even without cache at all all is needed for two different threads to see different modification order is that a processor on one of them reordered loads. Am I missing something?Progressive
@listerreg: No, that would violate the standard. The all-threads order is compatible with program order for each thread, but can be an arbitrary interleaving. (Again, this is the order for one specific variable.) Rephrased that part of my answer.Mondrian
I know this is not a chat but I’m struggling with this for a month already. Is the following possible: core1 setting atomic_var to 111, core2 reading this value and assigning it to nonAtomic_var, core3 (reordering loads) reading this value and assigning it to nonAtomic_var2. So at the end the core2 has nonAtomic_var set to 111 and nonAtomic_var2 set to 222 and the core3 the other way around?Progressive
@listerreg: So core 1 runs foo (the writer), cores 2 and 3 run bar() the reader? Except with the nonAtomic_vars being thread-local or function locals, so two readers don't conflict with each other. No, that's not possible for core3. Two reads of the same variable by the same thread can't reorder with each other at compile time or run-time. That's read-read coherence; see the link to the ISO standard. You could have both loads in core3 see 0, 111, or 222, but not 222 then 111; that would imply a modification order incompatible with the writer's program order.Mondrian
Yes, thank you! So the c++ standard forbids it. How can I convince myself that the ARM processor will also forbid it? I cannot see how cache has anything to do with it. As I’ve written above there could be no cache at all only instruction reordering (reordered loads from the same location) to fail this c++ standard requirement. Is it this non-IRIW reordering model on ARM that guarantees this? Should I look deeper into the multi-copy-atomic idiom to get this right?Progressive
@listerreg: If there's no cache, then only one core can access memory at once. Whatever order the memory controller happens to allow operations to go in, that's the total order. If there is cache, it's HW MESI coherency protocol messages that establish an order. It's not IRIW; the reads are not independent they're of the same variable. I haven't looked at ARM manuals but on x86 the manuals do say memory operations on the same location aren't reordered with each other. This is necessary for correct execution of a single thread, as I said in my answer.Mondrian
But it’s not the writes that bother me (maybe I shouldn’t have included the foo function in my question after all). You can have a different observed order among threads (cores) just with reordered reads from the same location (I don’t know, maybe a later read carries an address dependency into an instruction that is earlier in the code and the processor decides to execute it first - I know very little about processors, is it possible?). I know x86 doesn’t allow this but I haven’t found similar info in the ARM docs.Progressive
@listerreg: Ok, yes, most of what I was saying was about modification order, not LoadLoad reordering. That is a separate thing. C++ guarantees read-read coherence, i.e. that 2 reads of the same atomic object by the same thread happen in program order relative to each other. eel.is/c++draft/intro.races#16 (One of the 4 conditions that the quote I linked was talking about). The fact that compilers compile it to two plain ARM loads is proof enough that the ARM ISA also guarantees this.Mondrian
@listerreg: updated my answer with those details and a link to a memory model document.Mondrian
I believe that you cannot distinguish the total modification order from the loads reordering. The loads reordering is part of the observation process which constitutes this total order. It’s not a strict technical concept where a core has the "ability" to observe modifications in a particular order, but the way in which a program actually did observe them. And the "observation" here is understood as the visible side effects of a program, not as what the core could technically "see". So if two variables in one thread are assigned a different value from the same memory location and we can somehow examine...Progressive
...those values, then we say that this location was modified in accordance with the order of those assignments in the code. What is important to me is that I don’t know if compilers do compile the code as above. I could use invalid flags or not use the right flags (resulting in code for some simplistic ARM implementation). I could use some peculiar compiler. I don’t know; that’s why I asked this question.Progressive
@listerreg: I'm pretty certain that all CPUs across all sane ISAs (including ARM, POWER, MIPS, etc.) provide this same-address load-ordering guarantee for free with no barriers, and that's what the note in the C++ standard was talking about: exposing something that typical hardware provides. You can use clang or gcc with -O3 -mcpu=cortex-a57 (godbolt.org/z/CwQU7C) if you want to compile for a specific model of ARM, but I don't think GCC takes advantage of a target microarch being simple by omitting barriers with e.g. -mcpu=cortex-m0, even though it's a single-core CPU.Mondrian
@listerreg: If you don't use -mcpu=anything, GCC makes code that works on all ARM CPUs.Mondrian
@PeterCordes For an all-thread total order to exist, you need all your atomic accesses to use seq_cst - I guess that’s needed to prevent a core from seeing its own stores before they’re visible to other cores (via store to load forwarding). Otherwise a store-release/read-acquire is sufficient?Slifka
@DanielNitzan: Yes, that's correct, blocking StoreLoad reordering is the major effect / cost of seq_cst. The ARM ISA is multi-copy atomic (guaranteed on paper in ARMv8, and in practice in earlier silicon), so a store becomes visible to all other cores simultaneously. A thread seeing its own stores early is the only difference between acq_rel and seq_cst, I think. In the general case, there can be more weird stuff, like IRIW reordering - e.g. on PowerPC by store-forwarding to SMT sibling cores before global visibility to other physical cores.Mondrian
