What is the difference between a relaxed atomic load/store and a normal variable?

As I can see from this test case: https://godbolt.org/z/K477q1

The generated assembly for a relaxed atomic load/store is the same as for a normal variable: ldr and str

So, is there any difference between a relaxed atomic and a normal variable?

Ifill answered 9/9, 2020 at 11:5 Comment(6)
So, is there any difference between relaxed atomic and normal variable? In asm, no. In C, yes; it constrains optimization. Multithreading program stuck in optimized mode but runs normally in -O0Workshop
I think your link is not fully true. He can fix the hang with volatile.Ifill
@LongLT: no, volatile doesn't protect from multithreading issues. Since hardware has changed a lot since this keyword was introduced, it has become useless. The problem is cache synchronization, which volatile is unable to handle. I can't find a link where this is nicely explained; I will provide it when I find it.Roughhew
@MarekR: LongLT: You can in practice roll your own memory_order_relaxed with volatile on real hardware, but don't because there's no benefit. When to use volatile with multi threading? - basically never. But it does work in practice; the Linux kernel uses volatile this way. Real hardware (that we run threads and single-system-image OSes across) has coherent caches, please don't spread misinformation about how CPUs work. (It's not guaranteed safe by ISO C of course; volatile doesn't change the fact that it's data-race UB)Workshop
Also related: Why is integer assignment on a naturally aligned variable atomic on x86?, possibly a duplicateWorkshop
Is your Q specifically about C++11?Pronty

The difference is that a normal load/store is not guaranteed to be tear-free, whereas a relaxed atomic read/write is. Also, the atomic guarantees that the compiler doesn't rearrange or optimise out memory accesses, in a similar fashion to what volatile guarantees.
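For example, here is a minimal sketch of tearing (hypothetical names; on a 32-bit target the plain 64-bit store may compile to two 32-bit stores):

#include <atomic>
#include <cstdint>

std::uint64_t plain_value;                    // a concurrent read may tear (and is data-race UB)
std::atomic<std::uint64_t> atomic_value{0};   // loads/stores are always tear-free

void writer() {
    plain_value = 0x1111111122222222ULL;      // possibly two 32-bit stores: a reader could see half old, half new
    atomic_value.store(0x1111111122222222ULL, std::memory_order_relaxed);  // never observed half-written
}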

(Pre-C++11, volatile was an essential part of rolling your own atomics. But now it's obsolete for that purpose. It does still work in practice but is never recommended: When to use volatile with multi threading? - essentially never.)

On most platforms it just happens that the architecture provides a tear-free load/store by default (for aligned int and long) so it works out the same in asm if loads and stores don't get optimized away. See Why is integer assignment on a naturally aligned variable atomic on x86? for example. In C++ it's up to you to express how the memory should be accessed in your source code instead of relying on architecture-specific features to make the code work as intended.

If you were hand-writing in asm, your source code would already nail down when values were kept in registers vs. loaded / stored to (shared) memory. In C++, telling the compiler when it can/can't keep values private is part of why std::atomic<T> exists.

If you read one article on this topic, take a look at the Preshing one here: https://preshing.com/20130618/atomic-vs-non-atomic-operations/

Also try this presentation from CppCon 2017: https://www.youtube.com/watch?v=ZQFzMfHIxng


Links for further reading:

Also see Peter Cordes' linked article: https://electronics.stackexchange.com/q/387181
And a related one about the Linux kernel: https://lwn.net/Articles/793253/

No tearing is only part of what you get with std::atomic<T> - you also avoid data race undefined behaviour.

Andersen answered 9/9, 2020 at 13:6 Comment(14)
Besides lack of tearing (which happens for free in asm on most platforms), the other critical part of what std::atomic<T> gives you (even with mo_relaxed) is well-defined behaviour even with unsynchronized reads and writes. Non-atomic reads can be hoisted out of loops because of the as-if rule + data-race UB. See MCU programming - C++ O2 optimization breaks while loopWorkshop
@PeterCordes what's specifically meant by "UB" and "CSE"?Andersen
Also, Who's afraid of a big bad optimizing compiler? explains some details of the many things that can go wrong if you try to use plain non-atomic shared variables. (It's written from a context of the Linux kernel, where their solution is hand-rolled atomics using volatile, rather than C11 _Atomic or C++11 std::atomic, but same difference; the assumption of the underlying asm operation being free from tearing is what makes volatile work as a legacy way to do atomics. Plus the fact hardware has coherent caches.)Workshop
UB = Undefined Behaviour. (blog.llvm.org/posts/…). CSE = Common Subexpression Elimination, e.g. hoisting a load out of a loop like in that linked question. Oh, I see my linked answer used those terms without defining them; edited to fix. Thanks for pointing that out.Workshop
(which in many cases no longer provides enough of a guarantee in cached multiprocessor systems). - Please don't spread misinformation about CPU caches. volatile works on real hardware because caches are coherent. The Linux kernel uses it successfully. There are no C++ implementations that run std::thread threads across cores with non-coherent caches.Workshop
There are some rare boards, e.g. ARM with microcontroller + DSP, that have non-coherent shared memory, but we don't run threads of the same process across them. In the ARM memory model for example, threads are assumed to run on cores in the same inner-shareable domain. I edited your answer to fix that wrong claim that it would be a problem in practice.Workshop
Thanks @PeterCordes, it was my understanding that volatile is not enough of a guarantee that a read from CPU1 via its data cache will immediately see a write from CPU2 via its data cache and flush to RAM; you would need to add acquire and release memory barriers to guarantee that?Andersen
CPUs maintain coherency with some equivalent of MESI; a store can't commit to L1d cache until this core has exclusive ownership of the line (all other copies invalidated). A store has to wait for an RFO (Read for Ownership) if the line isn't already owned. Acquire and release have nothing at all to do with coherency for a single memory location, only ordering wrt. other locations. (And besides, the OP is asking about mo_relaxed). Also, it's hard to define the term "immediately" across CPUs. See also this Q&A.Workshop
This idea of cache being temporarily out of sync is a common misconception, but the whole point of coherent caches is to make sure this never happens. (Similar effects are created by store-forwarding inside a single core, where a core can see its own stores early to maintain its own illusion of running in program order, but the store-buffer isn't globally visible.)Workshop
Great answer! I think this is what I'm looking for. Thank youIfill
Also note that data can get between cores without having to write back all the way to RAM. Shared L3 cache is a backstop for coherency traffic.Workshop
Since you are already sharing a lot of links let me throw in yet another one: Memory Models for C/C++ ProgrammersLotz
and ...."What Every Programmer Should Know About Memory" : akkadia.org/drepper/cpumemory.pdfAndersen
Plus this is a useful video describing false sharing youtube.com/watch?v=dznxqe1Uk3EAndersen

Very good question actually, and I asked the same question when I started learning concurrency.

I'll answer as simply as possible, even though the answer is a bit more complicated.

Reading and writing the same non-atomic variable from different threads is undefined behavior - one thread is not guaranteed to read the value that the other thread wrote.

Using an atomic variable solves the problem - with atomics, all threads are guaranteed to read the latest written value, even if the memory order is relaxed.

In fact, atomics are always thread safe, regardless of the memory order! The memory order is not for the atomics - it's for non-atomic data.

Here is the thing - if you use locks, you don't have to think about these low-level details. Memory orders are used in lock-free environments, where we need to synchronize non-atomic data.

Here is the beautiful thing about lock-free algorithms: we use atomic operations that are always thread safe, but we "piggyback" memory orders onto those operations to synchronize the non-atomic data used in those algorithms.

For example, take a lock-free linked list. Usually, a lock-free linked-list node looks something like this:

template <typename T>
struct Node {
    std::atomic<Node*> next_node;   // the link is atomic
    T non_atomic_data;              // the payload is plain data
};

Now, let's say I push a new node into the list. next_node is always thread safe, another thread will always see the latest atomic value. But who guarantees that other threads see the correct value of non_atomic_data?

No-one.

Here is a perfect example of the usage of memory orders - we "piggyback" atomic stores and loads to next_node by also adding memory orders that synchronize the value of non_atomic_data.

So when we store a new node into the list, we use memory_order_release to "push" the non-atomic data to main memory. When we read the new node by reading next_node, we use memory_order_acquire and then we "pull" the non-atomic data from main memory. This way we ensure that both next_node and non_atomic_data are always synchronized across threads.
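Here is a minimal sketch of that pattern, reusing the Node template above (it assumes #include <atomic>; push here is single-producer to keep it short - a real multi-producer push would retry with compare_exchange_weak):

std::atomic<Node<int>*> head{nullptr};

void push(int value) {
    Node<int>* n = new Node<int>;
    n->non_atomic_data = value;             // plain, non-atomic write
    n->next_node.store(head.load(std::memory_order_relaxed),
                       std::memory_order_relaxed);
    // The release store "publishes" n: the write to non_atomic_data above
    // is guaranteed visible to any thread that acquire-loads head and sees n.
    head.store(n, std::memory_order_release);
}

int peek(int fallback) {
    // The acquire load pairs with the release store in push().
    Node<int>* n = head.load(std::memory_order_acquire);
    return n ? n->non_atomic_data : fallback;   // safe to read the plain member
}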

memory_order_relaxed doesn't synchronize any non-atomic data, it synchronizes only itself - the atomic variable. When this is used, developers can assume that the atomic variable doesn't reference any non-atomic data published by the same thread that wrote the atomic variable. In other words, that atomic variable isn't, for example, an index into a non-atomic array, or a pointer to non-atomic data, or an iterator to some non-thread-safe collection. (It would be fine to use relaxed atomic stores and loads for an index into a constant lookup table, or one that's synchronized separately. You only need acq/rel synchronization if the pointed-to or indexed data was written by the same thread.) This is faster (at least on some architectures) than using stronger memory orders but can be used in fewer cases.
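For example, a plain statistics counter that no other data hangs off of is a case where relaxed is all you need (a minimal sketch, names are illustrative):

#include <atomic>
#include <cstddef>

std::atomic<std::size_t> events_handled{0};

void on_event() {
    // Nothing is "published" through this counter, so no acquire/release
    // ordering is needed; we only need the increment itself to be atomic.
    events_handled.fetch_add(1, std::memory_order_relaxed);
}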

Great, but even this is not the full answer. I said memory orders are not used for atomics. I was half-lying.

With relaxed memory order, atomics are still thread safe, but they have a downside - they can be reordered. Look at the following snippet:

a.store(1, std::memory_order_relaxed);
b.store(2, std::memory_order_relaxed);

In reality, a.store can happen after b.store. The CPU does this all the time; it's called out-of-order execution, and it's one of the optimization techniques CPUs use to speed up execution. a and b are still thread-safe, even though the thread-safe stores might happen in reverse order.

Now, what happens if there is a meaning for the order? Many lock-free algorithms depend on the order of atomic operations for their correctness.

Memory orders are also used to prevent reordering. This is part of why memory orders are so complicated: they do two things at the same time.

memory_order_acquire tells the compiler and CPU not to move operations that appear after it in the code to before it.

Similarly, memory_order_release tells the compiler and CPU not to move operations that appear before it in the code to after it.

memory_order_relaxed tells the compiler/CPU that the atomic operation can be reordered when possible, in a similar way to how non-atomic operations are reordered whenever possible.

Bechuana answered 9/9, 2020 at 13:39 Comment(11)
Hmm. "Thread safe" has stronger implications than simply "free from data races", which I think is all that relaxed atomics get you. Also, I think you have oversimplified memory-order semantics to the point of being misleading.Methaemoglobin
Memory ordering is not just for non-atomic data; acquire/release synchronization also guarantees visibility of mo_relaxed stores. So the key point is that ordering is wrt. other objects, whether they're atomic or not. But yes, agreed with the overall point you're making.Workshop
Guys, books were written about these topics; one cannot cover them all. This answer is super simplified and I'm not ashamed of it. The person is a clear beginner in concurrency; there is no need to bomb him with tons of new knowledge he does not understand. There is nothing wrong with simplifying at first, having a basic understanding of the subject, and then fine-tuning your knowledge.Bechuana
@DavidHaim, no one is asking you to go into deep, gory detail, but especially for a neophyte, wording that gives a misleading impression is worse than just omitting or glossing over details.Methaemoglobin
So basically, you want me to write a cppreference-style answer like the ones that made the OP come here and ask this question to begin with? Thanks, I'll pass.Bechuana
The CPU does this all the times, it's called Out of Order Execution - To be precise, memory reordering (from the POV of other threads) can be separate from out-of-order execution. It can happen on in-order CPUs, especially StoreLoad reordering (by having a store buffer at all), but StoreStore reordering like your example can happen on any CPU that allows out-of-order commit from the store buffer to L1d cache. (e.g. if the first store misses in cache, allowing the 2nd to commit earlier). Out-of-order instruction execution: is commit order preserved?Workshop
Thanks for the great answer. I understand the acquire and release semantics, since they use different load and store instructions. Sorry, but my actual question is: the generated asm is the same, so where is the difference? How can the CPU guarantee the atomic load isn't missed when it uses the same instruction as a non-atomic one?Ifill
Some CPU architectures, like x86, provide some assembly instructions that are always atomic anyway. On x86, an aligned mov is atomic anyway. So compiling an atomic store on x86 is equivalent to compiling a non-atomic store - they are atomic at the hardware level anyway. Other operations, like inc, are not atomic. When compiling fetch_add on x86, the compiler has to insert a lock prefix to make inc atomic. On ARM, the story could be entirely different. I'm not an ARM expert, but from my knowledge, no assembly instruction is "intrinsically" atomic.Bechuana
This is why we need the C++ standard: C++ basically tells the developer "Let's talk about an abstract CPU with multiple cores." The compiler will handle the differences between what we discuss in our imaginary world and the real hardware and do the right thing - for example, turning an atomic store into a plain store on x86, because it doesn't matter for that specific CPU.Bechuana
OK thanks, now I get your point. str and ldr here are decided by the compiler, so it's better to use atomic, because in some situations or on some architectures the compiler might use another instruction.Ifill
And your answer makes release/acquire semantics very simple to understand, I think :)Ifill

atomic<T> constrains the optimizer to not assume the value is unchanged between accesses in the same thread.

atomic<T> also makes sure the object is sufficiently aligned: e.g. some C++ implementations for 32-bit ISAs have alignof(int64_t) = 4 but alignof(atomic<int64_t>) = 8 to enable lock-free 64-bit operations. (e.g. gcc for 32-bit x86 GNU/Linux). In that case, usually a special instruction is needed that the compiler might not use otherwise, e.g. ARMv8 32-bit ldp load-pair, or x86 SSE2 movq xmm before bouncing to integer regs.
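You can check this on any implementation with a quick print (on gcc for 32-bit x86 GNU/Linux this shows 4 and 8; on typical 64-bit targets both are 8):

#include <atomic>
#include <cstdint>
#include <iostream>

int main() {
    std::cout << alignof(std::int64_t) << '\n'                // 4 on some 32-bit ABIs
              << alignof(std::atomic<std::int64_t>) << '\n';  // 8, so 64-bit ops can be lock-free
}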


In asm for most ISAs, pure-load and pure-store of naturally-aligned int and long are atomic for free, so atomic<T> with memory_order_relaxed can compile to the same asm as plain variables; atomicity (no tearing) doesn't require any special asm. For example: Why is integer assignment on a naturally aligned variable atomic on x86? Depending on surrounding code, the compiler might not manage to optimize out any accesses to non-atomic objects, in which case code-gen will be the same between plain T and atomic<T> with mo_relaxed.

The reverse is not true: It's not at all safe to write C++ as if you were writing in asm. In C++, multiple threads accessing the same object at the same time is data-race undefined behaviour, unless all the accesses are reads.

Thus C++ compilers are allowed to assume that no other threads are changing a variable in a loop, per the "as-if" optimization rule. If bool done is not atomic, a loop like while(!done) { } will compile into if(!done) infinite_loop;, hoisting the load out of the loop. See Multithreading program stuck in optimized mode but runs normally in -O0 for a detailed example with compiler asm output. (Compiling with optimization disabled is very similar to making every object volatile: memory in sync with the abstract machine between C++ statements for consistent debugging.)
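A minimal sketch of that hoisting hazard (hypothetical names; the exact code-gen is compiler- and target-dependent):

#include <atomic>

bool done_plain = false;                 // data-race UB if another thread writes it
std::atomic<bool> done_atomic{false};

void spin_plain() {
    // The compiler may load done_plain once and hoist it out of the loop,
    // effectively compiling this to: if (!done_plain) for (;;) {}
    while (!done_plain) { }
}

void spin_atomic() {
    // The atomic load must be repeated every iteration, even with mo_relaxed,
    // so the loop exits once another thread stores true.
    while (!done_atomic.load(std::memory_order_relaxed)) { }
}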


Also obviously RMW operations like += or var.fetch_add(1, mo_seq_cst) are atomic and do have to compile to different asm than non-atomic +=. Can num++ be atomic for 'int num'?
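For example (a minimal sketch; on x86 the atomic version compiles to a lock-prefixed instruction, while the plain version is a separate load/add/store whose increments can be lost under contention):

#include <atomic>

int plain_num = 0;
std::atomic<int> atomic_num{0};

void bump() {
    ++plain_num;    // load + add + store; concurrent increments can step on each other (and it's UB)
    ++atomic_num;   // one atomic RMW, equivalent to atomic_num.fetch_add(1) with seq_cst
}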


The constraints on the optimizer placed by atomic operations are similar to what volatile does. In practice volatile is a way to roll your own mo_relaxed atomic<T>, but without any easy way to get ordering wrt. other operations. It's de-facto supported on some compilers, like GCC, because it's used by the Linux kernel. However, atomic<T> is guaranteed to work by the ISO C++ standard; When to use volatile with multi threading? - there's almost never a reason to roll your own, just use atomic<T> with mo_relaxed.

Also related: Why don't compilers merge redundant std::atomic writes? / Can and does the compiler optimize out two atomic loads? - compilers currently don't optimize atomics at all, so atomic<T> is currently equivalent to volatile atomic<T>, pending further standards work to provide ways for programmers to control when / what optimization would be ok.

Workshop answered 9/9, 2020 at 17:7 Comment(0)
