As I see from a test-case: https://godbolt.org/z/K477q1
The generated assembly for a relaxed atomic load/store is the same as for a normal variable: ldr and str.
So, is there any difference between a relaxed atomic and a normal variable?
The difference is that a normal load/store is not guaranteed to be tear-free, whereas a relaxed atomic read/write is. Also, the atomic guarantees that the compiler doesn't rearrange or optimise out memory accesses, in a similar fashion to what volatile guarantees.
(Pre-C++11, volatile was an essential part of rolling your own atomics. But now it's obsolete for that purpose. It does still work in practice, but it's never recommended: When to use volatile with multi threading? - essentially never.)
On most platforms it just happens that the architecture provides a tear-free load/store by default (for aligned int and long), so it works out the same in asm if loads and stores don't get optimized away. See Why is integer assignment on a naturally aligned variable atomic on x86? for example. In C++ it's up to you to express how the memory should be accessed in your source code, instead of relying on architecture-specific features to make the code work as intended.
If you were hand-writing in asm, your source code would already nail down when values were kept in registers vs. loaded/stored to (shared) memory. In C++, telling the compiler when it can and can't keep values private is part of why std::atomic<T> exists.
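As a sketch of what this looks like in source form (writer/reader are illustrative names, not taken from the question's godbolt link):

#include <atomic>
#include <cstdint>

std::uint32_t plain = 0;
std::atomic<std::uint32_t> shared{0};

// On AArch64 both stores typically compile to a plain str, and the load to
// a plain ldr. The difference is in the guarantees: the atomic accesses are
// tear-free and well-defined even with a concurrent writer, and the compiler
// won't invent, remove, or cache them in a register across iterations.
void writer() {
    plain = 0xDEADBEEF;                                  // data race if read concurrently
    shared.store(0xDEADBEEF, std::memory_order_relaxed); // safe, still a single str
}

std::uint32_t reader() {
    return shared.load(std::memory_order_relaxed);       // one ldr, never a torn value
}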
If you read one article on this topic, take a look at the Preshing one here: https://preshing.com/20130618/atomic-vs-non-atomic-operations/
Also try this presentation from CppCon 2017: https://www.youtube.com/watch?v=ZQFzMfHIxng
Links for further reading:
https://en.cppreference.com/w/cpp/atomic/memory_order#Relaxed_ordering
What is the (slight) difference on the relaxing atomic rules? - which includes a link to Herb Sutter's "atomic weapons" talk, also linked here: https://herbsutter.com/2013/02/11/atomic-weapons-the-c-memory-model-and-modern-hardware/
Also see Peter Cordes' linked article: https://electronics.stackexchange.com/q/387181
And a related one about the Linux kernel: https://lwn.net/Articles/793253/
No tearing is only part of what you get with std::atomic<T> - you also avoid data race undefined behaviour. What std::atomic<T> gives you (even with mo_relaxed) is well-defined behaviour even with unsynchronized reads and writes. Non-atomic reads can be hoisted out of loops because of the as-if rule + data-race UB. See MCU programming - C++ O2 optimization breaks while loop. – Workshop
(That question is about volatile, rather than C11 _Atomic or C++11 std::atomic, but same difference; the assumption of the underlying asm operation being free from tearing is what makes volatile work as a legacy way to do atomics. Plus the fact that hardware has coherent caches.) – Workshop
volatile works on real hardware because caches are coherent. The Linux kernel uses it successfully. There are no C++ implementations that run std::thread threads across cores with non-coherent caches. – Workshop
volatile is not enough of a guarantee that a read by CPU1 via its data cache will immediately see a write from CPU2 (via its data cache and a flush to RAM); wouldn't you need to add acquire and release memory barriers to guarantee that? – Andersen
False sharing: youtube.com/watch?v=dznxqe1Uk3E – Andersen
Very good question actually, and I asked the same question when I started learning concurrency.
I'll answer as simply as possible, even though the full answer is a bit more complicated.
Reading and writing the same non-atomic variable from different threads (where at least one access is a write) is undefined behavior - one thread is not guaranteed to read the value that the other thread wrote.
Using an atomic variable solves the problem: with atomics, all threads are guaranteed to read the latest written value, even if the memory order is relaxed.
In fact, atomics are always thread safe, regardless of the memory order! The memory order is not for the atomics themselves - it's for non-atomic data.
Here is the thing: if you use locks, you don't have to think about those low-level things. Memory orders are used in lock-free environments where we need to synchronize non-atomic data.
Here is the beautiful thing about lock-free algorithms: we use atomic operations that are always thread safe, but we "piggyback" those operations with memory orders to synchronize the non-atomic data used in those algorithms.
For example, a lock-free linked list. Usually, a lock-free linked list node looks something like this:
template <typename T>
struct Node {
    std::atomic<Node*> next_node;  // the atomic link: always safe to access concurrently
    T non_atomic_data;             // plain payload: needs the right memory order to be published
};
Now, let's say I push a new node into the list. next_node is always thread safe; another thread will always see the latest atomic value.
But who guarantees that other threads see the correct value of non_atomic_data?
No one.
Here is a perfect example of the usage of memory orders - we "piggyback" the atomic stores and loads to next_node by also adding memory orders that synchronize the value of non_atomic_data.
So when we store a new node to the list, we use memory_order_release to "push" the non-atomic data to main memory. When we read the new node by reading next_node, we use memory_order_acquire, and then we "pull" the non-atomic data from main memory. This way we ensure that both next_node and non_atomic_data are always synchronized across threads.
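A minimal sketch of that piggybacking, assuming a stack-style list with a global head pointer (head, push, and read_first are illustrative names, not from the original answer; the payload is an int for concreteness):

#include <atomic>

struct Node {
    std::atomic<Node*> next_node{nullptr};
    int non_atomic_data = 0;
};

std::atomic<Node*> head{nullptr};

void push(int value) {
    Node* n = new Node;
    n->non_atomic_data = value;                       // plain write; published below
    Node* old = head.load(std::memory_order_relaxed);
    do {
        n->next_node.store(old, std::memory_order_relaxed);
    } while (!head.compare_exchange_weak(old, n,
                                         std::memory_order_release,    // success: publishes non_atomic_data
                                         std::memory_order_relaxed));  // failure: old is refreshed, retry
}

int read_first() {
    // Acquire pairs with the release store above: if we see the new node,
    // we are guaranteed to also see its non_atomic_data.
    Node* n = head.load(std::memory_order_acquire);
    return n ? n->non_atomic_data : -1;
}

(A real list would also need a safe way to reclaim removed nodes, e.g. hazard pointers or RCU, which is out of scope here.)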
memory_order_relaxed doesn't synchronize any non-atomic data; it synchronizes only itself - the atomic variable. When this is used, developers can assume that the atomic variable doesn't reference any non-atomic data published by the same thread that wrote the atomic variable. In other words, that atomic variable isn't, for example, an index into a non-atomic array, or a pointer to non-atomic data, or an iterator into some non-thread-safe collection. (It would be fine to use relaxed atomic stores and loads for an index into a constant lookup table, or one that's synchronized separately. You only need acq/rel synchronization if the pointed-to or indexed data was written by the same thread.)
This is faster (at least on some architectures) than using stronger memory orders, but it can be used in fewer cases.
Great, but even this is not the full answer. I said memory orders are not used for atomics; I was half-lying.
With a relaxed memory order, atomics are still thread safe, but they have a downside - they can be reordered. Look at the following snippet:
a.store(1, std::memory_order_relaxed);
b.store(2, std::memory_order_relaxed);
In reality, a.store can happen after b.store. The CPU does this all the time; it's called Out-of-Order Execution, and it's one of the optimization techniques CPUs use to speed up execution. a and b are still thread-safe, even though the thread-safe stores might happen in reverse order.
Now, what happens if the order is meaningful? Many lock-free algorithms depend on the order of atomic operations for their correctness.
Memory orders are also used to prevent reordering. This is why memory orders are so complicated: they do two things at the same time.
memory_order_acquire tells the compiler and CPU not to execute operations that come after it code-wise before it.
Similarly, memory_order_release tells the compiler and CPU not to execute operations that come before it code-wise after it.
memory_order_relaxed tells the compiler/CPU that the atomic operation can be reordered where possible, in a similar way to how non-atomic operations are reordered whenever possible.
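For example, here is a sketch (with hypothetical names) of why the order can matter:

#include <atomic>

int data = 0;                    // non-atomic payload
std::atomic<int> data_ready{0};

void producer() {
    data = 42;
    // With relaxed, this store may become visible before the write to data,
    // so a reader can see data_ready == 1 while data still looks like 0.
    // memory_order_release here (paired with acquire below) forbids that.
    data_ready.store(1, std::memory_order_relaxed);
}

int consumer() {
    // With relaxed there is no synchronizes-with edge to the producer,
    // so reading data here is a data race (UB); acquire here plus release
    // above would make it well-defined.
    if (data_ready.load(std::memory_order_relaxed))
        return data;
    return -1;
}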
(Compile-time reordering, not just CPU out-of-order execution, also applies to mo_relaxed stores.) So the key point is that ordering is wrt. other objects, whether they're atomic or not. But yes, agreed with the overall point you're making. – Workshop
So an atomic load is emitted with the same instruction as a non-atomic one? – Ifill
mov is anyway atomic. So compiling an atomic store on x86 is equivalent to compiling a non-atomic store - they are anyway atomic at the hardware level. Other operations, like inc, are not atomic: when compiling fetch_add on x86, the compiler has to insert a lock prefix to make inc atomic. On ARM, the story could be entirely different. I'm not an ARM expert, but from my knowledge, no assembly instruction is "intrinsically" atomic. – Bechuana
atomic<T> constrains the optimizer to not assume the value is unchanged between accesses in the same thread.
atomic<T> also makes sure the object is sufficiently aligned: e.g. some C++ implementations for 32-bit ISAs have alignof(int64_t) == 4 but alignof(atomic<int64_t>) == 8, to enable lock-free 64-bit operations (e.g. gcc for 32-bit x86 GNU/Linux). In that case, usually a special instruction is needed that the compiler might not use otherwise, e.g. ARMv8 32-bit ldp load-pair, or x86 SSE2 movq xmm before bouncing to integer regs.
In asm for most ISAs, pure-load and pure-store of naturally-aligned int and long are atomic for free, so atomic<T> with memory_order_relaxed can compile to the same asm as plain variables; atomicity (no tearing) doesn't require any special asm. For example: Why is integer assignment on a naturally aligned variable atomic on x86? Depending on the surrounding code, the compiler might not manage to optimize out any accesses to non-atomic objects, in which case code-gen will be the same between plain T and atomic<T> with mo_relaxed.
The reverse is not true: It's not at all safe to write C++ as if you were writing in asm. In C++, multiple threads accessing the same object at the same time is data-race undefined behaviour, unless all the accesses are reads.
Thus C++ compilers are allowed to assume that no other threads are changing a variable in a loop, per the "as-if" optimization rule. If bool done is not atomic, a loop like while(!done) { } will compile into if(!done) infinite_loop;, hoisting the load out of the loop. See Multithreading program stuck in optimized mode but runs normally in -O0 for a detailed example with compiler asm output. (Compiling with optimization disabled is very similar to making every object volatile: memory is kept in sync with the abstract machine between C++ statements, for consistent debugging.)
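A minimal sketch of that hoisting problem (hypothetical names):

#include <atomic>

bool done_plain = false;               // written by another thread: data-race UB
std::atomic<bool> done{false};

void spin_plain() {
    // At -O2 the compiler may hoist the load, turning this into
    // if (!done_plain) { /* infinite loop */ }
    while (!done_plain) {}
}

void spin_atomic() {
    // Relaxed is enough to force a fresh load every iteration; the loop
    // exits once another thread's store becomes visible.
    while (!done.load(std::memory_order_relaxed)) {}
}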
Also, obviously, RMW operations like += or var.fetch_add(1, mo_seq_cst) are atomic and do have to compile to different asm than a non-atomic +=. See Can num++ be atomic for 'int num'?
The constraints placed on the optimizer by atomic operations are similar to what volatile does. In practice volatile is a way to roll your own mo_relaxed atomic<T>, but without any easy way to get ordering wrt. other operations. It's de-facto supported on some compilers, like GCC, because it's used by the Linux kernel. However, atomic<T> is guaranteed to work by the ISO C++ standard; see When to use volatile with multi threading? - there's almost never a reason to roll your own, just use atomic<T> with mo_relaxed.
Also related: Why don't compilers merge redundant std::atomic writes? / Can and does the compiler optimize out two atomic loads? - compilers currently don't optimize atomics at all, so atomic<T> is currently equivalent to volatile atomic<T>, pending further standards work to provide ways for programmers to control when / what optimization would be OK.
volatile doesn't protect from multithreading issues. Since hardware has changed a lot since this keyword was introduced, it has become useless. The problem is cache synchronization, which volatile is unable to handle. I can't find a link where this is nicely explained; I will provide it when I find it. – Roughhew
You can roll your own memory_order_relaxed with volatile on real hardware, but don't, because there's no benefit. When to use volatile with multi threading? - basically never. But it does work in practice; the Linux kernel uses volatile this way. Real hardware (that we run threads and single-system-image OSes across) has coherent caches; please don't spread misinformation about how CPUs work. (It's not guaranteed safe by ISO C of course; volatile doesn't change the fact that it's data-race UB.) – Workshop