C++20 includes specializations for atomic<float> and atomic<double>. Can anyone here explain what practical purpose these serve? The only purpose I can imagine is when I have a thread that changes an atomic double or float asynchronously at random points and other threads read these values asynchronously (but a volatile double or float should in fact do the same on most platforms). But the need for this should be extremely rare. I think this rare case couldn't justify inclusion in the C++20 standard.
atomic<float> and atomic<double> have existed since C++11. The atomic<T> template works for arbitrary trivially-copyable T. Everything you could hack up with legacy pre-C++11 use of volatile for shared variables can be done with C++11 atomic<double> with std::memory_order_relaxed.
What didn't exist until C++20 are atomic RMW operations like x.fetch_add(3.14); or for short x += 3.14. (The question "Why isn't atomic double fully implemented" wonders why not.) Those member functions were only available in the atomic integer specializations, so you could only load, store, exchange, and CAS on float and double, as for an arbitrary T such as a class type.
See Atomic double floating point or SSE/AVX vector load/store on x86_64 for details on how to roll your own with compare_exchange_weak, and how that (and pure load, pure store, and exchange) compiles in practice with GCC and clang for x86. (Not always optimal: GCC bounces to integer registers unnecessarily.) It also covers the lack of atomic<__m128i> load/store, because vendors won't publish real guarantees to let us take advantage (in a future-proof way) of what current HW does.
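As a minimal sketch of the roll-your-own approach mentioned above, valid from C++11 on: a compare_exchange_weak retry loop that behaves like fetch_add for atomic<double>. The name fetch_add_cas is hypothetical, not a standard function.

```cpp
#include <atomic>

// Hypothetical helper (not part of the standard library): emulates
// fetch_add for std::atomic<double> with a compare-exchange retry loop.
inline double fetch_add_cas(std::atomic<double>& target, double addend,
                            std::memory_order order = std::memory_order_seq_cst) {
    double expected = target.load(std::memory_order_relaxed);
    // On failure, compare_exchange_weak overwrites `expected` with the value
    // currently stored, so every retry recomputes the sum from fresh data.
    while (!target.compare_exchange_weak(expected, expected + addend,
                                         order, std::memory_order_relaxed)) {
    }
    return expected;  // the value held just before our addition
}
```

With C++20 library support, target.fetch_add(addend, order) expresses the same thing directly.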
These new specializations provide maybe some efficiency (on non-x86) and convenience with fetch_add and fetch_sub (and the equivalent += and -= overloads). Only those 2 operations are supported, not fetch_mul or anything else. See the current draft of 31.8.3 Specializations for floating-point types, and cppreference std::atomic.
It's not like the committee went out of their way to introduce new FP-relevant atomic RMW member functions like fetch_mul, min, max, or even absolute value or negation. Those last two are ironically easier in asm: just a bitwise AND or XOR to clear or flip the sign bit, which can be done with x86 lock and if the old value isn't needed. Actually, since carry-out from the MSB doesn't matter, 64-bit lock xadd can implement fetch_xor with 1ULL<<63. This assumes IEEE754-style sign/magnitude FP, of course. It is similarly easy on LL/SC machines that can do 4-byte or 8-byte fetch_xor, and they can easily keep the old value in a register.
So the one thing that could be done significantly more efficiently in x86 asm than in portable C++ without union hacks (atomic bitwise ops on FP bit patterns) still isn't exposed by ISO C++.
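For illustration, here is a sketch of the workaround available today: keep the double's bit pattern in an atomic<uint64_t> and use the integer bitwise RMW operations on it. It assumes a 64-bit IEEE754 double, and the helper names (to_bits, from_bits, fetch_negate, fetch_abs) are made up for this example.

```cpp
#include <atomic>
#include <cstdint>
#include <cstring>

static_assert(sizeof(double) == sizeof(std::uint64_t),
              "sketch assumes a 64-bit double");

// Hypothetical helpers: convert between a double and its bit pattern
// without union type-punning.
inline std::uint64_t to_bits(double d) {
    std::uint64_t u;
    std::memcpy(&u, &d, sizeof u);
    return u;
}
inline double from_bits(std::uint64_t u) {
    double d;
    std::memcpy(&d, &u, sizeof d);
    return d;
}

constexpr std::uint64_t kSignBit = std::uint64_t{1} << 63;

// Atomically flips the sign bit; returns the value seen before the flip.
inline double fetch_negate(std::atomic<std::uint64_t>& bits) {
    return from_bits(bits.fetch_xor(kSignBit));
}

// Atomically clears the sign bit; returns the value seen before the clear.
inline double fetch_abs(std::atomic<std::uint64_t>& bits) {
    return from_bits(bits.fetch_and(~kSignBit));
}
```

The cost is that the shared variable has to be declared as an integer, so every ordinary read needs a from_bits() conversion.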
It makes sense that the integer specializations don't have fetch_mul: integer add is much cheaper, typically 1 cycle latency, the same level of complexity as atomic CAS. But for floating point, multiply and add are both quite complex and typically have similar latency. Moreover, if atomic RMW fetch_add is useful for anything, I'd assume fetch_mul would be, too. Again, this is unlike integer, where lockless algorithms commonly add/sub but very rarely need to build an atomic shift or mul out of a CAS. x86 doesn't have memory-destination multiply, so it has no direct HW support for lock imul.
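If you do need an atomic multiply, it has to be built from a CAS loop. A rough sketch follows; fetch_apply and fetch_mul are hypothetical names, not standard members.

```cpp
#include <atomic>

// Hypothetical helper: atomically replaces the stored value with op(old)
// and returns old, retrying whenever another thread changed it first.
template <class T, class Op>
T fetch_apply(std::atomic<T>& target, Op op,
              std::memory_order order = std::memory_order_seq_cst) {
    T expected = target.load(std::memory_order_relaxed);
    // A failed compare_exchange_weak refreshes `expected`, so op() is
    // re-evaluated on the latest value each time around the loop.
    while (!target.compare_exchange_weak(expected, op(expected),
                                         order, std::memory_order_relaxed)) {
    }
    return expected;
}

// Hypothetical fetch_mul built on top of it.
inline double fetch_mul(std::atomic<double>& target, double factor) {
    return fetch_apply(target, [factor](double old) { return old * factor; });
}
```

The same helper covers min, max or any other pure function of the old value, at the price of possibly evaluating it more than once under contention.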
It seems like this is more a matter of bringing atomic<double> up to the level you might naively expect (supporting .fetch_add and sub like integers), not of providing a serious library of atomic RMW FP operations. Perhaps that makes it easier to write templates that don't have to check for integral, just numeric, types?
Can anyone here explain for what practical purpose this should be good for?
For pure store / pure load, maybe some global scale factor that you want to be able to publish to all threads with a simple store? And readers load it before every work unit or something. Or just as part of a lockless queue or stack of double.
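A sketch of that publish/read pattern, which already works in C++11. The names g_scale, publish_scale and scale_work_unit are invented for the example.

```cpp
#include <atomic>
#include <cstddef>

// Hypothetical global scale factor shared by all threads.
std::atomic<double> g_scale{1.0};

// Writer side: a single relaxed store publishes the new factor.
inline void publish_scale(double s) {
    g_scale.store(s, std::memory_order_relaxed);
}

// Reader side: load once per work unit, then reuse the local copy so the
// whole unit is scaled by one consistent factor.
inline void scale_work_unit(double* data, std::size_t n) {
    const double s = g_scale.load(std::memory_order_relaxed);
    for (std::size_t i = 0; i < n; ++i) {
        data[i] *= s;
    }
}
```

Relaxed ordering is enough here only because the factor stands alone; if readers also had to see other data written before the store, release/acquire would be needed.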
It's not a coincidence that it took until C++20 for anyone to say "we should provide fetch_add for atomic<double> in case anyone wants it."
Plausible use-case: to manually multi-thread the sum of an array (instead of using #pragma omp parallel for simd reduction(+:my_sum_variable) or a standard <algorithm> like std::accumulate with a C++17 parallel execution policy).
The parent thread might start with atomic<double> total = 0; and pass a pointer to it to each thread. Then threads do *totalptr += sum_region(array+TID*size, size) to accumulate the results, instead of having a separate output variable for each thread and collecting the results in one caller. It's not bad for contention unless all threads finish at nearly the same time. (Which is not unlikely, but it's at least a plausible scenario.)
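Roughly, that could look like the sketch below. parallel_sum, sum_region and add_to are hypothetical names, nthreads is assumed to be greater than zero, and the CAS fallback is only there for standard libraries that don't yet report C++20 floating-point atomics.

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Adds `value` to `total`: C++20 fetch_add when the library advertises it,
// otherwise the C++11 compare-exchange loop.
inline void add_to(std::atomic<double>& total, double value) {
#if defined(__cpp_lib_atomic_float)
    total.fetch_add(value, std::memory_order_relaxed);
#else
    double expected = total.load(std::memory_order_relaxed);
    while (!total.compare_exchange_weak(expected, expected + value,
                                        std::memory_order_relaxed)) {
    }
#endif
}

// Plain single-threaded sum of one slice.
inline double sum_region(const double* p, std::size_t n) {
    double s = 0.0;
    for (std::size_t i = 0; i < n; ++i) s += p[i];
    return s;
}

// Splits `data` into nthreads slices; each worker sums its slice locally
// and touches the shared total exactly once. Assumes nthreads > 0.
inline double parallel_sum(const std::vector<double>& data, unsigned nthreads) {
    std::atomic<double> total{0.0};
    std::vector<std::thread> workers;
    const std::size_t chunk = (data.size() + nthreads - 1) / nthreads;
    for (unsigned tid = 0; tid < nthreads; ++tid) {
        const std::size_t begin = std::min(data.size(), tid * chunk);
        const std::size_t end = std::min(data.size(), begin + chunk);
        workers.emplace_back([&total, &data, begin, end] {
            add_to(total, sum_region(data.data() + begin, end - begin));
        });
    }
    for (auto& w : workers) w.join();
    // join() synchronizes with each worker, so all additions are visible.
    return total.load(std::memory_order_relaxed);
}
```

Note that the order in which partial sums arrive is not deterministic, so with arbitrary inputs the last bits of the result can vary from run to run due to FP rounding.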
If you just want separate load and separate store atomicity like you're hoping for from volatile, you already have that with C++11.
Don't use volatile for threading: use atomic<T> with mo_relaxed
See When to use volatile with multi threading? for details on mo_relaxed atomic vs. legacy volatile for multithreading. volatile data races are UB, but it does work in practice as part of roll-your-own atomics on compilers that support it, with inline asm needed if you want any ordering wrt. other operations, or if you want RMW atomicity instead of separate load / ALU / separate store. All mainstream CPUs have coherent cache/shared memory. But with C++11 there's no reason to do that: std::atomic<> obsoleted hand-rolled volatile shared variables.
At least in theory. In practice some compilers (like GCC) still have missed optimizations for atomic<double> / atomic<float> even for just simple load and store. (And the C++20 new overloads aren't implemented yet on Godbolt.) atomic<integer> is fine though, and does optimize as well as volatile or plain integer + memory barriers.
In some ABIs (like 32-bit x86), alignof(double) is only 4. Compilers normally align it by 8, but inside structs they have to follow the ABI's struct packing rules, so an under-aligned volatile double is possible. Tearing will be possible in practice if it splits a cache-line boundary, or on some AMD an 8-byte boundary. atomic<double> instead of volatile can plausibly matter for correctness on some real platforms, even when you don't need atomic RMW, e.g. this G++ bug, which was fixed by increasing alignment using alignas() in the std::atomic<> implementation for objects small enough to be lock_free.
(And of course there are platforms where an 8-byte store isn't naturally atomic, so to avoid tearing you need a fallback to a lock. If you care about such platforms, a publish-occasionally model should use a hand-rolled SeqLock or atomic<float> if atomic<double> isn't always_lock_free.)
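One way to express that fallback at compile time, as a sketch that assumes C++17 (for is_always_lock_free). The alias shared_real and the functions publish and latest are hypothetical names.

```cpp
#include <atomic>
#include <type_traits>

// Hypothetical alias: double where atomic<double> is always lock-free,
// otherwise fall back to float.
using shared_real =
    std::conditional<std::atomic<double>::is_always_lock_free,
                     double, float>::type;

// If even float isn't lock-free, this sketch doesn't apply and something
// like a SeqLock would be needed instead.
static_assert(std::atomic<shared_real>::is_always_lock_free,
              "no lock-free floating-point atomic on this target");

std::atomic<shared_real> g_published{shared_real(0)};

// Occasional writer.
inline void publish(double v) {
    g_published.store(static_cast<shared_real>(v), std::memory_order_relaxed);
}

// Frequent readers.
inline double latest() {
    return g_published.load(std::memory_order_relaxed);
}
```

The float fallback loses precision, so it only makes sense when readers can tolerate that.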
You can get the same efficient code-gen (without extra barrier instructions) from atomic<T> using mo_relaxed as you can with volatile. Unfortunately in practice, not all compilers have efficient atomic<double>. For example, GCC9 for x86-64 copies from XMM to general-purpose integer registers.
#include <atomic>

volatile double vx;
std::atomic<double> ax;
double px; // plain x

void FP_non_RMW_increment() {
    px += 1.0;
    vx += 1.0; // equivalent to vx = vx + 1.0
    ax.store( ax.load(std::memory_order_relaxed) + 1.0, std::memory_order_relaxed);
}

#if __cplusplus > 201703L // is there a number for C++2a yet?
// C++20 only, not yet supported by libstdc++ or libc++
void atomic_RMW_increment() {
    ax += 1.0; // seq_cst
    ax.fetch_add(1.0, std::memory_order_relaxed);
}
#endif
Godbolt GCC9 for x86-64, gcc -O3. (Also included an integer version)
FP_non_RMW_increment():
movsd xmm0, QWORD PTR .LC0[rip] # xmm0 = double 1.0
movsd xmm1, QWORD PTR px[rip] # load
addsd xmm1, xmm0 # plain x += 1.0
movsd QWORD PTR px[rip], xmm1 # store
movsd xmm1, QWORD PTR vx[rip]
addsd xmm1, xmm0 # volatile x += 1.0
movsd QWORD PTR vx[rip], xmm1
mov rax, QWORD PTR ax[rip] # integer load
movq xmm2, rax # copy to FP register
addsd xmm0, xmm2 # atomic x += 1.0
movq rax, xmm0 # copy back to integer
mov QWORD PTR ax[rip], rax # store
ret
clang compiles it efficiently, with the same move-scalar-double load and store for ax as for vx and px.
Fun fact: C++20 apparently deprecates vx += 1.0. Perhaps this is to help avoid confusion between separate load and store like vx = vx + 1.0 vs. atomic RMW? To make it clear there are 2 separate volatile accesses in that statement?
<source>: In function 'void FP_non_RMW_increment()':
<source>:9:8: warning: compound assignment with 'volatile'-qualified left operand is deprecated [-Wvolatile]
9 | vx += 1.0; // equivalent to vx = vx + 1.0
| ~~~^~~~~~
Note that x = x + 1 is not the same thing as x += 1 for atomic<T> x: the former loads into a temporary, adds, then stores. (With sequential-consistency for both.)
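A small sketch of the observable difference, using atomic<int> so it compiles before C++20; the function names are made up. Two threads increment the same counter, once with the RMW form and once with the load-then-store form.

```cpp
#include <atomic>
#include <thread>

// x += 1 is one indivisible read-modify-write, so no increment is lost.
inline int count_with_rmw(int iters) {
    std::atomic<int> x{0};
    auto work = [&x, iters] {
        for (int i = 0; i < iters; ++i) x += 1;
    };
    std::thread a(work), b(work);
    a.join();
    b.join();
    return x.load();  // always 2 * iters
}

// x = x + 1 is a seq_cst load followed by a separate seq_cst store.
// Each access is atomic, but another thread's update can land in between
// and then be overwritten.
inline int count_with_load_store(int iters) {
    std::atomic<int> x{0};
    auto work = [&x, iters] {
        for (int i = 0; i < iters; ++i) x = x + 1;
    };
    std::thread a(work), b(work);
    a.join();
    b.join();
    return x.load();  // at most 2 * iters, and often less
}
```

Neither version has a data race in the C++ sense; the second one just implements a different (lossy) algorithm.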
EDIT: Adding Ulrich Eckhardt's comment to clarify: 'Let me try to rephrase that: Even if volatile on one particular platform/environment/compiler did the same thing as atomic<>, down to the generated machine code, then atomic<> is still much more expressive in its guarantees and furthermore, it is guaranteed to be portable. Moreover, when you can write self-documenting code, then you should do that.'
Volatile sometimes has the below 2 effects:
- Prevents compilers from caching the value in a register.
- Prevents optimizing away accesses to that value when they seem unnecessary from the POV of your program.
See also Understanding volatile keyword in c++
TL;DR:
Be explicit about what you want.
- Do not rely on 'volatile' to do what you want, if 'what' is not the original purpose of volatile, e.g. enabling external sensors or DMA to change a memory address without the compiler interfering.
- If you want an atomic, use std::atomic.
- If you want to disable strict aliasing optimizations, do like the Linux kernel, and disable strict aliasing optimizations on e.g. gcc.
- If you want to disable other kinds of compiler optimizations, use compiler intrinsics or code explicit assembly for e.g ARM or x86_64.
- If you want 'restrict' keyword semantics like in C, use the corresponding restrict intrinsic in C++ on your compiler, if available.
- In short, do not rely on compiler- and CPU-family dependent behavior if constructs provided by the standard are clearer and more portable. Use e.g. godbolt.org to compare the assembler output if you believe your 'hack' is more efficient than doing it the right way.
From std::memory_order
Relationship with volatile
Within a thread of execution, accesses (reads and writes) through volatile glvalues cannot be reordered past observable side-effects (including other volatile accesses) that are sequenced-before or sequenced-after within the same thread, but this order is not guaranteed to be observed by another thread, since volatile access does not establish inter-thread synchronization.
In addition, volatile accesses are not atomic (concurrent read and write is a data race) and do not order memory (non-volatile memory accesses may be freely reordered around the volatile access).
One notable exception is Visual Studio, where, with default settings, every volatile write has release semantics and every volatile read has acquire semantics (MSDN), and thus volatiles may be used for inter-thread synchronization. Standard volatile semantics are not applicable to multithreaded programming, although they are sufficient for e.g. communication with a std::signal handler that runs in the same thread when applied to sig_atomic_t variables.
As a final rant: In practice, the only feasible languages for building an OS kernel are usually C and C++. Given that, I would like provisions in the 2 standards for 'telling the compiler to butt out', i.e. to be able to explicitly tell the compiler to not change the 'intent' of the code. The purpose would be to use C or C++ as a portable assembler, to an even greater degree than today.
A somewhat silly code example is worth compiling on e.g. godbolt.org for ARM and x86_64, both gcc, to see that in the ARM case, the compiler generates two __sync_synchronize (HW CPU barrier) operations for the atomic, but not for the volatile variant of the code (uncomment the one you want). The point being that using atomic gives predictable, portable behavior.
#include <inttypes.h>
#include <atomic>

std::atomic<uint32_t> sensorval;
//volatile uint32_t sensorval;

uint32_t foo()
{
    uint32_t retval = sensorval;
    return retval;
}

int main()
{
    return (int)foo();
}
Godbolt output for ARM gcc 8.3.1:
foo():
push {r4, lr}
ldr r4, .L4
bl __sync_synchronize
ldr r4, [r4]
bl __sync_synchronize
mov r0, r4
pop {r4, lr}
bx lr
.L4:
.word .LANCHOR0
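The barriers in that listing come from the default seq_cst ordering of the implicit conversion. As the comments on this answer point out, a relaxed load asks for atomicity only. A sketch of both variants follows (the function names are mine, and I have not re-checked the resulting asm for this exact compiler and flags):

```cpp
#include <atomic>
#include <cstdint>

std::atomic<std::uint32_t> sensorval{0};

// Implicit conversion is a seq_cst load: this is the form that produced
// the barrier calls in the listing above.
inline std::uint32_t read_seq_cst() {
    return sensorval;
}

// Relaxed load: still atomic and free of data-race UB, but it requests
// no ordering with respect to other memory operations.
inline std::uint32_t read_relaxed() {
    return sensorval.load(std::memory_order_relaxed);
}
```

On normal ISAs the relaxed version is expected to compile to the same plain load as the volatile variant.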
For those who want an X86 example, a colleague of mine, Angus Lepper, graciously contributed this example: godbolt example of bad volatile use on x86_64
[...] volatile on one particular platform/environment/compiler did the same thing as atomic<>, down to the generated machine code, then atomic<> is still much more expressive in its guarantees and furthermore, it is guaranteed to be portable. Moreover, when you can write self-documenting code, then you should do that. – Resignation
[...] atomic<double> with std::memory_order_relaxed. You'll get the same machine code you want on normal ISAs like ARM and x86, but without any UB. And it will guarantee your object is actually sufficiently aligned for atomicity, which plain volatile double might not. (e.g. on 32-bit x86, most ABIs have alignof(double) = 4. They naturally align it when possible, but inside structs they can't. But alignof(atomic<double>) is 8.) – Denmark
[...] volatile for an exit flag happens to work but is data-race UB. It could in theory fail on a hypothetical machine without coherent caches that needs manual flushing for inter-thread visibility. See When to use volatile with multi threading? - all mainstream C++ implementations run threads across cache-coherent shared address space. It's technically not required by ISO C++, but I don't think a high-performance implementation is possible. – Denmark
[...] __sync_synchronize function calls (full memory barrier) to implement seq_cst ordering semantics. -mcpu=cortex-a53 would allow it to use STLR for just a sequential release, even in 32-bit mode. Or you could at least mention that mo_relaxed is basically volatile on normal platforms, but avoids UB and properly respects other memory-ordering stuff. Except for RMW operations being an atomic RMW instead of separate load+store, so x = x + 1 and x+=1 are different things. – Denmark
[...] -mcpu= high enough for it to inline a dsb ish for operations stronger than relaxed. – Denmark
[...] volatile being obsoleted by atomic<T> (but you forgot to mention std::memory_order_relaxed to get similar efficient asm to volatile, because if volatile worked you obviously didn't need seq_cst.) I added an answer that tries to answer the whole question, as well as crapping on volatile. @UlrichEckhardt: unless you manually use .store(std::memory_order_relaxed), your asm will never be as cheap as volatile. But yes, you should in general use that. With relaxed it's as cheap. – Denmark
[...] atomic<bool> with std::memory_order_relaxed load and store. It's always_lock_free on any sane platform and compiles to the same asm as volatile. And isn't "ordered" wrt. other volatile accesses so in theory it could allow more freedom for the optimizer. C++11 made hand-rolled atomics using volatile obsolete. But to get the same efficiency you have to use .store(value, std::memory_order_relaxed) (or at most release on x86), because seq_cst is the default for var = value; – Denmark
[...] std::atomic to have atomicity anyway: if a type isn't atomic, either use mutexes or change your design (maybe a pointer to the object instead of the object). Don't try to have portable code for what is essentially architecture specific optimization and an attempt to have the most efficient codegen possible. State your assumptions and don't be more generic. Still you should use atomic types. – Gnaw
[...] volatile does still work on platforms that previously supported it; see When to use volatile with multi threading? for more detail about why it happens to work even though it's UB, and how exactly atomic with mo_relaxed is equivalent. My answer on this question explains the same thing. – Denmark
[...] struct foo = *volatile_ptr, you can't even make the payload volatile if it has class type, and need compiler-specific memory barrier stuff. But for int64_t you can use volatile. my attempt – Denmark
[...] typedef long aliasing_long __attribute__((may_alias)) to get a long* that works like char*. But casting a ptr to a declared non volatile variable to volatile is something I hadn't read about as a recent change; do you have a link? – Denmark
The only purpose I can imagine is when I have a thread that changes an atomic double or float asynchronously at random points and other threads read these values asynchronously
Yes, this is the only purpose of an atomic regardless of the actual type, be it an atomic bool, char, int, long or whatever.
Whatever usage you have for type, std::atomic<type> is a thread-safe version of it.
Whatever usage you have for a float or a double, std::atomic<float/double> can be written, read or compared in a thread-safe manner.
Saying that std::atomic<float/double> has only rare usages is practically saying that float/double have rare usages.
[...] volatile for thread synchronisation: "...This makes volatile objects suitable for communication with a signal handler, but not with another thread of execution,..." source: en.cppreference.com/w/cpp/language/cv – Ancalin
volatile does not mean atomic. Remove that belief from your mind. – Corruption
[...] std::atomic<T> (translating Java pretend "not a pointer" references to a pointer of course). – Gnaw