C++20 std::atomic<float> and std::atomic<double> specializations

C++20 includes specializations for atomic<float> and atomic<double>. Can anyone here explain what practical purpose this is good for? The only purpose I can imagine is when I have a thread that changes an atomic double or float asynchronously at random points and other threads read these values asynchronously (but a volatile double or float should in fact do the same on most platforms). But the need for this should be extremely rare. I think this rare case couldn't justify inclusion in the C++20 standard.

Synonymous answered 3/11, 2019 at 14:5 Comment(11)
It is not rare; floating-point types are special on many architectures. They could be handled by a co-processor, could live in dedicated CPU registers, and double can be too large to be atomic by default.Ratcliffe
Don't use volatile for thread synchronisation: "...This makes volatile objects suitable for communication with a signal handler, but not with another thread of execution,..." source: en.cppreference.com/w/cpp/language/cvAncalin
volatile does not mean atomic. Remove that belief from your mind.Corruption
C++ doesn't specify what volatile does exactly, and there is no ordering of the visibility to other threads. But for certain purposes on many platforms this isn't an issue.Synonymous
It's not rare at all. A simple common case is matrix-vector multiplication, that requires updates for output vector elements. With sparse matrices, the row-wise mapping of matrix elements to threads is not always the best option. Just imagine a joint matrix-vector + transposed-matrix-vector multiplication (BiCG) and that you want to iterate over matrix elements only once. (The only problem is that floating-point operations are mostly performed by SIMD units, and these do not support atomic operations, e.g., on x64.)Mam
@DouglasQuaid: First of all, if you come from Java, the meaning of volatile is completely different there. Second, do NOT use volatile as a totally misinterpreted substitute for atomics and proper pthread or C++ mutexes!Dipterous
@ErikAlapää To a first approximation, a Java volatile scalar variable of type T can be translated to std::atomic<T> (translating Java's pretend "not a pointer" references to a pointer, of course).Gnaw
Daniel, synchronizing on an atomic<floattype> while calculating a matrix multiplication is the wrong and inefficient way. The right way to parallelize matrix multiplication is to partition row-wise on the matrix and the transposed matrix to be multiplied.Synonymous
Erik, volatile has a historically defined meaning and compilers adhere to that, although there's no explicit consistency model given in the C++ standard.Synonymous
@DouglasQuaid I described the working of volatile and when to (not) use volatile.Gnaw
Atomic float operations are extremely useful for Monte-Carlo computing. They are the C++20 feature I rely upon the most :P They're not useful in your use cases, but they may be in others.Largish

atomic<float> and atomic<double> have existed since C++11. The atomic<T> template works for arbitrary trivially-copyable T. Everything you could hack up with legacy pre-C++11 use of volatile for shared variables can be done with C++11 atomic<double> with std::memory_order_relaxed.

What doesn't exist until C++20 are atomic RMW operations like x.fetch_add(3.14); or, for short, x += 3.14. (The related question Why isn't atomic double fully implemented asks why not.) Those member functions were only available in the atomic integer specializations, so you could only load, store, exchange, and CAS on float and double, as for arbitrary T like class types.

See Atomic double floating point or SSE/AVX vector load/store on x86_64 for details on how to roll your own with compare_exchange_weak, and how that (and pure load, pure store, and exchange) compiles in practice with GCC and clang for x86. (Not always optimal, gcc bouncing to integer regs unnecessarily.) Also for details on lack of atomic<__m128i> load/store because vendors won't publish real guarantees to let us take advantage (in a future-proof way) of what current HW does.
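
For illustration, here's a minimal sketch of that roll-your-own approach: emulating fetch_add on atomic<double> with a compare_exchange_weak retry loop, which was all portable ISO C++ offered before C++20. (The helper name is made up; this is the generic CAS-loop idiom, not any particular library's API.)

#include <atomic>

// Pre-C++20 workaround: emulate fetch_add for atomic<double> with a CAS loop.
double fetch_add_cas(std::atomic<double>& a, double arg) {
    double expected = a.load(std::memory_order_relaxed);
    // On failure, compare_exchange_weak reloads `expected` with the value
    // another thread stored, so the loop retries with fresh data.
    while (!a.compare_exchange_weak(expected, expected + arg)) {
    }
    return expected;   // the old value, matching fetch_add semantics
}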

These new specializations provide maybe some efficiency (on non-x86) and convenience with fetch_add and fetch_sub (and the equivalent += and -= overloads). Only those 2 operations are supported, not fetch_mul or anything else. See the current draft of 31.8.3 Specializations for floating-point types, and cppreference std::atomic.

It's not like the committee went out of their way to introduce new FP-relevant atomic RMW member functions like fetch_mul, min, max, or even absolute value or negation, which is ironically easier in asm: just a bitwise AND or XOR to clear or flip the sign bit, doable with x86's lock and / lock xor if the old value isn't needed. Actually, since carry-out from the MSB doesn't matter, 64-bit lock xadd can implement fetch_xor with 1ULL<<63, assuming of course IEEE754-style sign/magnitude FP. It's similarly easy on LL/SC machines that can do a 4-byte or 8-byte fetch_xor, and they can easily keep the old value in a register.

So the one thing that could be done significantly more efficiently in x86 asm than in portable C++ without union hacks (atomic bitwise ops on FP bit patterns) still isn't exposed by ISO C++.
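
As a sketch of the kind of workaround alluded to above: store the FP bit pattern in an atomic integer and flip the sign bit with fetch_xor. This assumes IEEE754 binary64 and uses C++20 std::bit_cast for the punning (pre-C++20 you'd need memcpy or the union hack); on x86 the fetch_xor can compile to a single lock xor.

#include <atomic>
#include <bit>       // std::bit_cast (C++20)
#include <cstdint>

// Atomic negation of a double kept as its raw bit pattern.
// IEEE754 binary64: the sign bit is the MSB, so fetch_xor flips the sign.
std::atomic<std::uint64_t> bits{std::bit_cast<std::uint64_t>(1.5)};

void atomic_negate() {
    bits.fetch_xor(std::uint64_t{1} << 63, std::memory_order_relaxed);
}

double current_value() {
    return std::bit_cast<double>(bits.load(std::memory_order_relaxed));
}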

It makes sense that the integer specializations don't have fetch_mul: integer add is much cheaper, typically 1 cycle latency, the same level of complexity as atomic CAS. But for floating point, multiply and add are both quite complex and typically have similar latency. Moreover, if atomic RMW fetch_add is useful for anything, I'd assume fetch_mul would be, too. Again unlike integer where lockless algorithms commonly add/sub but very rarely need to build an atomic shift or mul out of a CAS. x86 doesn't have memory-destination multiply so has no direct HW support for lock imul.

It seems like this is more a matter of bringing atomic<double> up to the level you might naively expect (supporting .fetch_add and sub like integers), not of providing a serious library of atomic RMW FP operations. Perhaps that makes it easier to write templates that don't have to check for integral, just numeric, types?

Can anyone here explain what practical purpose this is good for?

For pure store / pure load, maybe some global scale factor that you want to be able to publish to all threads with a simple store? And readers load it before every work unit or something. Or just as part of a lockless queue or stack of double.
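
A minimal sketch of that publish-a-scale-factor pattern (names made up): one writer occasionally stores a new factor, and workers re-read it once per work unit. Pure load and store, so this already worked with C++11 atomic<double>.

#include <atomic>

std::atomic<double> scale_factor{1.0};   // hypothetical shared tuning knob

void publish(double s) {                 // writer thread, occasionally
    scale_factor.store(s, std::memory_order_relaxed);
}

double process(double x) {               // reader threads, once per work unit
    return x * scale_factor.load(std::memory_order_relaxed);
}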

It's not a coincidence that it took until C++20 for anyone to say "we should provide fetch_add for atomic<double> in case anyone wants it."

Plausible use-case: to manually multi-thread the sum of an array (instead of using #pragma omp parallel for simd reduction(+:my_sum_variable) or a standard <numeric> algorithm like std::reduce with a C++17 parallel execution policy).

The parent thread might start with atomic<double> total = 0; and pass it by reference to each thread. Then threads do *totalptr += sum_region(array+TID*size, size) to accumulate the results, instead of having a separate output variable for each thread and collecting the results in the caller. That's not bad for contention unless all threads finish at nearly the same time (which is not unlikely, but it's at least a plausible scenario).
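
A sketch of that accumulation pattern under stated assumptions: C++20 fetch_add is available, and nthreads divides size evenly to keep the example short. Each thread sums its region without atomics and does exactly one atomic RMW to publish its partial result.

#include <atomic>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

double parallel_sum(const double* array, std::size_t size, unsigned nthreads) {
    std::atomic<double> total{0.0};
    std::vector<std::thread> pool;
    std::size_t chunk = size / nthreads;   // assume it divides evenly
    for (unsigned tid = 0; tid < nthreads; ++tid) {
        pool.emplace_back([&, tid] {
            const double* begin = array + tid * chunk;
            double partial = std::accumulate(begin, begin + chunk, 0.0);
            total.fetch_add(partial, std::memory_order_relaxed);   // C++20
        });
    }
    for (auto& t : pool) t.join();
    return total.load();
}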


If you just want separate load and separate store atomicity like you're hoping for from volatile, you already have that with C++11.

Don't use volatile for threading: use atomic<T> with mo_relaxed

See When to use volatile with multi threading? for details on mo_relaxed atomic vs. legacy volatile for multithreading. volatile data races are UB, but it does work in practice as part of roll-your-own atomics on compilers that support it, with inline asm needed if you want any ordering wrt. other operations, or if you want RMW atomicity instead of separate load / ALU / separate store. All mainstream CPUs have coherent cache/shared memory. But with C++11 there's no reason to do that: std::atomic<> obsoleted hand-rolled volatile shared variables.

At least in theory. In practice some compilers (like GCC) still have missed optimizations for atomic<double> / atomic<float> even for just simple load and store. (And the new C++20 overloads aren't implemented yet on compilers available on Godbolt.) atomic<integer> is fine, though, and optimizes as well as volatile or plain integer + memory barriers.

In some ABIs (like 32-bit x86), alignof(double) is only 4. Compilers normally align it by 8, but inside structs they have to follow the ABI's struct-packing rules, so an under-aligned volatile double is possible. Tearing will be possible in practice if it splits a cache-line boundary, or, on some AMD CPUs, an 8-byte boundary. atomic<double> instead of volatile can plausibly matter for correctness on some real platforms, even when you don't need atomic RMW, e.g. this G++ bug, which was fixed by increasing alignment using alignas() in the std::atomic<> implementation for objects small enough to be lock_free.
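
You can observe the alignment difference directly; a tiny check whose output depends on the target ABI (on a 32-bit x86 ABI it can print 4 vs. 8):

#include <atomic>
#include <cstdio>

int main() {
    // Plain double may be under-aligned inside structs on some ABIs,
    // while atomic<double> requires full 8-byte alignment.
    std::printf("%zu vs %zu\n", alignof(double), alignof(std::atomic<double>));
}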

(And of course there are platforms where an 8-byte store isn't naturally atomic so to avoid tearing you need a fallback to a lock. If you care about such platforms, a publish-occasionally model should use a hand-rolled SeqLock or atomic<float> if atomic<double> isn't always_lock_free.)
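
If you need to rule out the locking fallback at build time, a one-line check (is_always_lock_free is C++17) does it:

#include <atomic>

static_assert(std::atomic<double>::is_always_lock_free,
              "atomic<double> would take a lock here: consider a SeqLock or atomic<float>");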


You can get the same efficient code-gen (without extra barrier instructions) from atomic<T> using mo_relaxed as you can with volatile. Unfortunately in practice, not all compilers have efficient atomic<double>. For example, GCC9 for x86-64 copies from XMM to general-purpose integer registers.

#include <atomic>

volatile double vx;
std::atomic<double> ax;
double px; // plain x

void FP_non_RMW_increment() {
    px += 1.0;
    vx += 1.0;     // equivalent to vx = vx + 1.0
    ax.store( ax.load(std::memory_order_relaxed) + 1.0, std::memory_order_relaxed);
}

#if __cplusplus > 201703L    // C++20 (the final standard defines __cplusplus == 202002L)
// C++20 only, not yet supported by libstdc++ or libc++ at the time of writing
void atomic_RMW_increment() {
    ax += 1.0;           // seq_cst
    ax.fetch_add(1.0, std::memory_order_relaxed);   
}
#endif

Godbolt GCC9 for x86-64, gcc -O3. (Also included an integer version)

FP_non_RMW_increment():
        movsd   xmm0, QWORD PTR .LC0[rip]   # xmm0 = double 1.0 

        movsd   xmm1, QWORD PTR px[rip]        # load
        addsd   xmm1, xmm0                     # plain x += 1.0
        movsd   QWORD PTR px[rip], xmm1        # store

        movsd   xmm1, QWORD PTR vx[rip]
        addsd   xmm1, xmm0                     # volatile x += 1.0
        movsd   QWORD PTR vx[rip], xmm1

        mov     rax, QWORD PTR ax[rip]      # integer load
        movq    xmm2, rax                   # copy to FP register
        addsd   xmm0, xmm2                     # atomic x += 1.0
        movq    rax, xmm0                   # copy back to integer
        mov     QWORD PTR ax[rip], rax      # store

        ret

clang compiles it efficiently, with the same move-scalar-double load and store for ax as for vx and px.

Fun fact: C++20 apparently deprecates vx += 1.0. Perhaps this is to help avoid confusion between separate load and store like vx = vx + 1.0 vs. atomic RMW? To make it clear there are 2 separate volatile accesses in that statement?

<source>: In function 'void FP_non_RMW_increment()':
<source>:9:8: warning: compound assignment with 'volatile'-qualified left operand is deprecated [-Wvolatile]
    9 |     vx += 1.0;     // equivalent to vx = vx + 1.0
      |     ~~~^~~~~~


Note that x = x + 1 is not the same thing as x += 1 for atomic<T> x: the former loads into a temporary, adds, then stores as a separate operation (with sequential consistency for both the load and the store), while the latter is a single atomic RMW.
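
A minimal illustration of the difference (using atomic<int> so it compiles even pre-C++20):

#include <atomic>

std::atomic<int> x{0};

void not_equivalent() {
    x = x + 1;   // seq_cst load, plain add, separate seq_cst store:
                 // an increment by another thread in between can be lost
    x += 1;      // one atomic RMW (lock add on x86): increments are never lost
}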

Denmark answered 20/11, 2019 at 1:34 Comment(0)

EDIT: Adding Ulrich Eckhardt's comment to clarify: 'Let me try to rephrase that: Even if volatile on one particular platform/environment/compiler did the same thing as atomic<>, down to the generated machine code, then atomic<> is still much more expressive in its guarantees and furthermore, it is guaranteed to be portable. Moreover, when you can write self-documenting code, then you should do that.'

Volatile has the following two effects:

  1. Prevents compilers from caching the value in a register.
  2. Prevents optimizing away accesses to that value when they seem unnecessary from the POV of your program.

See also Understanding volatile keyword in c++

TLDR;

Be explicit about what you want.

  • Do not rely on 'volatile' to do what you want, if 'what' is not the original purpose of volatile, e.g. enabling external sensors or DMA to change a memory address without the compiler interfering.
  • If you want an atomic, use std::atomic.
  • If you want to disable strict-aliasing optimizations, do as the Linux kernel does and disable them on your compiler, e.g. gcc -fno-strict-aliasing.
  • If you want to disable other kinds of compiler optimizations, use compiler intrinsics or write explicit assembly for e.g. ARM or x86_64.
  • If you want 'restrict' keyword semantics like in C, use the corresponding restrict intrinsic in C++ on your compiler, if available (see the sketch after this list).
  • In short, do not rely on compiler- and CPU-family-dependent behavior if constructs provided by the standard are clearer and more portable. Use e.g. godbolt.org to compare the assembler output if you believe your 'hack' is more efficient than doing it the right way.
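
For the restrict bullet above, a sketch using the common compiler extension (GCC and clang spell it __restrict__, MSVC uses __restrict; it is not standard C++):

// Promising the compiler that out and in never alias lets it vectorize
// the loop and keep values in registers instead of reloading after stores.
void scale(float* __restrict__ out, const float* __restrict__ in,
           int n, float factor) {
    for (int i = 0; i < n; ++i)
        out[i] = in[i] * factor;
}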

From std::memory_order

Relationship with volatile

Within a thread of execution, accesses (reads and writes) through volatile glvalues cannot be reordered past observable side-effects (including other volatile accesses) that are sequenced-before or sequenced-after within the same thread, but this order is not guaranteed to be observed by another thread, since volatile access does not establish inter-thread synchronization.

In addition, volatile accesses are not atomic (concurrent read and write is a data race) and do not order memory (non-volatile memory accesses may be freely reordered around the volatile access).

One notable exception is Visual Studio, where, with default settings, every volatile write has release semantics and every volatile read has acquire semantics (MSDN), and thus volatiles may be used for inter-thread synchronization. Standard volatile semantics are not applicable to multithreaded programming, although they are sufficient for e.g. communication with a std::signal handler that runs in the same thread when applied to sig_atomic_t variables.

As a final rant: In practice, the only feasible languages for building an OS kernel are usually C and C++. Given that, I would like provisions in the two standards for 'telling the compiler to butt out', i.e. to be able to explicitly tell the compiler not to change the 'intent' of the code. The purpose would be to use C or C++ as a portable assembler, to an even greater degree than today.

A somewhat silly code example is worth compiling on e.g. godbolt.org for ARM and x86_64, both with gcc, to see that in the ARM case the compiler generates two __sync_synchronize (HW CPU barrier) operations for the atomic, but not for the volatile variant of the code (uncomment the one you want). The point is that using atomic gives predictable, portable behavior.

#include <inttypes.h>
#include <atomic>

std::atomic<uint32_t> sensorval;
//volatile uint32_t sensorval;  // swap with the atomic above to compare codegen

uint32_t foo()
{
    uint32_t retval = sensorval;  // seq_cst load: emits barriers on ARM
    return retval;
}
int main()
{
    return (int)foo();
}

Godbolt output for ARM gcc 8.3.1:

foo():
  push {r4, lr}
  ldr r4, .L4
  bl __sync_synchronize
  ldr r4, [r4]
  bl __sync_synchronize
  mov r0, r4
  pop {r4, lr}
  bx lr
.L4:
  .word .LANCHOR0

For those who want an X86 example, a colleague of mine, Angus Lepper, graciously contributed this example: godbolt example of bad volatile use on x86_64

Easterling answered 4/11, 2019 at 10:5 Comment(34)
In fact, when you don't care about the memory ordering of the volatile access, it is the same as a relaxed atomic access on most platforms.Synonymous
@DouglasQuaid And what is your point? Why not use atomic if that is the semantics you need? If necessary, specifying memory order explicitly.Dipterous
I just wanted to say that volatile often works identically.Synonymous
@DouglasQuaid: At least, I and many other code reviewers would reject code that just used volatile as a poor man's atomic. But see my rant about portable assembler in my updated answer.Dipterous
There's nothing poor in the cases where you could do the same. E.g. when you want to have a value which can be read or written absolutely asynchronously, you don't need any synchronization, and you can be sure that the data type is atomic by nature on the platform.Synonymous
@DouglasQuaid: I disagree, I strongly object to using volatile when better tools are available, e.g. explicit use of the memory model, explicit mutex, or atomics. But of course, as I say in the answer, if you use volatile ONLY for allowing e.g. a uint32_t to change because of DMA or a sensor change, that is OK. Only that.Dipterous
Added info from cppreference.com about volatile and ordering.Dipterous
I said that volatile might also be appropriate in cases where the memory model doesn't count. And some compilers specify special volatile handling; with some compiler options a read has acquire behaviour and a write has release behaviour. Depending on what platforms you target, this is sufficient.Synonymous
@DouglasQuaid No. Why rely on compiler specific constructs if the standard provides clearer, portable constructs? Show an example.Dipterous
Let me try to rephrase that: Even if volatile on one particular platform/environment/compiler did the same thing as atomic<>, down to the generated machine code, then atomic<> is still much more expressive in its guarantees and furthermore, it is guaranteed to be portable. Moreover, when you can write self-documenting code, then you should do that.Resignation
@UlrichEckhardt: Good way of phrasing it! If you don't mind, I am adding it to my answer, with attribution to you.Dipterous
When you have a thread that does a larger computation you might periodically poll a bool volatile variable which is set by another thread to signal that the computation should be prematurely ended. For such a case a bool volatile fulfills the same purpose as an atomic<bool>.Synonymous
No, that's not wrong. For this purpose an atomic hasn't the least advantage over a volatile. What I said works in fact with any compiler and any machine.Synonymous
This is not a debate. Using volatile like that is using volatile for a purpose where the language offers better alternatives.Dipterous
atomic<bool> wouldn't be better in this case.Synonymous
@DouglasQuaid: Discussion is over.Dipterous
@UlrichEckhardt Using a volatile scalar as an "exit" or "cancel" flag is portable.Gnaw
"a colleague of mine, Angus Lepper, graciously contributed this example" Volatile is often the bad tool even in drivers because you need to make everything externally accessible volatile, not just the final ready flag; that means everything is clumsy and inefficient. C/C++ suck as high level asm.Gnaw
@DouglasQuaid: If you don't need ordering, just atomicity, use atomic<double> with std::memory_order_relaxed. You'll get same machine-code you want on normal ISAs like ARM and x86, but without any UB. And it will guarantee your object is actually sufficiently aligned for atomicity, which plain volatile double might not. (e.g. on 32-bit x86, most ABIs have alignof(double) = 4. They naturally align it when possible, but inside structs they can't. But alignof(atomic<double>) is 8.)Denmark
@curiousguy: usually you're making arguments based on the pure standard, not on what happens to work in real life. volatile for an exit flag happens to work but is data-race UB. It could in theory fail on a hypothetical machine without coherent caches that needs manual flushing for inter-thread visibility. See When to use volatile with multi threading? - all mainstream C++ implementations run threads across cache-coherent shared address space. It's technically not required by ISO C++, but I don't think a high-performance implementation is possible.Denmark
Fun fact: ISO C++ even mentions HW cache coherence in eel.is/c++draft/intro.multithread#intro.races-19 for rules about the modification order of a single atomic object.Denmark
@ErikAlapää : generic ARM seems like a poor choice here, resulting in expensive non-inline __sync_synchronize function calls (full memory barrier) to implement seq_cst ordering semantics. -mcpu=cortex-a53 would allow it to use STLR for just a sequential release, even in 32-bit mode. Or you could at least mention that mo_relaxed is basically volatile on normal platforms, but avoids UB and properly respects other memory-ordering stuff. Except for RMW operations being an atomic RMW instead of separate load+store, so x = x + 1 and x+=1 are different things.Denmark
Or at least a -mcpu= high enough for it to inline a dsb ish for operations stronger than relaxed.Denmark
Anyway, this barely answers the question. It's a good writeup about volatile being obsoleted by atomic<T> (but you forgot to mention std::memory_order_relaxed to get similar efficient asm to volatile, because if volatile worked you obviously didn't need seq_cst.) I added an answer that tries to answer the whole question, as well as crapping on volatile. @UlrichEckhardt: unless you manually use .store(std::memory_order_relaxed), your asm will never be as cheap as volatile. But yes, you should in general use that. With relaxed it's as cheap.Denmark
@ErikAlapää: When you want to have a completely asynchronous value passed to another thread, maybe just a bool, having a volatile without fences is just what you want. The only thing that might not be portable is that the atomicity of writing or reading that value depends on its width on some platforms; e.g. a bare uint64_t is not atomic on IA-32.Synonymous
@DouglasQuaid: That is incorrect. What you want in C++11 is atomic<bool> with std::memory_order_relaxed load and store. It's always_lock_free on any sane platform and compiles to the same asm as volatile. And isn't "ordered" wrt. other volatile accesses so in theory it could allow more freedom for the optimizer. C++11 made hand-rolled atomics using volatile obsolete. But to get the same efficiency you have to use .store(value, std::memory_order_relaxed) (or at most release on x86), because seq_cst is the default for var = value;Denmark
@DouglasQuaid "the atomicity of writing or reading that value is dependent on its width on some platforms" yes but you probably should never use std::atomic to have atomicity anyway: if a type isn't atomic, either use mutexes or change you design (maybe a pointer to the object instead of the object). Don't try to have portable code for what is essentially architecture specific optimization and an attempt to have the most efficient codegen possible. State you assumptions and don't be more generic. Still you should use atomic types.Gnaw
But yes, volatile does still work on platforms that previously supported it; see When to use volatile with multi threading? for more detail about why it happens to work even though it's UB, and how exactly atomic with mo_relaxed is equivalent. My answer on this question explains the same thing.Denmark
@PeterCordes It's a special subset of UB where you can sanely reason w/ arch specific knowledge, which you usually can't do w/ modern compilers.Gnaw
@curiousguy: yes, exactly. And it's behaviour that important projects like the Linux kernel rely on, so at least GCC and clang aren't going to break it any time soon. Another use-case for UB is implementing a SeqLock; you want efficient non-atomic stores / loads, not fallback to a spinlock, for the payload, but detect the possibility of tearing. With C++ unable to do struct foo = *volatile_ptr, you can't even make the payload volatile if it has class type, and need compiler-specific memory barrier stuff. But for int64_t you can use volatile. my attemptDenmark
@curiousguy: Semi-related: the ISO C++ standard mentions in a note that compilers can introduce data-race loads (unlike stores) in asm on platforms where that's safe. But that some HW might exist with race-detection/prevention built in. Or that it might trip up hypothetical race-detection debugging mechanisms.Denmark
@PeterCordes"important projects like the Linux kernel rely on, so at least GCC and clang aren't going to break it any time soon" I really don't think that follows: GCC broke many expectations that were present in many places in linux, including but not limited integer overflow, null ptr dereference, preparing network packets w/ a datatype and sending w/ another, the behavior of some inline asm, and more recently casting a ptr to a declared non volatile variable to volatile.Gnaw
Most of those things are general no-UB optimizations, including strict-aliasing, that potentially gain performance in normal safe code. It's possible to fix them by writing safe code, e.g. using typedef long aliasing_long __attribute__((may_alias)) to get a long* that works like char*. But casting a ptr to a declared non volatile variable to volatile is something I hadn't read about as a recent change; do you have a link?Denmark
Let us continue this discussion in chat.Gnaw

The only purpose I can imagine is when I have a thread that changes an atomic double or float asynchronously at random points and other threads read these values asynchronously

Yes, this is the only purpose of an atomic, regardless of the actual type, be it an atomic bool, char, int, long, or whatever.

Whatever usage you have for a type, std::atomic<type> is a thread-safe version of it. Whatever usage you have for a float or a double, std::atomic<float/double> can be written, read, or compared in a thread-safe manner.

Saying that std::atomic<float/double> has only rare usages is practically saying that float/double have rare usages.

Backflow answered 4/11, 2019 at 20:4 Comment(0)
