the code comments said that since a character can alias anything, we can safely operate on the atomic variable if we promise not to change the last few bits of the byte.
Those comments are wrong. char
-can-alias-anything doesn't stop this from being a data race on a non-atomic variable, so it's not allowed in theory, and worse, it's actually broken when compiled by any normal compiler (like gcc, clang, or MSVC) for any normal CPU (like x86).
The unit of atomicity is the memory location, not bits within a memory location. The ISO C++11 standard defines "memory location" carefully, so adjacent elements in a char[]
array or a struct are separate locations (and thus it's not a race if two threads write c[0]
and c[1]
without synchronization). But adjacent bit-fields in a struct are not separate memory locations, and using |=
on a non-atomic char
aliased to the same address as an atomic<char>
is definitely the same memory location, regardless of which bits are set in the right-hand side of the |=
.
For a program to be free of data-race UB, if a memory location is written by any thread, all other threads that access that memory location (potentially) simultaneously must do so with atomic operations. (And probably also through the exact same object, i.e. changing the middle byte of an atomic<int>
by type-punning to atomic<char>
isn't guaranteed to be safe either. On most implementations on hardware similar to "normal" modern CPUs, type-punning to a different atomic
type might happen to still be atomic if atomic<int/char>
are both lock-free, but memory-ordering semantics might actually be broken, especially if it isn't perfectly overlapping.
Also, union type-punning in general is not allowed in ISO C++. I think you actually need to pointer-cast to char*
, not make unions with char
. Union type-punning is allowed in ISO C99, and as a GNU extension in GNU C89 and GNU C++, and also in some other C++ implementations.
So that takes care of the theory, but does any of this work on current CPUs? No, it's totally unsafe in practice, too.
character |= 1
will (on normal computers) compile to asm that loads the whole char
, modifies the temporary, then stores the value back. On x86 this can all happen within one memory-destination or
instruction if the compiler chooses to do that (which it won't if it also wants the value later). But even so, it's still a non-atomic RMW which can step on modifications to other bits.
Atomicity is expensive and optional for read-modify-write operations, and the only way to set some bits in a byte without affecting others is a read-modify-write on current CPUs. Compilers only emit asm that does it atomically if you specifically request it. (Unlike pure stores or pure loads, which are often naturally atomic. But always use std::atomic
to get the other semantics you want...)
Consider this sequence of events:
thread A | thread B
-------------------|--------------
read tmp=c=0000 |
|
| c|=0b1100 # atomically, leaving c = 1100
tmp |= 1 # tmp=1 |
store c = tmp
Leaving c
= 1, not the 1101
you're hoping for. i.e. the non-atomic load/store of the high bits stepped on the modification by thread B.
We get asm that can do exactly that, from compiling the source snippets from the question (on the Godbolt compiler explorer):
void t1(U &v, unsigned mask) {
// thread using the atomic
char old_value = v.atomic.load(std::memory_order_relaxed);
// with memory_order_seq_cst as the default for CAS
while (!v.atomic.compare_exchange_weak(old_value, old_value | mask)) {}
// v.atomic |= mask; // would have been easier and more efficient than CAS
}
t1(U&, unsigned int):
movzx eax, BYTE PTR [rdi] # atomic load of the old value
.L2:
mov edx, eax
or edx, esi # esi = mask (register arg)
lock cmpxchg BYTE PTR [rdi], dl # atomic CAS, uses AL implicitly as the expected value, same semantics as C++11 comp_exg seq_cst
jne .L2
ret
void t2(U &v) {
// thread using the character
v.character |= 0b1; // set the 1st bit or something
}
t2(U&):
or BYTE PTR [rdi], 1 # NON-ATOMIC RMW of the whole byte.
ret
It would be straightforward to write a program that ran v.character |= 1
in one thread, and an atomic v.atomic ^= 0b1100000
(or the equivalent with a CAS loop) in another thread.
If this code was safe, you'd always find that an even number of XOR operations modifying only the high bits left them zero. But you wouldn't find that, because the non-atomic or
in the other thread might have stepped on an odd number of XOR operations. Or to make the problem easier to see, use addition with 0x10
or something, so instead of a 50% chance of being right by accident, you only have a 1 in 16 chance of the upper 4 bits being right.
This is pretty much exactly the same problem as losing counts when one of the increment operations is non-atomic.
Will the cache have to be flushed for the thread that is trying to load the atomic integer?
No, that's not how atomicity works. The problem isn't cache, it's that unless the CPU does something special, nothing stops other CPUs from reading or writing the location between when you load the old value and when you store the update value. You'd have the same problem on a multi-core system with no cache.
Of course, all systems do use cache, but cache is coherent so there's a hardware protocol (MESI) that stops different cores from simultaneously having conflicting values. When a store commits to L1D cache, it becomes globally visible. See Can num++ be atomic for 'int num'? for the details.
+
or|
. – Rittercharacter |= 1
on the same memory location as the atomic. It stores an old copy of the other bits. – Ritterstruct
instead – uses more memory, which may be bad if there are many copies. 2. Use just astd::atomic<char>
– theoretically slower than 1. But I bet the author did not even benchmark it. (I.e. 1. vs 2. – there is rarely a reason to benchmark broken code, IMO.) (NOTE: 1. is still subject to false sharing, however, which affects performance.) – Sleeper