std::atomic in a union with another character

I recently read some code that had an atomic and a character in the same union. Something like this

union U {
    std::atomic<char> atomic;
    char character;
};

I am not entirely sure of the rules here, but the code comments said that since a character can alias anything, we can safely operate on the atomic variable if we promise not to change the last few bits of the byte, and the character only uses those last few bits.

Is this allowed? Can we overlay an atomic over a plain character and have them both be active? If so, what happens when one thread is trying to load the value from the atomic and another goes and makes a write to the character (only the last few bits): will the char write be an atomic write? What happens there? Will the cache have to be flushed for the thread that is trying to load the atomic?

(This code looks smelly to me too, and I am not advocating using this. Just want to understand what parts of the above scheme can be defined and under what circumstances)


As requested, the code is doing something like this:

// thread using the atomic
while (!atomic.compare_exchange_weak(old_value, old_value | mask, ...)) { ... }

// thread using the character
character |= 0b1; // set the 1st bit or something
Toname answered 8/2, 2018 at 5:43 Comment(11)
Can you show an example of how this union is used? – Novelty
I don't think I can show the exact code but I can try and get close in the question. I will edit. – Toname
@FantasticMrFox edited – Toname
This seems really wrong. How does each thread know if the variable is set to what they need (i.e. how does the char thread know it is using the union with it last having been set as a char)? – Novelty
@FantasticMrFox here we set the invariant that the thread using the character will only set the last few bits of the atomic, without ever touching any other bits. No operation it does can impact the other bits (which are being used by other threads). – Toname
Every modification effectively loads the whole memory location, modifies some bits, then stores it back, potentially stepping on changes to other bits. The whole idea behind this code is wrong. Writing an answer with more detail, but see #39394350 for details on how read-modify-write atomicity actually works for operations like + or |. – Ritter
@PeterCordes, which part are you saying changes other bits? – Toname
@Curious: the non-atomic character |= 1 on the same memory location as the atomic. It stores an old copy of the other bits. – Ritter
@PeterCordes but even though the load is non-atomic, shouldn't that only be promising to change the LSB of the character? (And note that the other thread is not changing that bit; it is just trying to change the entire byte in such a manner as to not change that bit. Is that different?) – Toname
That's not how CPUs work. Go read #39394350. For threads not to step on each other, all threads modifying the same memory location (which ISO C++ defines carefully) must be using atomic operations. – Ritter
There are two obvious ways to fix this code: 1. Use a struct instead – uses more memory, which may be bad if there are many copies. 2. Use just a std::atomic<char> – theoretically slower than 1. But I bet the author did not even benchmark it. (I.e. 1. vs 2. – there is rarely a reason to benchmark broken code, IMO.) (NOTE: 1. is still subject to false sharing, however, which affects performance.) – Sleeper

the code comments said that since a character can alias anything, we can safely operate on the atomic variable if we promise not to change the last few bits of the byte.

Those comments are wrong. char-can-alias-anything doesn't stop this from being a data race on a non-atomic variable, so it's not allowed in theory, and worse, it's actually broken when compiled by any normal compiler (like gcc, clang, or MSVC) for any normal CPU (like x86).

The unit of atomicity is the memory location, not bits within a memory location. The ISO C++11 standard defines "memory location" carefully, so adjacent elements in a char[] array or a struct are separate locations (and thus it's not a race if two threads write c[0] and c[1] without synchronization). But adjacent bit-fields in a struct are not separate memory locations, and using |= on a non-atomic char aliased to the same address as an atomic<char> is definitely the same memory location, regardless of which bits are set in the right-hand side of the |=.
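
To illustrate the memory-location rule (this example is mine, not from the question): adjacent plain members are separate memory locations, adjacent bit-fields are not.

#include <thread>

struct Separate {
    char a;            // its own memory location
    char b;            // its own memory location
};

struct Packed {
    unsigned lo : 4;   // shares one memory location with hi
    unsigned hi : 4;
};

int main() {
    Separate s{};
    // OK: a and b are distinct memory locations, so unsynchronized
    // writes from two threads are not a data race.
    std::thread t1([&] { s.a = 1; });
    std::thread t2([&] { s.b = 2; });
    t1.join(); t2.join();

    // NOT OK (data-race UB): lo and hi share one memory location, just
    // like the plain char overlaid on the atomic<char> in the question.
    // Packed p{};
    // std::thread u1([&] { p.lo = 1; });
    // std::thread u2([&] { p.hi = 2; });
}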

For a program to be free of data-race UB, if a memory location is written by any thread, all other threads that access that memory location (potentially) simultaneously must do so with atomic operations. (And probably also through the exact same object, i.e. changing the middle byte of an atomic<int> by type-punning to atomic<char> isn't guaranteed to be safe either. On most implementations on hardware similar to "normal" modern CPUs, type-punning to a different atomic type might happen to still be atomic if atomic<int/char> are both lock-free, but memory-ordering semantics might actually be broken, especially if it isn't perfectly overlapping.)

Also, union type-punning in general is not allowed in ISO C++. I think you actually need to pointer-cast to char*, not make unions with char. Union type-punning is allowed in ISO C99, and as a GNU extension in GNU C89 and GNU C++, and also in some other C++ implementations.
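
For what it's worth, a minimal sketch of what the char* aliasing route looks like (my example; note it changes nothing about the data race):

#include <atomic>

char peek(std::atomic<char>& a) {
    // A char* may alias any object, so this read doesn't violate the
    // aliasing rules the way ISO C++ union type-punning does. But it is
    // still a plain, non-atomic access to the same memory location, so it
    // still races with concurrent writers.
    return *reinterpret_cast<char*>(&a);
}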


So that takes care of the theory, but does any of this work on current CPUs? No, it's totally unsafe in practice, too.

character |= 1 will (on normal computers) compile to asm that loads the whole char, modifies the temporary, then stores the value back. On x86 this can all happen within one memory-destination OR instruction if the compiler chooses to do that (which it won't if it also wants the value later). But even so, it's still a non-atomic RMW which can step on modifications to other bits.

Atomicity is expensive and optional for read-modify-write operations, and the only way to set some bits in a byte without affecting others is a read-modify-write on current CPUs. Compilers only emit asm that does it atomically if you specifically request it. (Unlike pure stores or pure loads, which are often naturally atomic. But always use std::atomic to get the other semantics you want...)
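
For contrast, here is a minimal sketch (my names, not the original code) of what a safe version could look like: drop the union and let both threads do their bit-twiddling on the same std::atomic<char> with an atomic RMW. The memory orderings are placeholders; the right choice depends on the surrounding code.

#include <atomic>

std::atomic<char> flags{0};

void set_high_bits(char mask) {
    // Atomic read-modify-write: no other thread's store can land between
    // the load and the store of this operation.
    flags.fetch_or(mask, std::memory_order_relaxed);
}

void set_low_bit() {
    // The "character" thread uses the same atomic object instead of a
    // plain char overlaid in a union.
    flags.fetch_or(0b1, std::memory_order_relaxed);
}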

Consider this sequence of events:

 thread A           |     thread B
 -------------------|--------------
 read tmp=c=0000    |
                    |
                    |     c|=0b1100     # atomically, leaving c = 1100
 tmp |= 1 # tmp=0001|
 store c = tmp      |

Leaving c = 0001, not the 1101 you're hoping for: the non-atomic load/store of the high bits stepped on the modification by thread B.


We get asm that can do exactly that, from compiling the source snippets from the question (on the Godbolt compiler explorer):

void t1(U &v, unsigned mask) {
    // thread using the atomic
    char old_value = v.atomic.load(std::memory_order_relaxed);
    // with memory_order_seq_cst as the default for CAS
    while (!v.atomic.compare_exchange_weak(old_value, old_value | mask)) {}

    // v.atomic |= mask; // would have been easier and more efficient than CAS
}

t1(U&, unsigned int):
    movzx   eax, BYTE PTR [rdi]            # atomic load of the old value
.L2:
    mov     edx, eax
    or      edx, esi                       # esi = mask (register arg)
    lock cmpxchg    BYTE PTR [rdi], dl     # atomic CAS, uses AL implicitly as the expected value, same semantics as C++11 comp_exg seq_cst
    jne     .L2
    ret


void t2(U &v) {
  // thread using the character
  v.character |= 0b1;   // set the 1st bit or something
}

t2(U&):
    or      BYTE PTR [rdi], 1    # NON-ATOMIC RMW of the whole byte.
    ret

It would be straightforward to write a program that ran v.character |= 1 in one thread, and an atomic v.atomic ^= 0b1100000 (or the equivalent with a CAS loop) in another thread.

If this code were safe, you'd always find that an even number of XOR operations modifying only the high bits left them zero. But you wouldn't find that, because the non-atomic OR in the other thread might have stepped on an odd number of XOR operations. Or to make the problem easier to see, use addition with 0x10 or something, so instead of a 50% chance of being right by accident, you only have a 1 in 16 chance of the upper 4 bits being right.
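
A rough sketch of such a test (mine, not from the answer; the iteration count and orderings are arbitrary). Because it deliberately contains the data race, its output is not guaranteed, but on typical hardware it usually shows the high bits getting corrupted:

#include <atomic>
#include <cstdio>
#include <thread>

union U {
    std::atomic<char> atomic;
    char character;
    U() : atomic(0) {}   // make the atomic the active member
};

int main() {
    U u;

    std::thread a([&] {
        // An even number of atomic XORs of the high bits: if the code
        // were safe, they would end up back at zero.
        for (int i = 0; i < 1000000; ++i)
            u.atomic.fetch_xor(0b1100000, std::memory_order_relaxed);
    });
    std::thread b([&] {
        // Non-atomic RMW of the whole byte on the same memory location
        // (this is the data race / UB being demonstrated).
        for (int i = 0; i < 1000000; ++i)
            u.character |= 0b1;
    });
    a.join();
    b.join();

    // Usually prints a non-zero value: the plain |= stored back stale
    // copies of the high bits, losing some of thread a's XORs.
    std::printf("high bits: %#x\n", (unsigned)(u.atomic.load() & 0b1100000));
}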

This is pretty much exactly the same problem as losing counts when one of the increment operations is non-atomic.


Will the cache have to be flushed for the thread that is trying to load the atomic integer?

No, that's not how atomicity works. The problem isn't cache; it's that unless the CPU does something special, nothing stops other CPUs from reading or writing the location between when you load the old value and when you store the updated value. You'd have the same problem on a multi-core system with no cache.

Of course, all systems do use cache, but cache is coherent so there's a hardware protocol (MESI) that stops different cores from simultaneously having conflicting values. When a store commits to L1D cache, it becomes globally visible. See Can num++ be atomic for 'int num'? for the details.

Ritter answered 8/2, 2018 at 10:27 Comment(3)
Is there any means of recognizing implementations where code which is guarded by atomic flags may pass the address of an atomic char to code that expects to access storage via "ordinary" means, at times when it's known to be impossible for any other competing accesses to take place? From what I understand, there are implementations where atomic<char> is not representation-compatible with char, and on those, using an ordinary char* to access an atomic char couldn't work even if there were no contention, but on most implementations they are representation-compatible. – Leisaleiser
@supercat: C++20 std::atomic_ref<T> solves most of the problem of allowing non-atomic accesses to objects when that's safe (see the sketch after these comments). I don't know of a portable safe way to do atomic<char> * accesses to the bytes of other atomic objects, and IDK if GNU C __atomic_load_n(char *, mem_order) is strict-aliasing safe. – Ritter
On something like an 8088 or 80386 system that uses interrupts for task switching, in the absence of DMA or caching, many instructions like XOR WORD [_thing],1 will behave in atomic fashion. Whether a compiler would process thing ^= 1; using that instruction, rather than MOV AX,[_thing] / XOR AX,1 / MOV [_thing],AX, would depend upon the compiler, but having the code generator process ^= as a distinct operator, rather than decomposing it into separate xor and assignment operators, would be so basic as to hardly even be an "optimization". – Leisaleiser
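
(For reference, a minimal C++20 std::atomic_ref sketch of the mixed-access pattern discussed in the comments above; the phase structure and names are made up for illustration.)

#include <atomic>

char storage = 0;   // a plain char, not std::atomic<char>

void concurrent_phase(char mask) {
    // While other threads may be touching storage, every access goes
    // through std::atomic_ref, which makes this an atomic RMW.
    std::atomic_ref<char> ref(storage);
    ref.fetch_or(mask, std::memory_order_relaxed);
}

void exclusive_phase() {
    // When we know no other thread can access storage (e.g. guarded by
    // some other synchronization), ordinary non-atomic access is fine.
    storage |= 0b1;
}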
