Can a bool read/write operation be not atomic on x86? [duplicate]
Say we have two threads: one reads a bool in a loop, and another can toggle it at certain times. Personally I think this should be atomic, because sizeof(bool) in C++ is 1 byte and you can't read or write part of a byte, but I want to be 100% sure.

So yes or no?

EDIT:

Also for future reference, does the same apply to int?

Dumanian answered 31/1, 2013 at 11:35 Comment(10)
Isn't anything smaller than the word size of the underlying architecture both atomic and also less efficient than it could be?Shikoku
stackoverflow.com/questions/8037289/… suggests it's non-atomic.Transference
stackoverflow.com/questions/8517969/… suggests it's atomic "in most machines".Transference
stackoverflow.com/questions/9585966/… asks the same question, but the answers stick to the C++ layer.Transference
Read Intel's Software Developer's Manual; it specifies exactly under which circumstances which kinds of reads/writes are atomic (e.g. even 64-bit writes are atomic if properly aligned). Note how things change if your type does not occupy all of its bits, that is, if your bool is part of a bitfield.Ancilla
By the way, I'm not aware of any requirement in the standard that mandates sizeof(bool).Transference
@LightnessRacesinOrbit: 5.3.3 even has a note about how they are implementation defined.Ancilla
Yes, sizeof(bool) is implementation defined. I have worked on architectures where sizeof(bool) == 4.Chieftain
semi-related Can modern x86 hardware not store a single byte to memory? - in asm, byte stores are atomic without disturbing surrounding bytes. In C++ you still need std::atomic<bool> with at least memory_order_relaxed, not necessarily the default mo_seq_cst, to get safe asm.Kurr
I can easily imagine an architecture that can only access memory a word at a time, and so writing a single byte would actually be a read-modify-write operation. Of course, you wouldn't implement bool as a single byte on an architecture like that.Scintillate
19

It all depends on what you actually mean by the word "atomic".

Do you mean "the final value will be updated in one go"? Yes, on x86 that's definitely guaranteed for a byte value, and for any correctly aligned value up to at least 64 bits. Or do you mean "if I set this to true (or false), no other thread will read a different value after I've set it"? That's not quite such a certainty: you need a "lock" prefix to guarantee that.

Ryswick answered 31/1, 2013 at 11:46 Comment(5)
"if I set this to true (or false), no other thread will read a different value after I've set it". I think the question is pretty clear. This latter interpretation doesn't have anything to do with atomicity.Wellesley
@jberryman: The problem comes with caches as well as the compiler optimising the read of the memory. The b = false; in some thread, does not guarantee that all other threads, in their next case of if (b) ..., will pick up that b is false. This requires that the compiler hasn't optimised the access to b into tmp = b; ... if (tmp) ... [where tmp is a register]. Depending on the code inside the thread, there are situations when a compiler WILL do this.Ryswick
no other thread will read a different value after I've set it - mfence or a lock prefix are only needed to nail down the meaning of "after". Memory is coherent on all x86 systems, so after the store instruction eventually commits to L1d cache, no other thread can read the old value. You only need barriers to implement a seq-cst store and make sure this thread doesn't do any other loads before the store is globally visible. It definitely will become globally visible on its own very soon. Can I force cache coherency on a multicore x86 CPU?Kurr
TL:DR: a barrier doesn't explicitly flush or write-back cache, it only stalls this thread until the value commits from the store buffer to this core's L1d cache (and thus becomes globally visible.)Kurr
"if I set this to true (or false), no other thread will read a different value after I've set it" (that's not quite such a certainty - you need a "lock" prefix to guarantee that). - bool can have any hardware representation, and another thread can read any, perhaps even "partial", state of the bool, but the value read is then interpreted as true or false, so there cannot be any "different value". In this sense a read of a bool is always "atomic": we always get true or false and never anything else. A lock prefix is only needed for RMW operations, or when we need ordering between this bool and other memory.Snowy
80

There are three separate issues that "atomic" types in C++11 address:

  1. tearing: a read or write involves multiple bus cycles, and a thread switch occurs in the middle of the operation; this can produce incorrect values.

  2. cache coherence: a write from one thread updates its processor's cache, but does not update global memory; a read from a different thread reads global memory, and doesn't see the updated value in the other processor's cache.

  3. compiler optimization: the compiler shuffles the order of reads and writes under the assumption that the values are not accessed from another thread, resulting in chaos.

Using std::atomic<bool> ensures that all three of these issues are managed correctly. Not using std::atomic<bool> leaves you guessing, with, at best, non-portable code.

Elene answered 31/1, 2013 at 11:53 Comment(11)
Isn't there also CPU instructions (or memory accesses) reordering at run time? A compiler may reorder loads and stores, but a CPU also can do that.Decarbonize
@RomanKruglov: on x86, only StoreLoad reordering is possible (preshing.com/20120515/memory-reordering-caught-in-the-act), so only seq-cst stores need extra ordering beyond blocking compile-time reordering. (e.g. mov+mfence, or better xchg to implement seq-cst stores.) In general on other ISAs, yes loads, stores, and RMWs may need extra barriers if they're not done with mo_relaxed.Kurr
Cache coherency is not the problem; normal systems are already coherent (using MESI or a variant). What atomic<T> actually needs to do is stop the compiler from keeping values in registers, which are thread private. (MCU programming - C++ O2 optimization breaks while loop). Also, for seq-cst stores on x86, to stall the current thread until the store becomes globally visible (e.g. by using xchg or mfence) before later loads can run. Global visibility would happen on its own, but potentially after later loads.Kurr
See also Why is integer assignment on a naturally aligned variable atomic on x86? and Can num++ be atomic for 'int num'?Kurr
See also Myths Programmers Believe about CPU Caches re: manual coherency. C++ is designed around the assumption of coherent shared memory so all you need to do is make sure the store or load actually happens in asm, not keeping a value in a register. On a hypothetical machine with non-coherent shared memory, every synchronizes-with would have to flush everything (or need a lot of tracking), but I'm not aware of any C++ implementations for standard threads with non-coherent shared memory.Kurr
I agree with your conclusion: use atomic<bool> with at least memory_order_relaxed, if not the default seq_cst. But some of your reasoning for why doesn't hold up. Point 2 is highly misleading because no real CPUs are like that.Kurr
@PeterCordes: I don't think 1 holds up either, as threads are stalled until stores are committed to L1d cache. IIUC, reads and writes on current x86 systems are, from a C++ point of view, atomic operations.Danicadanice
Also, if atomic handles all three cases above, then why does the cppreference page for condition_variable say "Even if the shared variable is atomic, it must be modified while owning the mutex to correctly publish the modification to the waiting thread."? Can someone elaborate with some examples? @PeterCordesDanicadanice
@user179156: I think point 1 is answering the general case of std::atomic<T>, not just T=bool like the question asked about. On some ISAs, compilers might have reason to store a variable as two halves (Which types on a 64-bit computer are naturally atomic in gnu C and gnu C++? -- meaning they have atomic reads, and atomic writes has an AArch64 example of stp to store a constant with both halves the same).Kurr
@user179156: On x86, actual tearing is unlikely to be a problem in practice with normal compilers for types < register width, but you do need compiler support for 8-byte load/store in 32-bit mode (x87, MMX, SSE), or 128-bit in 64-bit mode (via movaps on CPUs with AVX, since Intel recently documented that feature bit as guaranteeing 128-bit atomicity.) And you certainly need at least volatile (better atomic) to avoid weird stuff like invented reads that could turn one atomic load into two, potentially seeing different values for the same local variable. lwn.net/Articles/793253Kurr
@user179156: Re: condition variables: IIRC, it's about avoiding a race with threads entering a sleep: they check the variable and make a system call while holding the mutex, to make sure no other thread changes the variable and does .notify_all before that thread starts to wait. So it's stuck waiting for the next notify, missing this one.Kurr
6

x86 only guarantees word-aligned reads and writes of word size. It does not guarantee any other operations, unless explicitly atomic. Plus, of course, you have to convince your compiler to actually issue the relevant reads and writes in the first place.

Schmooze answered 31/1, 2013 at 11:46 Comment(3)
does x86 guarantee cache coherency?Columnar
@bigxiao: yes, every normal SMP system regardless of ISA guarantees cache coherency, and uses MESI (or some variant) to achieve it. Part of what atomic<T> does is stop the compiler from keeping values in registers instead of memory, because registers are thread-private. But memory is always coherent. You only need barriers if you want ordering between loads and stores, e.g. to make the current thread wait until a store is visible before doing later reads. Making a store globally visible always happens as quickly as possible regardless of barriers. (commit from store buffer to L1d)Kurr
x86 guarantees slightly more, e.g. byte load/store is always atomic, and 16-bit loads/stores that don't cross a 4-byte boundary are also atomic. And dword (32-bit) aligned loads/stores are atomic. Also, on modern x86 (AMD, and Intel P6 and later), cached loads/stores of any width are atomic as long as they don't cross an 8-byte boundary. Why is integer assignment on a naturally aligned variable atomic on x86? So yes, on x86 all std::atomic<> has to do for pure loads / pure stores is make sure values are naturally aligned, and not optimized away.Kurr

© 2022 - 2024 — McMap. All rights reserved.