Is incrementing an int effectively atomic in specific cases?

In general, for int num, num++ (or ++num), as a read-modify-write operation, is not atomic. But I often see compilers, for example GCC, generate the following code for it (try here):

void f()
{
  int num = 0;
  num++;
}
f():
        push    rbp
        mov     rbp, rsp
        mov     DWORD PTR [rbp-4], 0
        add     DWORD PTR [rbp-4], 1
        nop
        pop     rbp
        ret

Since line 5 of the assembly (add DWORD PTR [rbp-4], 1), which corresponds to num++, is one instruction, can we conclude that num++ is atomic in this case?

And if so, does it mean that so-generated num++ can be used in concurrent (multi-threaded) scenarios without any danger of data races (i.e. we don't need to make it, for example, std::atomic<int> and impose the associated costs, since it's atomic anyway)?
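
For reference, the std::atomic alternative I'm asking whether we can avoid would be something like this sketch:

#include <atomic>

std::atomic<int> num{0};

void g()
{
  num++;   // atomic read-modify-write; well-defined even with concurrent access
}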

UPDATE

Notice that the question is not whether increment is atomic (it's not, and that was and is the opening line of the question). Rather, it's whether it can be in certain scenarios, i.e. whether the one-instruction nature can in some cases be exploited to avoid the overhead of the lock prefix. And, as the accepted answer mentions in its section about uniprocessor machines, and as this answer, the conversation in its comments, and others explain, it can (although not with C or C++).

Inning answered 8/9, 2016 at 14:39 Comment(10)
Who told you that add is atomic?Laryngeal
given that one of the features of atomics is prevention of specific kinds of reordering during optimization, no, regardless of the atomicity of the actual operationCouvade
It could be that your compiler sees that num++ can be optimized to that because you don't use the return value of num++.Troop
it still needs to load num, increase it, then write it back to memoryExpansion
I would also like to point out that if this is atomic on your platform, there is no guarantee that it will be on another platform. Be platform independent and express your intention by using a std::atomic<int>.Tobytobye
During the execution of that add instruction, another core could steal that memory address from this core's cache and modify it. On an x86 CPU, the add instruction needs a lock prefix if the address needs to be locked in cache for the duration of the operation.Letsou
It is possible for any operation to happen to be "atomic." All you have to do is get lucky and never happen to execute anything which would reveal that it is not atomic. Atomic is only valuable as a guarantee. Given that you're looking at assembly code, the question is whether that particular architecture happens to provide you the guarantee and whether the compiler provides a guarantee that that's the assembly level implementation they choose.Colicweed
In this case, your function as it stands is perfectly safe, because other threads have no way to access the stack-based num variable. But that is probably not what you are asking. (It would perhaps be better if your num were a global variable.)Sibbie
Appearing to work is a valid response to invoking undefined behavior.Sherrard
@Lmn: SO questions should use code-blocks for code, not pictures of text. The compactness of the image was nice, but there are enough reasons not to use images that you should let Joseph's edit stand. (Blind users with screen-readers, corporate firewalls that block images, etc.) In other cases, search engines being able to search the code is relevant; not so much here but the other reasons are sufficient.Humiliating
H
260

This is absolutely what C++ defines as a Data Race that causes Undefined Behaviour, even if one compiler happened to produce code that did what you hoped on some target machine. You need to use std::atomic for reliable results, but you can use it with memory_order_relaxed if you don't care about ordering of this operation wrt. other operations in this thread on other variables. See below for some example code and asm output using fetch_add.

The fact that data races on non-atomic variables are UB is what lets C++ compilers still optimize plain int aggressively, making fast code for variables that aren't shared (and for single-threaded programs).


But first, the assembly language part of the question:

Since num++ is one instruction (add dword [num], 1), can we conclude that num++ is atomic in this case?

Memory-destination instructions (other than pure stores) are read-modify-write operations that happen in multiple internal steps. No architectural register is modified, but the CPU has to hold the data internally while it sends it through its ALU. The actual register file is only a small part of the data storage inside even the simplest CPU, with latches holding outputs of one stage as inputs for another stage, etc., etc.

Memory operations from other CPUs can become globally visible between the load and store. I.e. two threads running add dword [num], 1 in a loop would step on each other's stores. (See @Margaret's answer for a nice diagram). After 40k increments from each of two threads, the counter might have only gone up by ~60k (not 80k) on real multi-core x86 hardware.
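
Here is a minimal sketch (my own, not from the question) that tends to reproduce that lost-update effect when built without optimization. Note that it is formally a data race on a plain int, i.e. undefined behaviour, and an optimizing compiler may instead keep the counter in a register for the whole loop:

#include <iostream>
#include <thread>

int counter = 0;   // plain, non-atomic

int main()
{
    auto work = [] { for (int i = 0; i < 40000; ++i) counter++; };
    std::thread t1(work), t2(work);
    t1.join();
    t2.join();
    std::cout << counter << '\n';   // frequently well below 80000
}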


"Atomic", from the Greek word meaning indivisible, means that no observer can see the operation as separate steps. Happening physically / electrically instantaneously for all bits simultaneously is just one way to achieve this for a load or store, but that's not even possible for an ALU operation. I went into a lot more detail about pure loads and pure stores in my answer to Atomicity of loads and stores on x86, while this answer focuses on read-modify-write.

The lock prefix can be applied to many read-modify-write (memory destination) instructions to make the entire operation atomic with respect to all possible observers in the system (other cores and DMA devices, not an oscilloscope hooked up to the CPU pins). That is why it exists. (See also this Q&A).

So lock add dword [num], 1 is atomic. A CPU core running that instruction would keep the cache line pinned in Modified state in its private L1 cache from when the load reads data from cache until the store commits its result back into cache. This prevents any other cache in the system from having a copy of the cache line at any point from load to store, according to the rules of the MESI cache coherency protocol (or the MOESI/MESIF versions of it used by multi-core AMD/Intel CPUs, respectively). Thus, operations by other cores appear to happen either before or after, not during.

Without the lock prefix, another core could take ownership of the cache line and modify it after our load but before our store, so that the other core's store would become globally visible in between our load and store. Several other answers get this wrong, and claim that without lock you'd get conflicting copies of the same cache line. This can never happen in a system with coherent caches.

(If a locked instruction operates on memory that spans two cache lines, it takes a lot more work to make sure the changes to both parts of the object stay atomic as they propagate to all observers, so no observer can see tearing. The CPU might have to lock the whole memory bus until the data hits memory. Don't misalign your atomic variables!)

Note that the lock prefix also turns an instruction into a full memory barrier (like MFENCE), stopping all run-time reordering and thus giving sequential consistency. (See Jeff Preshing's excellent blog post. His other posts are all excellent, too, and clearly explain a lot of good stuff about lock-free programming, from x86 and other hardware details to C++ rules.)


On a uniprocessor machine, or in a single-threaded process, a single RMW instruction actually is atomic without a lock prefix. The only way for other code to access the shared variable is for the CPU to do a context switch, which can't happen in the middle of an instruction. So a plain dec dword [num] can synchronize between a single-threaded program and its signal handlers, or in a multi-threaded program running on a single-core machine. See the second half of my answer on another question, and the comments under it, where I explain this in more detail.


Back to C++:

It's totally bogus to use num++ without telling the compiler that you need it to compile to a single read-modify-write implementation:

;; Valid compiler output for num++
mov   eax, [num]
inc   eax
mov   [num], eax

This is very likely if you use the value of num later: the compiler will keep it live in a register after the increment. So even if you check how num++ compiles on its own, changing the surrounding code can affect it.

(If the value isn't needed later, inc dword [num] is preferred; modern x86 CPUs will run a memory-destination RMW instruction at least as efficiently as using three separate instructions. Fun fact: gcc -O3 -m32 -mtune=i586 will actually emit the separate load/inc/store sequence, because (Pentium) P5's superscalar pipeline didn't decode complex instructions to multiple simple micro-operations the way P6 and later microarchitectures do. See Agner Fog's instruction tables / microarchitecture guide for more info, and the tag wiki for many useful links (including Intel's x86 ISA manuals, which are freely available as PDF)).
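
As a hypothetical illustration of the "value used later" point (the function name is mine): using the result is exactly the situation where the compiler will likely split the increment into separate instructions and keep the value in a register.

int num;

int increment_and_use()
{
    num++;            // likely: mov eax, [num] / add eax, 1 / mov [num], eax
    return num * 2;   // the incremented value stays live in a register; no reload needed
}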


Don't confuse the target memory model (x86) with the C++ memory model

Compile-time reordering is allowed. The other part of what you get with std::atomic is control over compile-time reordering, to make sure your num++ becomes globally visible only after some other operation.

Classic example: Storing some data into a buffer for another thread to look at, then setting a flag. Even though x86 does acquire loads/release stores for free, you still have to tell the compiler not to reorder by using flag.store(1, std::memory_order_release);.

You might be expecting that this code will synchronize with other threads:

// int flag;  is just a plain global, not std::atomic<int>.
flag--;           // Pretend this is supposed to be some kind of locking attempt
modify_a_data_structure(&foo);    // doesn't look at flag, and the compiler knows this.  (Assume it can see the function def).  Otherwise the usual don't-break-single-threaded-code rules come into play!
flag++;

But it won't. The compiler is free to move the flag++ across the function call (if it inlines the function or knows that it doesn't look at flag). Then it can optimize away the modification entirely, because flag isn't even volatile.
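
For contrast, here is a minimal sketch (my own, with a hypothetical payload variable, and flag made std::atomic) of the same publish pattern done correctly: the release store orders the data write before the flag, and the acquire load in the reader orders the data read after it.

#include <atomic>

int payload;                        // plain data, written before the flag is set
std::atomic<int> flag{0};

void producer()
{
    payload = 42;                              // ordinary store
    flag.store(1, std::memory_order_release);  // earlier stores can't sink below this
}

int consumer()
{
    while (flag.load(std::memory_order_acquire) == 0) {}  // spin until published
    return payload;                                       // guaranteed to observe 42
}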

(And no, C++ volatile is not a useful substitute for std::atomic. std::atomic does make the compiler assume that values in memory can be modified asynchronously, similar to volatile, but there's much more to it than that. (In practice there are similarities between volatile int and std::atomic with mo_relaxed for pure-load and pure-store operations, but not for RMWs). Also, volatile std::atomic<int> foo is not necessarily the same as std::atomic<int> foo, although current compilers don't optimize atomics (e.g. 2 back-to-back stores of the same value), so volatile atomic wouldn't change the code-gen.)

Defining data races on non-atomic variables as Undefined Behaviour is what lets the compiler still hoist loads and sink stores out of loops, and many other optimizations for memory that multiple threads might have a reference to. (See this LLVM blog for more about how UB enables compiler optimizations.)


As I mentioned, the x86 lock prefix is a full memory barrier, so using num.fetch_add(1, std::memory_order_relaxed); generates the same code on x86 as num++ (the default is sequential consistency), but it can be much more efficient on other architectures (like ARM). Even on x86, relaxed allows more compile-time reordering.

This is what GCC actually does on x86, for a few functions that operate on a std::atomic global variable.

See the source + assembly language code formatted nicely on the Godbolt compiler explorer. You can select other target architectures, including ARM, MIPS, and PowerPC, to see what kind of assembly language code you get from atomics for those targets.

#include <atomic>
std::atomic<int> num;
void inc_relaxed() {
  num.fetch_add(1, std::memory_order_relaxed);
}
void inc_seq_cst() {
  num++;                 // the default ordering for operator++ is seq_cst
}

int load_num() { return num; }            // Even seq_cst loads are free on x86
void store_num(int val){ num = val; }
void store_num_release(int val){
  num.store(val, std::memory_order_release);
}
void store_num_relaxed(int val){
  num.store(val, std::memory_order_relaxed);
}
// Can the compiler collapse multiple atomic operations into one? No, it can't.
# g++ 6.2 -O3, targeting x86-64 System V calling convention. (First argument in edi/rdi)
inc_relaxed():
    lock add        DWORD PTR num[rip], 1      #### Even relaxed RMWs need a lock. There's no way to request just a single-instruction RMW with no lock, for synchronizing between a program and signal handler for example. :/ There is atomic_signal_fence for ordering, but nothing for RMW.
    ret
inc_seq_cst():
    lock add        DWORD PTR num[rip], 1
    ret
load_num():
    mov     eax, DWORD PTR num[rip]
    ret
store_num(int):
    mov     DWORD PTR num[rip], edi
    mfence                          ##### seq_cst stores need an mfence
    ret
store_num_release(int):
    mov     DWORD PTR num[rip], edi
    ret                             ##### Release and weaker don't.
store_num_relaxed(int):
    mov     DWORD PTR num[rip], edi
    ret

Notice how MFENCE (a full barrier) is needed after sequential-consistency stores. x86 is strongly ordered in general, but StoreLoad reordering is allowed. Having a store buffer is essential for good performance on a pipelined out-of-order CPU. Jeff Preshing's Memory Reordering Caught in the Act shows the consequences of not using MFENCE, with real code to show reordering happening on real hardware.
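
A sketch of that litmus test in C++ terms (my own code, not from Preshing's article): with relaxed ordering, each core's store can sit in its store buffer while the subsequent load executes, so both threads can read 0; with the default seq_cst stores, the compiler emits mov + mfence (or an xchg) and that outcome is forbidden.

#include <atomic>

std::atomic<int> X{0}, Y{0};
int r1, r2;

void thread1() { X.store(1, std::memory_order_relaxed); r1 = Y.load(std::memory_order_relaxed); }
void thread2() { Y.store(1, std::memory_order_relaxed); r2 = X.load(std::memory_order_relaxed); }

// Run concurrently from a zeroed state: r1 == 0 && r2 == 0 is an allowed result here,
// but becomes impossible if both stores use the default memory_order_seq_cst.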


Re: discussion in comments on @Richard Hodges' answer about compilers merging std::atomic num++; num-=2; operations into one num--; instruction:

A separate Q&A on this same subject: Why don't compilers merge redundant std::atomic writes?, where my answer restates a lot of what I wrote below.

Current compilers don't actually do this (yet), but not because they aren't allowed to. C++ WG21/P0062R1: When should compilers optimize atomics? discusses the expectation that many programmers have that compilers won't make "surprising" optimizations, and what the standard can do to give programmers control. N4455 discusses many examples of things that can be optimized, including this one. It points out that inlining and constant-propagation can introduce things like fetch_or(0) which may be able to turn into just a load() (but still has acquire and release semantics), even when the original source didn't have any obviously redundant atomic ops.

The real reasons compilers don't do it (yet) are: (1) nobody's written the complicated code that would allow the compiler to do that safely (without ever getting it wrong), and (2) it potentially violates the principle of least surprise. Lock-free code is hard enough to write correctly in the first place. So don't be casual in your use of atomic weapons: they aren't cheap and don't optimize much. It's not always easy to avoid redundant atomic operations with std::shared_ptr<T>, though, since there's no non-atomic version of it (although one of the answers here gives an easy way to define a shared_ptr_unsynchronized<T> for gcc).


Getting back to num++; num-=2; compiling as if it were num--: Compilers are allowed to do this, unless num is volatile std::atomic<int>. If a reordering is possible, the as-if rule allows the compiler to decide at compile time that it always happens that way. Nothing guarantees that an observer could see the intermediate values (the num++ result).

I.e. if an ordering in which nothing becomes globally visible between these operations is compatible with the ordering requirements of the source (according to the C++ rules for the abstract machine, not the target architecture), the compiler can emit a single lock dec dword [num] instead of lock inc dword [num] / lock sub dword [num], 2.

num++; num-- can't disappear, because it still has a Synchronizes With relationship with other threads that look at num, and it's both an acquire-load and a release-store which disallows reordering of other operations in this thread. For x86, this might be able to compile to an MFENCE, instead of a lock add dword [num], 0 (i.e. num += 0).

As discussed in P0062, more aggressive merging of non-adjacent atomic ops at compile time can be bad (e.g. a progress counter only gets updated once at the end instead of every iteration), but it can also help performance without downsides (e.g. skipping the atomic inc / dec of ref counts when a copy of a shared_ptr is created and destroyed, if the compiler can prove that another shared_ptr object exists for the entire lifespan of the temporary.)
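
A hypothetical sketch of that shared_ptr case (names are mine): the refcount ++/-- pair for tmp could in principle be elided, because sp provably keeps the object alive for the temporary's whole lifetime, but current compilers emit both atomic operations.

#include <memory>

void observe(int) {}   // stand-in for some read-only use

void peek(const std::shared_ptr<int>& sp)
{
    std::shared_ptr<int> tmp = sp;  // atomic increment of the control block's refcount
    observe(*tmp);
}                                   // atomic decrement when tmp is destroyed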

Even num++; num-- merging could hurt fairness of a lock implementation when one thread unlocks and re-locks right away. If it's never actually released in the asm, even hardware arbitration mechanisms won't give another thread a chance to grab the lock at that point.


With current gcc6.2 and clang3.9, you still get separate locked operations even with memory_order_relaxed in the most obviously optimizable case. (Godbolt compiler explorer so you can see if the latest versions are different.)

void multiple_ops_relaxed(std::atomic<unsigned int>& num) {
  num.fetch_add( 1, std::memory_order_relaxed);
  num.fetch_add(-1, std::memory_order_relaxed);
  num.fetch_add( 6, std::memory_order_relaxed);
  num.fetch_add(-5, std::memory_order_relaxed);
  //num.fetch_add(-1, std::memory_order_relaxed);
}

multiple_ops_relaxed(std::atomic<unsigned int>&):
    lock add        DWORD PTR [rdi], 1
    lock sub        DWORD PTR [rdi], 1
    lock add        DWORD PTR [rdi], 6
    lock sub        DWORD PTR [rdi], 5
    ret
Humiliating answered 8/9, 2016 at 17:30 Comment(37)
"[using separate instructions] used to be more efficient ... but modern x86 CPUs once again handle RMW operations at least as efficiently" -- it still is more efficient in the case where the updated value will be used later in the same function and there's a free register available for the compiler to store it in (and the variable isn't marked volatile, of course). This means that it is highly likely that whether the compiler generates a single instruction or multiple for the operation depends on the rest of the code in the function, not just the single line in question.Canvass
@PeriataBreatta: yes, good point. In asm you could use mov eax, 1 / xadd [num], eax (with no lock prefix) to implement post-increment num++, but that's not what compilers do.Humiliating
@PeterCordes, where you write "...not that it physically / electrically happened simultaneously", did you mean to say "... instantaneously"?Inning
Good God man, an exegesis on atomicity -- well done! Consider marking for the community?Krysta
@LeoHeinsaar: by simultaneously, I meant that all 32 bits were signalled simultaneously on a parallel bus. I wrote this quickly so I could get it posted while people were still looking at the question. I had another significant edit about to submit, but clicking inside the text box somehow refreshed my window with the 3rd-party edit and lost my changes :( Not Peter's Mortensen's fault, of course: it's SO's fault. Thanks for the edit. Although I did capitalize C++ Undefined Behaviour and Data Race on purpose as if they were proper nouns, rather than just the literal meaning of the phrase.Humiliating
@PeterCordes Ah ok, that simultaneously. Sad to hear about the lost changes, I hope you'll still add them.Inning
@DavidC.Rankin: If you have any edits you'd like to make, feel free. I don't want to make this CW, though. It's still my work (and my mess :P). I'll tidy up some after my Ultimate [frisbee] game :)Humiliating
Oh no, exegesis (defined as a critical interpretation of text) is a big compliment on what you wrote. (along with the upvote) I couldn't begin to improve on it. Sorry if it wasn't understood that way.Krysta
@DavidC.Rankin: I understood the first part, thanks :) But I thought you meant mark it community wiki. Was that not it?Humiliating
If not community wiki, then maybe a link on the appropriate tag wiki. (both the x86 and atomic tags?). It's worth additional linkage rather than a hopeful return by a generic search on S.O. (If I knew better where it should fit in that regard, I'd do it. I'll have to dig further into the do's & don't's of tag wiki linkage)Krysta
@DavidC.Rankin: Sounds like a good plan. I try to resist adding my own answers to tag wikis, since many of the SO questions linked from the x86 wiki are already my answers. (I wrote that tag wiki, and looking for other answers would have been more work when I know I'd already written an answers that attempted to be canonical and something you might link from a tag wiki.) But with your endorsement that this is worth pointing people at, I will happily link this in the lock-free, stdatomic, and maybe other relevant tag wikis :)Humiliating
@PeterCordes excellent answer, this is the type of contributions I'm happy to read and to learn from. Thank you.Tagmemic
"The compiler is free to move the flag++ across the function call" : flag++ has a side effect -- shouldn't it always be sequenced after the function call?Spatola
@Spencer: The "as-if" rule (and the fact that Data Races on non-atomic objects are UB) allows it to do this in the special case I specified: (if it inlines the function or knows that it doesn't look at flag). Sequence points aren't the same thing as the Synchronizes With relationship. With no Synchronizes With operations (e.g. atomics or std::mutex), the compiler can do anything that doesn't change the result of single-threaded execution.Humiliating
@LeoHeinsaar: I finally got around to making the edits I was thinking of. The last section is so long that I should really post it as a separate self-Q&A about whether compilers can optimize atomics :PHumiliating
@DavidC.Rankin: I put links to this in the x86, lock-free, and stdatomic tag wikis. Those are all wikis I've edited before. (The large x86 tag wiki is at least 95% my work). IDK if it's appropriate for the C++ tag wiki, since none of its FAQ questions are about stdatomic.Humiliating
As always - great answer! Good distinction between coherence and atomicity (where some others got it wrong)Katushka
In case of the atomic instruction, what is the reason the cacheline is acquired in modified state? Why isn't it sufficient to acquire it in exclusive state? What if the atomic operation didn't make a change? Excellent post btw.Demand
@pveentjer: AFAIK, an atomic RMW always changes the cache line to Modified, even a "failed" lock cmpxchg. The manual still talks about legacy #LOCK external signal pin semantics like The processor never produces a locked read without also producing a locked write. so IDK. For any other store (including the store part of a locked instruction), see What specifically marks an x86 cache line as dirty - any write, or is an explicit change required? re: the fact that hardware doesn't do "silent store" optimizationHumiliating
If I understand it correctly: in theory the cacheline could start out as exclusive and then be upgraded to modified. In practice nobody does this, for practical reasons but not for correctness reasons. Correct?Demand
@pveentjer: Yes, the RFO will bring it into Exclusive state; it only becomes Modified when a store commits to it. This will happen basically right away if a store is at the end of the SB waiting to commit when the line arrives. I didn't re-read my answer but I assume I glossed over this detail.Humiliating
You should consider writing a book on this topic. Most books are very abstract (so not specific to x86). Or they focus too much on the programming aspect of assembly and hence ignore CPU design. And performance oriented books are way too lightweight on this topic. There is AFAIK no cohesive book on this topic.Demand
One more question about atomic instructions. Atomic instructions are non blocking because once the cacheline is locked, no matter if the thread is suspended, the update operation will complete and release the locked cacheline. But there are 3 flavors of non blocking algorithms; wait free, lock free and obstruction free. If I remember correctly, atomic instructions are wait free. But what guarantees that every atomic instruction completes in a finite number of steps? What prevents one CPU from being starved?Demand
@pveentjer: You should repost that as a separate question, so I can post these comments as an answer. I think it's a good enough question to be worth asking & answering. Only the fairness of hardware arbitration for exclusive ownership of cache lines prevents starvation. Wait-free vs. lock-free analysis in terms of the number of software steps is somewhat silly when the time for one software step varies tremendously because only one thread can "own" the cache line at any one time.Humiliating
@pveentjer: In normal use-cases (not every core hammering on the line at once), there's not much difference between a CAS or LL/SC retry loop (lock-free) which very rarely has to retry vs. a HW-supported atomic increment. But in the extreme contention case, HW-arbitrated atomics avoid wasting time on accesses that make no progress. See anandtech.com/show/15578/… which benchmarks that extreme contention case on ARM Graviton2 with LL/SC vs. ARMv8.1 true CAS showing a big gain.Humiliating
@PeterCordes great answer! Wanted to follow up on this: > "Several other answers get this wrong, and claim that without lock you'd get conflicting copies of the same cache line. This can never happen in a system with coherent caches." This statement seems to contradict this whitepaper. In 4.1 - 4.3, this paper states that without memory barriers CPU cores can actually read stale (incoherent) cache lines because of the invalidate queues. Can you comment on this?Tims
@МаксФедотов: I've never seriously tried to grok the "invalidate queue" mental model of how things work. I think it's just another way of describing the fact that a CPU core can execute a store at any time, but can't commit data from its store buffer to coherent L1d cache until it receives an answer to its RFO (read for ownership). The store buffer isn't cache; it's private to the CPU core. It's what causes StoreLoad reordering (and with store forwarding, other weird effects). See Can a speculatively executed CPU branch contain opcodes that access RAM?Humiliating
@МаксФедотов: I think one important benefit of the way I prefer to describe it is that it comes clear that all memory reordering is within each CPU core separately, in how it orders its accesses to coherent shared memory. Memory barriers are local and just order a core's accesses to its own L1d (which is coherent). However, PowerPC needs a more complex model because store-forwarding of retired stores is possible across SMT threads within a physical core; it's not multi-copy atomic. This allows IRIW reordering.Humiliating
@МаксФедотов: Ok, I had a look at that doc since you did quote a section number. It is talking about store buffers and exclusive ownership, and the invalidate queue stuff seems to be about load ordering. I think you're talking about the example in section 4.3. Yeah, of course you need a barrier in the read side to make sure you observe coherent cache in the correct order. I'm not sure what the advantage of this "invalidate queue" mental model is vs. a CPU pipeline that can hit-under-miss in cache (and a weakly-ordered memory model unlike x86 that allows it to make that visible).Humiliating
@PeterCordes thanks for taking the time to answer. My point was that it seems that without memory barriers the so-called cache coherence protocols can actually leave cache in an incoherent state. That I need to use memory barriers to achieve the ordering I need is obvious to me. What was a nasty surprise is the fact that I also need to use memory barriers to ensure cache coherence on the read side of things.Tims
@МаксФедотов: Ok, I see. No, that's not true. Without read-side memory barriers, you're just observing coherent global order in an order other than program order. I'm pretty sure I understand enough about CPUs with coherent caches to say that a core of an SMP system using MESI can't actually modify its L1d cache until it has exclusive ownership of that line (no other valid copies of the line in the system). Any implication to the contrary is a problem with the "invalidation queue" way of explaining things (or a misunderstanding of it?), not a violation of cache coherence.Humiliating
@PeterCordes I'm sorry to take your time. Problem is - either the white paper is wrong, or MESI-like protocols allow for some cache incoherence. The example in 4.3 clearly demonstrates that you can have exclusive ownership of the cache line and later modify it, and after modification another CPU may read from cache its stale copy of the same line. That for me is the definition of incoherence - 2 different and "valid" copies of the same cache line.Tims
@МаксФедотов: but is the other core truly reading from cache after invalidation, or is it using the result of a load that actually took data from coherent cache earlier, while the line was still valid? (Accurately defining time and "before" / "after" are a problem because messages take time to propagate...) Is that interpretation enough to reconcile Paul McKenney's whitepaper with standard MESI + OoO exec? (I've never seen any CPU-architecture deep dive quote a size for anything called an "invalidation queue", only for load and store buffer sizes, ROB, RS, and # of outstanding L1/L2 reqs)Humiliating
@PeterCordes the white paper states that it really reads from cache. That is allowed because of those pesky invalidate queues which guarantee only that no further MESI messages will be sent regarding this cache line (so loads are allowed because the cache line is still valid for some time). "However, the CPU need not actually invalidate the cache line before sending the acknowledgement. It could instead queue the invalidate message with the understanding that the message will be processed before the CPU sends any further messages regarding that cache line."Tims
@PeterCordes I've taken enough of your time already. I thought that maybe you would be able to definitively prove that this white paper is wrong, or that this paper will be of interest to you otherwise. In any case it seems that our discussion is already too long for comments. :)Tims
@МаксФедотов: Ok, interesting, thanks for finding that exact wording. I guess that's a possible hardware design, but I'm not sure real CPUs are designed that way. It sounds to me like an alternate mental model for OoO early exec of loads. I'm not in a position to say for sure; might make an interesting SO question on its own, if there isn't already a duplicate. Feel free to quote or paraphrase any of my comments here if you want to ask such a question. I've never seen a CPU vendor advertise a new version having a larger "invalidate queue" for better memory parallelism, only load buffers.Humiliating
@Max: For purposes of ordering, is atomic read-modify-write one operation or two? is somewhat related, and made me realize a problem with one of my suggestions here (if I ever included it in a comment?): having the store side of an LL/SC atomic RMW bypass the store buffer and access L1d directly would require it to be non-speculative. Draining the OoO back-end seems unlikely. And to be fair, I've mostly looked at microarchitecture details for x86 which only has single-instruction RMWs, not LL/SC like ARM and POWER. (Paul McKenney works for IBM.)Humiliating
R
53

Without many complications an instruction like add DWORD PTR [rbp-4], 1 is very CISC-style.

It performs three operations: load the operand from memory, increment it, and store the operand back to memory.
During these operations the CPU acquires and releases the bus twice; in between, any other agent can acquire it too, and this violates atomicity.

AGENT 1          AGENT 2

load X              
inc C
                 load X
                 inc C
                 store X
store X

X is incremented only once.

Ramirez answered 8/9, 2016 at 15:14 Comment(5)
@LeoHeinsaar In order for that to be the case, each memory chip would need its own Arithmetic Logic Unit (ALU). It would, in effect, require that each memory chip was a processor.Mcfall
@LeoHeinsaar: memory-destination instructions are read-modify-write operations. No architectural register is modified, but the CPU has to hold the data internally while it sends it through its ALU. The actual register file is only a small part of the data storage inside even the simplest CPU, with latches holding outputs of one stage as inputs for another stage, etc. etc.Humiliating
@PeterCordes Your comment is exactly the answer I was looking for. Margaret's answer made me suspect that something like that must go on inside.Inning
Turned that comment into a full answer, including addressing the C++ part of the question.Humiliating
@PeterCordes Thanks, very detailed and on all points. It was obviously a data race and therefore undefined behavior by the C++ standard, I was just curious whether in cases where the generated code was what I posted one could assume that that could be atomic etc etc. I also just checked that at least Intel developer manuals very clearly define atomicity with respect to memory operations and not instruction indivisibility, as I assumed: "Locked operations are atomic with respect to all other memory operations and all externally visible events."Inning
M
42

...and now let's enable optimisations:

f():
        rep ret

OK, let's give it a chance:

void f(int& num)
{
  num = 0;
  num++;
  --num;
  num += 6;
  num -=5;
  --num;
}

result:

f(int&):
        mov     DWORD PTR [rdi], 0
        ret

Another observing thread (even ignoring cache synchronisation delays) has no opportunity to observe the individual changes.

compare to:

#include <atomic>

void f(std::atomic<int>& num)
{
  num = 0;
  num++;
  --num;
  num += 6;
  num -=5;
  --num;
}

where the result is:

f(std::atomic<int>&):
        mov     DWORD PTR [rdi], 0
        mfence
        lock add        DWORD PTR [rdi], 1
        lock sub        DWORD PTR [rdi], 1
        lock add        DWORD PTR [rdi], 6
        lock sub        DWORD PTR [rdi], 5
        lock sub        DWORD PTR [rdi], 1
        ret

Now, each modification is:

  1. observable in another thread, and
  2. respectful of similar modifications happening in other threads.

Atomicity is not just at the instruction level; it involves the whole pipeline from the processor, through the caches, to memory and back.

Further info

Regarding the effect of optimisations of updates of std::atomics.

The C++ standard has the 'as-if' rule, by which it is permissible for the compiler to reorder code, and even rewrite code, provided that the outcome has the exact same observable effects (including side-effects) as if it had simply executed your code.

The as-if rule is conservative, particularly involving atomics.

consider:

void incdec(int& num) {
    ++num;
    --num;
}

Because there are no mutex locks, atomics or any other constructs that influence inter-thread sequencing, I would argue that the compiler is free to rewrite this function as a NOP, e.g.:

void incdec(int&) {
    // nada
}

This is because in the C++ memory model, there is no possibility of another thread observing the result of the increment. It would of course be different if num were volatile (it might influence hardware behaviour). But in this case, this function will be the only function modifying this memory (otherwise the program is ill-formed).

However, this is a different ball game:

void incdec(std::atomic<int>& num) {
    ++num;
    --num;
}

num is an atomic. Changes to it must be observable to other threads that are watching. Changes those threads themselves make (such as setting the value to 100 in between the increment and decrement) will have very far-reaching effects on the eventual value of num.

Here is a demo:

#include <atomic>
#include <iostream>
#include <thread>

int main()
{
    for (int iter = 0 ; iter < 20 ; ++iter)
    {
        std::atomic<int> num = { 0 };
        std::thread t1([&] {
            for (int i = 0 ; i < 10000000 ; ++i)
            {
                ++num;
                --num;
            }
        });
        std::thread t2([&] {
            for (int i = 0 ; i < 10000000 ; ++i)
            {
                num = 100;
            }
        });
        
        t2.join();
        t1.join();
        std::cout << num << std::endl;
    }
}

sample output:

99
99
99
99
99
100
99
99
100
100
100
100
99
99
100
99
99
100
100
99
Mcfall answered 8/9, 2016 at 14:55 Comment(24)
SO must have the same bug. I have hit the up vote multiple times but it only shows it happened once :(. Great answer and it is nice to show how it can all come apart on you.Tobytobye
This fails to explain that add dword [rdi], 1 is not atomic (without the lock prefix). The load is atomic, and the store is atomic, but nothing stops another thread from modifying the data between the load and the store. So the store can step on a modification made by another thread. See jfdube.wordpress.com/2011/11/30/understanding-atomic-operations. Also, Jeff Preshing's lock-free articles are extremely good, and he does mention the basic RMW problem in that intro article.Humiliating
I understand the observability issues. I was just curious about a very simple possible case (and assuming that no optimizations were made - the example was obviously optimizable just to keep it simple). Also, by atomicity I meant that the CPU cannot be preempted before the operation takes effect.Inning
@LeoHeinsaar which hyperthread of which core of which CPU? And what about the bus? And does this CPU allow interrupts to pre-empt instruction completion (and pipeline flushing). There is no such thing as an instruction really - it's just a sequence of sub-operations that get pushed into an operation queue.Mcfall
"another observing thread (even ignoring cache synchronisation delays) has no opportunity to observe the individual changes" - is that actually a problem? Even with a std::atomic<int>&, I thought the compiler was free to merge all those operations into one.Lines
@Lines Yah, since there's no guarantee that any other thread would ever be able to observe the intermediate modifications the compiler should be able to optimize the third example the same way it optimized the second example. The code would behave "as if" other threads never happen to be observing the changes at the right time.Satiety
@Lines it can only do this if it can prove that no other thread can possibly observe the atomic variable. In the case of the variable-as-reference, there is no way to prove that there is no other observer (or writer).Mcfall
@RichardHodges: Is there some mechanism I'm not aware of that would let another thread ensure it sees every value num takes? I would think that it could just merge the operations and behave as-if other readers and writers are never looking at a bad time.Lines
@user2357112: I agree, I think the compiler should be allowed to do that, even with the default (sequential_consistency) memory ordering requirement. Having two atomic ops on the same variable not be separated by any loads or stores from other threads is one of the possible orderings that the C++ standard allows, so the compiler is allowed to pick it at compile time (like Ross said). Richard is wrong here; there's nothing an observer could do that will guarantee they see an intermediate state (in pure C++, i.e. excluding debugger-style watch-points). I added a section to my answer about this.Humiliating
What's really going on here is that nobody's implemented this optimization in gcc, because it would be nearly useless and probably more dangerous than helpful. (Principle of least surprise. Maybe someone is expecting a temporary state to be visible sometimes, and is ok with the statistical probability. Or they are using hardware watch-points to interrupt on modification.) lock-free code needs to be carefully crafted, so there won't be anything to optimize. It might be useful to look for it and print a warning, to alert the coder that their code might not mean what they think!Humiliating
@PeterCordes I respectfully disagree with you. A spinlock is a structure in which both threads must be able to observe a change to the variable. If the compiler saw a spinlock being used in a loop, we would not want it to assume that its value never changes in an observable way. The compiler absolutely must not elide the individual reads and writes. Another example: what happens if another thread atomically clears num half way though the sequence of of increments and decrements? That change must be observed by our function or it will change observable behaviour.Mcfall
@RossRidge I don't think that's true. Consider, say, implementing Peterson's algorithm. If you were right, the compiler might look at the code that effectively states flag[0] = true; ...; flag[0] = false; and determine it could just elide the change to flag[0] because there might not be any other thread reading it between the changes. That's clearly not true, but the analysis that would be required to prove that isn't true is beyond anything I've seen in any compiler before. Therefore, compilers must assume that changes to ...Canvass
... atomically-accessed objects are observable in other threads at any point, otherwise they will break common algorithms in ways that would effectively render them useless.Canvass
@PeterCordes demonstration program added under Further Info in my answer.Mcfall
@PeriataBreatta: The key difference is the ... in Peterson's algorithm, which also modifies globally visible state (turn=1 and the critical section the lock is protecting). We're only talking about merging when there's literally nothing between them: num+=1; num-=2. Richard's update only shows how gcc chose to compile it, and doesn't prove that it's required. The code has a race condition. Not a C++-UB data race, just the garden variety race that means the results are unpredictable. 100 is a legal output, and there's nothing that requires that some runs might produce 99.Humiliating
@PeterCordes and I'm sure that this would change observable behaviour in another thread, which might be waiting to see the value go from 0 to 1, even for a moment.Mcfall
That's perhaps a reason for compilers not to implement this (principle of least surprise and so on). Observing that would be possible in practice on real hardware. However, the C++ memory ordering rules don't say anything about any guarantee that one thread's loads mix "evenly" with other thread's ops in the C++ abstract machine. I still think it would be legal, but programmer-hostile.Humiliating
Thought experiment: Consider a C++ implementation on a cooperative multi-tasking system. It implements std::thread by inserting yield points where needed to avoid deadlocks, but not between every instruction. I guess you'd argue that something in the C++ standard requires a yield point between num++ and num--. If you can find a section in the standard that requires that, it would settle this. I'm pretty sure it only requires that no observers can ever see a wrong reordering, which doesn't require a yield there. So I think it's just a quality-of-implementation issue.Humiliating
@PeterCordes 29.6.5 talks about memory being affected. It seems to me that this argues that the read/write must actually happen. However, pedantic as it may be in some circumstances, the standard often leaves important details unsaid.Mcfall
n4140 29.6.5. 2.4 notes that "operations on non-volatile objects may be merged under some conditions". You can use volatile atomic<int> to guarantee the semantics you expect. There are volatile and non-volatile versions of all atomic functions. This doesn't necessarily imply that plain atomic<> objects don't have semantics like volatile, but the comment about merging seems to confirm my interpretation. 1.10.23 says: In fact, most single-threaded program transformations continue to be allowed, since any program that behaves differently as a result must perform an undefined operation.Humiliating
@PeterCordes I fear we are entering a standards "corner case". The effect of volatile is entirely implementation-defined, so we cannot talk reasonably about using volatile to ensure that a write happens in all implementations. In addition, I don't think my random-number generator (above) invokes any undefined behaviour. I think it's reasonable to expect (demand) that there is no guarantee of the same number coming out twice. But we're probably boring people. I'll ask on the std c++ forum.Mcfall
Let us continue this discussion in chat.Humiliating
For the sake of finality, I asked on the std discussion mailing list. This question turned up 2 papers which seem to both concur with Peter, and address concerns that I have about such optimisations: wg21.link/p0062 and wg21.link/n4455 My thanks to Andy who brought these to my attention.Mcfall
@RichardHodges: The Standard is fine if compiler writers use common sense, trying to implement things in the most useful fashion rather than necessarily the most "efficient". If a compiler unrolls a loop 4x, consolidating groups of four updates would be reasonable and useful. If the compiler can estimate how long the loop will take, using a counter in a register to limit updates to a rate that would limit the amount of time doing the updates to e.g. 5% of the loop time may be useful. Consolidating all updates within a six-hour loop until the end, however, would not be so useful.Adorl
S
10

The add instruction is not atomic. It references memory, and two processor cores may have different local caches of that memory.

IIRC the atomic variant of the add instruction is called lock xadd

Snail answered 8/9, 2016 at 14:54 Comment(4)
lock xadd implements C++ std::atomic fetch_add, returning the old value. If you don't need that, the compiler will use the normal memory destination instructions with a lock prefix. lock add or lock inc.Humiliating
add [mem], 1 would still not be atomic on an SMP machine with no cache, see my comments on other answers.Humiliating
See my answer for a lot more details on exactly how it's not atomic. Also the end of my answer on this related question.Humiliating
Also, more fundamentally, no, two cores can't have different values in cache for the same memory; caches are coherent. Please don't spread misinformation about how CPUs work. See also, Myths Programmers Believe about CPU Caches (Java volatile is like C++ std::atomic<> with memory_order_seq_cst). See Margaret's answer for how two cores reading the same value leads to a problem.Humiliating
L
10

Since line 5, which corresponds to num++ is one instruction, can we conclude that num++ is atomic in this case?

It is dangerous to draw conclusions based on "reverse engineering" generated assembly. For example, you seem to have compiled your code with optimization disabled, otherwise the compiler would have thrown away that variable or loaded 1 directly into it without invoking operator++. Because the generated assembly may change significantly based on optimization flags, target CPU, etc., your conclusion is built on sand.

Also, your idea that one assembly instruction means an operation is atomic is wrong as well. This add will not be atomic on multi-CPU systems, even on the x86 architecture.

Laryngeal answered 8/9, 2016 at 14:54 Comment(0)
A
9

On a single-core x86 machine, an add instruction will generally be atomic with respect to other code on the CPU¹. An interrupt can't split a single instruction down the middle.

Out-of-order execution is required to preserve the illusion of instructions executing one at a time in order within a single core, so any instruction running on the same CPU will either happen completely before or completely after the add.

Modern x86 systems are multi-core, so the uniprocessor special case doesn't apply.

If one is targeting a small embedded PC and has no plans to move the code to anything else, the atomic nature of the "add" instruction could be exploited. On the other hand, platforms where operations are inherently atomic are becoming more and more scarce.

(This doesn't help you if you're writing in C++, though. Compilers don't have an option to require num++ to compile to a memory-destination add or xadd without a lock prefix. They could choose to load num into a register and store the increment result with a separate instruction, and will likely do that if you use the result.)
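
For completeness, a sketch of one compiler-specific escape hatch (my own, GNU C + x86 only): inline asm can force a single memory-destination add with no lock prefix, but this is non-portable and still gives no C++-level atomicity guarantee with respect to other cores.

int num;

void inc_single_insn()
{
    asm volatile("addl $1, %0" : "+m"(num));   // one add instruction, no lock prefix
}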


Footnote 1: The lock prefix existed even on original 8086 because I/O devices operate concurrently with the CPU; drivers on a single-core system need lock add to atomically increment a value in device memory if the device can also modify it, or with respect to DMA access.

Adorl answered 8/9, 2016 at 17:15 Comment(20)
It isn't even generally atomic: another thread can update the same variable at the same time, and only one update survives.Boogie
@FUZxxl: Certainly on the 8088 and 80286, interrupts for task switching could occur only between instructions. 80386 added the possibility of page-faults, but interrupted instructions would start over from scratch. I haven't followed all the changes to the inner workings of later chips, but I think the restart-from-scratch approach has been retained.Adorl
Consider a multi-core system. Of course, within one core, the instruction is atomic, but it isn't atomic with respect to the entire system.Boogie
An interrupt happens between two instructions, and can't split a RMW down the middle. This makes it safe on a UP system for synchronizing between the kernel and interrupt handlers, for example (see the 2nd half of my answer here, and part of my answer on this question). But un-locked insns aren't ok for doing an atomic RMW on an MMIO register or device memory. The lock prefix existed before SMP CPUs for this reason: to assert the #LOCK bus signal.Humiliating
@FUZxxl: What were the fourth and fifth words of my answer?Adorl
@PeterCordes: Systems which use DMA to modify the target of a RMW instruction would need "lock", but I find it hard to imagine wanting to do that. I don't think a lock prefix would do anything for most kinds of memory-mapped I/O, since either a register is going to always read the value written (in which case lock would be unnecessary) or may change in response to things other than memory-write requests on the bus (which would generally be the only thing lock would prevent).Adorl
I'm not entirely clear on what lock was actually used for on pre-SMP hardware like 8086. I checked, and LOCK was part of original 8086, not just added with 386 for SMP. Would you use it for a network card that has to synchronize with the host CPU while they read and write buffers holding packets? Maybe from device memory on the network card, if not through DMA.Humiliating
@Adorl Your answer is very misleading because it only considers the nowadays rare case of a single core and gives OP a false sense of security. That's why I commented to consider the multi-core case, too.Boogie
@FUZxxl: I made an edit to clear up potential confusion for readers who didn't notice that this isn't talking about normal modern multicore CPUs. (And also be more specific about some stuff that supercat wasn't sure of). BTW, everything in this answer is already in mine, except the last sentence about how platforms where read-modify-write is atomic "for free" are rare.Humiliating
For this purpose, I think alignment will not matter. The interrupt will not occur with part of the instruction complete. It is a requirement for using the LOCK prefix though.Facia
@JDługosz: that's correct, alignment doesn't matter. It just makes the separate load and store operations properly atomic. (On P6 and newer, 64-bit or smaller loads and stores are atomic as long as they don't cross a cache-line boundary.) LOCK ADD works on unaligned addresses, but it's likely to be slow if it crosses a cache line boundary. IIRC, AMD's optimization manual says it takes a bus lock for LOCKed ops that aren't naturally aligned (instead of just a cache lock on the affected line).Humiliating
@JDługosz: If the operation is not aligned and one half of the target is in a write-protected page, it would be possible for an interrupt which occurs during the page fault to see lower half of the value written and the upper half not.Adorl
@supercat: your last comment isn't right. A page fault on a page-split store will cause both halves of the store to be cancelled before the exception is taken. Only stuff like AVX512 scatter, or rep stos, or other explicitly restartable instructions can architecturally partially complete, then take an exception. They leave register values updated to avoid storing the same data to the same place twice (architecturally). So when the page fault finishes, the instruction re-executes with different inputs that don't store to any locations that already completed.Humiliating
@PeterCordes: Interesting. Having instructions hold control over more than one page at a time complicates things, but I guess Intel was willing to pay that price. Has it always been thus, going back all the way to the 80386?Adorl
I think it would have to be. Without a store queue or a real pipeline, I assume 386 would just check permissions for both pages before doing either store at all, instead of just before letting them commit to L1D. (Probably in a slow fallback for page-split stores, maybe microcoded. Even on Broadwell, page-split loads/stores have a huge penalty, like one per 140 cycles throughput, vs. one per 5 cycles or so on Skylake, so it was only very recently that Intel finally built really optimized HW to handle page splits without a slow fallback to get it right.)Humiliating
@PeterCordes: On the 80386, any write that doesn't fall entirely within a single 32-bit word is guaranteed to result in two separate bus cycles for the store, in addition to any page-table lookups needed in case of a TLB miss. Starting the first bus cycle as soon as the first page lookup is complete would allow the TLB lookup for the second address to occur while that bus cycle is being performed. Perhaps Intel added an extra cycle penalty to such cases by waiting until the second TLB lookup occurs before operating upon the first address, or perhaps Intel used...Adorl
...TLB hardware that could look up two consecutive pages simultaneously (e.g. by having one TLB for even-numbered pages and one for odd pages), but treating the halves of a split store as independent would seem to allow a better combination of speed and complexity than doing both TLB lookups before the first store.Adorl
@supercat: IDK what the perf penalty is on 386. I expect the delay is more likely; a split TLB would be less efficient for some access patterns. Page splits in existing code must have been rare when P6 was designed, otherwise it wouldn't have sucked so much all the way until Skylake. While searching for that, I found an Intel 386SX data sheet that confirms that a page fault prevents "transferring the entire memory quantity". I didn't find any more specific mention of page splits, but the TLB is a single 32-entry one.Humiliating
Even on most multi-core operating systems, I think there are ways that programs can specify that they should use only a single core. If there's enough other work to keep all the cores busy, eliminating the need for multi-processor coordination may improve overall efficiency [the single-core task may not be able to run as fast as it would if run multi-core, but would take less total CPU time away from other tasks].Adorl
Re: earlier discussion of how many pages might need to be active at once for an x86 CPU to make forward progress: 6, for one step of a 2-byte rep movsd or 32-bit mode movsw with page-crossing misaligned src and dst. Do x86 instructions require their own encoding as well as all of their arguments to be present in memory at the same time?Humiliating
T
9

Even if your compiler always emitted this as an atomic operation, accessing num from any other thread concurrently would constitute a data race according to the C++11 and C++14 standards and the program would have undefined behavior.

But it is worse than that. First, as has been mentioned, the instruction generated by the compiler when incrementing a variable may depend on the optimization level. Secondly, the compiler may reorder other memory accesses around ++num if num is not atomic, e.g.

#include <memory>
#include <thread>
#include <vector>

int main()
{
  std::unique_ptr<std::vector<int>> vec;
  int ready = 0;
  std::thread t{[&]
    {
       while (!ready);
       // use "vec" here
    }};
  vec.reset(new std::vector<int>());
  ++ready;
  t.join();
}

Even if we assume optimistically that ++ready is "atomic", and that the compiler generates the checking loop as needed (as I said, it's UB and therefore the compiler is free to remove it, replace it with an infinite loop, etc.), the compiler might still move the pointer assignment, or, even worse, the initialization of the vector, to a point after the increment operation, causing chaos in the new thread. In practice, I would not be surprised at all if an optimizing compiler removed the ready variable and the checking loop completely, as this does not affect observable behavior under language rules (as opposed to your private hopes).

In fact, at last year's Meeting C++ conference, I heard from two compiler developers that they very gladly implement optimizations that make naively written multi-threaded programs misbehave, as long as language rules allow it, if even a minor performance improvement is seen in correctly written programs.

Lastly, even if you didn't care about portability, and your compiler was magically nice, the CPU you are using is very likely of a superscalar CISC type and will break down instructions into micro-ops, reorder and/or speculatively execute them, to an extent only limited by synchronizing primitives such as (on Intel) the LOCK prefix or memory fences, in order to maximize operations per second.

To make a long story short, the natural responsibilities of thread-safe programming are:

  1. Your duty is to write code that has well-defined behavior under language rules (and in particular the language standard memory model).
  2. Your compiler's duty is to generate machine code which has the same well-defined (observable) behavior under the target architecture's memory model.
  3. Your CPU's duty is to execute this code so that the observed behavior is compatible with its own architecture's memory model.

If you want to do it your own way, it might just work in some cases, but understand that the warranty is void, and you will be solely responsible for any unwanted outcomes. :-)

PS: Correctly written example:

#include <atomic>
#include <memory>
#include <thread>
#include <vector>

int main()
{
  std::unique_ptr<std::vector<int>> vec;
  std::atomic<int> ready{0}; // NOTE the use of the std::atomic template
  std::thread t{[&]
    {
       while (!ready);
       // use "vec" here
    }};
  vec.reset(new std::vector<int>());
  ++ready;
  t.join();
}

This is safe because:

  1. The checks of ready cannot be optimized away according to language rules.
  2. The ++ready happens-before the check that sees ready as not zero, and other operations cannot be reordered around these two operations. This is because ++ready and the check are sequentially consistent, another term defined in the C++ memory model, and sequential consistency forbids exactly this kind of reordering. Therefore the compiler must not reorder the instructions, and must also tell the CPU that it must not, e.g., postpone the write to vec to after the increment of ready. Sequentially consistent is the strongest guarantee regarding atomics in the language standard. Lesser (and theoretically cheaper) guarantees are available e.g. via other methods of std::atomic<T> (a rough sketch of one such variant follows below), but these are definitely for experts only, and may not be optimized much by the compiler developers, because they are rarely used.
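Purely as an illustration of those weaker guarantees, here is a sketch (not part of the original answer) assuming release/acquire ordering is sufficient, which it is here because the waiting thread only needs to see the writes that happened before the flag was set:

#include <atomic>
#include <memory>
#include <thread>
#include <vector>

int main()
{
  std::unique_ptr<std::vector<int>> vec;
  std::atomic<int> ready{0};
  std::thread t{[&]
    {
       // acquire load: once it reads 1, the vec.reset() below is visible
       while (!ready.load(std::memory_order_acquire));
       // use "vec" here
    }};
  vec.reset(new std::vector<int>());
  ready.store(1, std::memory_order_release); // release store: publishes the reset above
  t.join();
}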
Tierratiersten answered 8/9, 2016 at 17:17 Comment(1)
If the compiler couldn't see all uses of ready, it would probably compile while (!ready); into something more like if(!ready) { while(true); }. Upvoted: a key part of std::atomic is changing the semantics to assume asynchronous modification at any point. Having it be UB normally is what allows compilers to hoist loads and sink stores out of loops.Humiliating
B
8

Back in the day when x86 computers had one CPU, the use of a single instruction ensured that interrupts would not split the read/modify/write; and as long as the memory was not also used as a DMA buffer, it was atomic in fact (and C++ did not mention threads in the standard, so this wasn't addressed).

When it was rare to have a dual processor (e.g. dual-socket Pentium Pro) on a customer desktop, I effectively used this to avoid the LOCK prefix on a single-core machine and improve performance.

Today, it would only help against multiple threads that were all set to the same CPU affinity, so the threads you are worried about would only come into play via a time slice expiring and the other thread then running on the same CPU (core). That is not realistic.

With modern x86/x64 processors, the single instruction is broken up into several micro-ops, and furthermore memory reads and writes are buffered. So different threads running on different CPUs will not only see this as non-atomic but may see inconsistent results concerning what they read from memory and what they assume other threads have read up to that point in time: you need to add memory fences to restore sane behavior.
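As a rough sketch (not part of the original answer) of what getting both the atomicity and the ordering back looks like today: routing the increment through std::atomic typically compiles to a lock-prefixed instruction on x86, and the LOCK prefix also acts as a full memory barrier. The counter name below is made up for illustration.

#include <atomic>

std::atomic<int> shared_counter{0}; // hypothetical shared counter

void increment()
{
    // Typically compiles to `lock add` (or `lock xadd`) on x86: the read-modify-write
    // is atomic and surrounding memory accesses are not reordered across it.
    shared_counter.fetch_add(1);
}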

Bellied answered 9/9, 2016 at 14:48 Comment(15)
Interrupts still don't split RMW operations, so they do still synchronize a single thread with signal handlers that run in the same thread. Of course, this only works if the asm uses a single instruction, not separate load/modify/store. C++11 could expose this hardware functionality, but it doesn't (probably because it was only really useful in uniprocessor kernels to synchronize with interrupt handlers, not in user-space with signal handlers). Also, most non-x86 architectures don't have read-modify-write memory-destination instructions. Still, it could just compile like a relaxed atomic RMW on non-x86.Humiliating
Though as I recall, using the LOCK prefix wasn't absurdly expensive until the superscalars came along. So there was no reason to notice it as slowing down the important code in a 486, even though it was not needed by that program.Facia
Yes sorry! I didn't actually read carefully. I saw the start of the paragraph with the red herring about decoding to uops, and didn't finish reading to see what you actually said. re: 486: I think I've read that the earliest SMP was some kind of Compaq 386, but its memory-ordering semantics weren't the same as what the x86 ISA currently says. The current x86 manuals may even mention SMP 486. They certainly weren't common even in HPC (Beowulf clusters) until PPro / Athlon XP days, though, I think.Humiliating
I recall an editorial in Computer Language or somewhere around the 386/486 era reporting on a multi-CPU board, and how under Unix it changed everything since there are lots of different processes running around. But prior to Zortech, running a C++ compiler on a '386-era machine was highly impractical, despite a few brands bringing Cfront-based implementations to market.Facia
Your first two paragraphs describe exactly the trick I intuitively had in mind in my original question (my interest is theoretical). Just didn't formulate it within a single-core system. Ok, so, let's isolate this from C++ and let's say I wrote that piece of asm by hand, and I'm on a single core system, so threads context-switch for CPU time. In this scenario, am I right to assume that the add instruction will, in fact, be atomic (in the sense that includes its effects being observable as a single step), and I could avoid LOCKing add, like you say you did in the past? Also, @PeterCordes?Inning
Is there any reason the all-related-threads-have-CPU-affinity should be "unrealistic"? If one has enough different jobs that it would be possible to keep all cores busy even if many particular tasks can each only be run on one CPU at a time (e.g. four CPUs could be running four independent tasks), I would think that being able to designate tasks as being limited to one CPU and then eliminate all intra-task memory interlocks could often be a performance win.Adorl
@Adorl with preemptive threads you will lose efficiency running one that needs to wait for the other to get out of a critical section. With multiple cores we've moved to low-overhead spinlocks. To do what you suggest you'll go all the way to non-preemptive constructs like fibers or await. Something I've not considered is embedded systems and asynchronous signals; that might work to reduce overhead on the locking, under well-understood conditions.Facia
@LeoHeinsaar yes: interrupts always go between asm instructions and flush the pipelines etc. so it appears as an old one-instruction-at-a-time CPU to software under those specified conditions. And only the little helper function needs to be hand coded in asm.Facia
@LeoHeinsaar: Yes, non-locked ADD works then. @ supercat's answer makes the same point, and so does the uniprocessor section in my answer. Unless there are any DMA/device observers (e.g. you're writing a driver for a network card that also sets flags to indicate that a packet buffer contains a valid packet now), in which case you do need the lock add to make it really atomic with respect to non-CPU observers in the system. See also the 2nd half of this answer, and the discussion in comments there where non-CPU observers are discussed.Humiliating
@supercat: interesting point. Yes, that would maybe reduce overhead when you're using threads for i/o or similar things, so they usually sleep while not holding any locks. You might tune a system like that by adding sched_yield() calls at points where no locks are held. @ JDługosz: I think you're over-stating the downside here. If critical sections are short and infrequent, threads won't usually just happen to use up their timeslice while inside a critical section. This could be a good way to deliver the convenience of threads for non-CPU-bound programs.Humiliating
@JDługosz: If a lock is used only to synchronize operations within a single process on a single core, and a thread tries to enter a lock that has been acquired by another thread, there's no need to spend CPU time "waiting" for anything--transfer execution immediately to the thread that holds the lock.Adorl
@PeterCordes Ok. Sure, assuming also no DMA/device observers - didn't fit in the comment area to include that one also. Thanks JDługosz for excellent addition (answer as well as comments). Really completed the discussion.Inning
@Leo: One key point that hasn't been mentioned: out-of-order CPUs do reorder things internally, but the golden rule is that for a single core, they preserve the illusion of instructions running one at a time, in order. (And this includes interrupts that trigger context switches). Values might be electrically stored into memory out of order, but the single core that everything is running on keeps track of all the reordering it does itself, to preserve the illusion. This is why you don't need a memory barrier for the asm equivalent of a = 1; b = a; to correctly load the 1 you just stored.Humiliating
@JDługosz: If a process will only use a single core, and the locking primitives are designed for that, the time required for a thread switch on the contested-lock case may be less than the time required to acquire and release a lock that was held by another core but is otherwise uncontested.Adorl
Update to my last comment: storing into memory is externally visible, and some ISAs (notably x86) have memory-order rules that govern that visibility. Also, on any CPU, mis-speculated stores must not become visible to anything outside the core. The solution to decoupling store visibility from execution order is a store buffer. (Extra loads from cacheable regions of DRAM aren't considered a visible side-effect, thus device memory / MMIO regs should be mapped uncacheable.)Humiliating
E
4

No. https://www.youtube.com/watch?v=31g0YE61PLQ (That's just a link to the "No" scene from "The Office")

Do you agree that this would be a possible output for the program:

sample output:

100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100
100

If so, then the compiler is free to make that the only possible output for the program, in whichever way the compiler wants, i.e. a main() that just prints out 100s.

This is the "as-if" rule.

And regardless of output, you can think of thread synchronization the same way - if thread A does num++; num--; and thread B reads num repeatedly, then a possible valid interleaving is that thread B never reads between num++ and num--. Since that interleaving is valid, the compiler is free to make that the only possible interleaving. And just remove the incr/decr entirely.
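As a hypothetical sketch of that scenario (not from the original answer; num is assumed to be a std::atomic<int> so that the program itself is well-defined):

#include <atomic>

std::atomic<int> num{0};

void threadA()
{
    num++;   // another thread may, but is not required to, observe the intermediate 1
    num--;
}

void threadB()
{
    for (;;)
    {
        int v = num.load();
        // An execution in which v is never 1 is a valid interleaving of the two threads.
        (void)v;
    }
}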

There are some interesting implications here:

while (working())
    progress++;  // atomic, global

(i.e. imagine some other thread updates a progress bar UI based on progress)

Can the compiler turn this into:

int local = 0;
while (working())
    local++;

progress += local;

Probably that is valid. But probably not what the programmer was hoping for :-(

The committee is still working on this stuff. Currently it "works" because compilers don't optimize atomics much. But that is changing.

And even if progress was also volatile, this would still be valid:

int local = 0;
while (working())
    local++;

while (local--)
    progress++;

:-/

Ethelind answered 9/9, 2016 at 14:0 Comment(4)
This answer seems to only be answering the side-question that Richard and I were pondering. We eventually resolved it: turns out that yes, the C++ standard does allow merging of operations on non-volatile atomic objects, when it doesn't break any other rules. Two standards-discussion documents discuss exactly this (links in Richard's comment), one using the same progress-counter example. So it's a quality-of-implementation issue until C++ standardizes ways to prevent it.Humiliating
Yeah, my "No" is really a reply to the whole line of reasoning. If the question is just "can num++ be atomic on some compiler/implementation", the answer is sure. For example, a compiler could decide to add lock to every operation. Or some compiler+uniprocessor combination where neither did reordering (ie "the good ol' days") everything is atomic. But what's the point of that? You can't really rely on it. Unless you know that's the system you are writing for. (Even then, better would be that atomic<int> adds no extra ops on that system. So you should still write standard code...)Ethelind
Note that And just remove the incr/decr entirely. isn't quite right. It's still an acquire and release operation on num. On x86, num++;num-- could compile to just MFENCE, but definitely not nothing. (Unless the compiler's whole-program analysis can prove that nothing sychronizes with that modification of num, and that it doesn't matter if some stores from before that are delayed until after loads from after that.) E.g. if this was an unlock and re-lock-right-away use-case, you still have two separate critical sections (maybe using mo_relaxed), not one big one.Humiliating
@PeterCordes ah yes, agreed.Ethelind
C
3

That a single compiler's output, on a specific CPU architecture, with optimizations disabled (since gcc doesn't even compile ++ to add when optimizing in a quick & dirty example), seems to imply incrementing this way is atomic doesn't mean this is standard-compliant (you would cause undefined behavior when trying to access num from a thread), and it is wrong anyway, because add is not atomic on x86.

Note that atomics (using the lock instruction prefix) are relatively heavy on x86 (see this relevant answer), but still remarkably cheaper than a mutex, which isn't very appropriate in this use case anyway.

The following results are taken from clang++ 3.8 when compiling with -Os.

Incrementing an int by reference, the "regular" way:

void inc(int& x)
{
    ++x;
}

This compiles into:

inc(int&):
    incl    (%rdi)
    retq

Incrementing an int passed by reference, the atomic way:

#include <atomic>

void inc(std::atomic<int>& x)
{
    ++x;
}

This example, which is not much more complex than the regular way, just gets the lock prefix added to the incl instruction - but be careful: as previously stated, this is not cheap. Just because the assembly looks short doesn't mean it's fast.

inc(std::atomic<int>&):
    lock            incl    (%rdi)
    retq
Constrain answered 8/9, 2016 at 19:30 Comment(0)
T
2

Yes, but...

Atomic is not what you meant to say. You're probably asking the wrong thing.

The increment is certainly atomic. Unless the storage is misaligned (and since you left alignment to the compiler, it is not), it is necessarily aligned within a single cache line. Short of special non-caching streaming instructions, each and every write goes through the cache. Complete cache lines are being atomically read and written, never anything different.
Smaller-than-cacheline data is, of course, also written atomically (since the surrounding cache line is).

Is it thread-safe?

This is a different question, and there are at least two good reasons to answer with a definite "No!".

First, there is the possibility that another core might have a copy of that cache line in L1 (L2 and upwards is usually shared, but L1 is normally per-core!), and concurrently modifies that value. Of course that happens atomically, too, but now you have two "correct" (correctly, atomically, modified) values -- which one is the truly correct one now?
The CPU will sort it out somehow, of course. But the result may not be what you expect.

Second, there is memory ordering, or, worded differently, happens-before guarantees. The most important thing about atomic instructions is not so much that they are atomic. It's ordering.

You have the possibility of enforcing a guarantee that everything that happens memory-wise is realized in some guaranteed, well-defined order where you have a "happened before" guarantee. This ordering may be as "relaxed" (read as: none at all) or as strict as you need.

For example, you can set a pointer to some block of data (say, the results of some calculation) and then atomically release the "data is ready" flag. Now, whoever acquires this flag will be led into thinking that the pointer is valid. And indeed, it will always be a valid pointer, never anything different. That's because the write to the pointer happened-before the atomic operation.
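A minimal sketch of that pattern (not from the original answer; the names and the Result type are made up for illustration, and release/acquire ordering is assumed to be what is wanted):

#include <atomic>

struct Result { int value; };

Result computed;                           // hypothetical block of calculated data
std::atomic<Result*> data_ready{nullptr};  // hypothetical "data is ready" flag/pointer

void producer()
{
    computed.value = 42;                                     // the calculation
    data_ready.store(&computed, std::memory_order_release);  // writes above happen-before this store
}

void consumer()
{
    Result* p;
    while ((p = data_ready.load(std::memory_order_acquire)) == nullptr) { }
    // The acquire load synchronizes with the release store, so p is valid
    // and p->value is guaranteed to be 42 here.
    int v = p->value;
    (void)v;
}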

Telophase answered 8/9, 2016 at 18:7 Comment(7)
The load and the store are each atomic separately, but the entire read-modify-write operation as a whole is definitely not atomic. Caches are coherent, so can never hold conflicting copies of the same line (en.wikipedia.org/wiki/MESI_protocol). Another core can't even have a read-only copy while this core has it in the Modified state. What makes it non-atomic is that the core doing the RMW can lose ownership of the cache line between the load and the store.Humiliating
Also, no, whole cache lines aren't always transferred around atomically. See this answer, where it's experimentally demonstrated that a multi-socket Opteron makes 16B SSE stores non-atomic by transferring cache lines in 8B chunks with hypertransport, even though they are atomic for single-socket CPUs of the same type (because the load/store hardware has a 16B path to L1 cache). x86 only guarantees atomicity for separate loads or stores up to 8B.Humiliating
Leaving alignment to the compiler does not mean that memory will be aligned on a 4-byte boundary. Compilers can have options or pragmas to change the alignment boundary. This is useful, for example, for operating on tightly-packed data in network streams.Springlet
Sophistries, nothing else. An integer with automatic storage which isn't part of a struct as shown in the example will absolutely positively be correctly aligned. Claiming anything different is just outright silly. Cache lines as well as all PODs are PoT (power-of-two) sized and aligned -- on any non-illusory architecture in the world. Math has it that any properly aligned PoT fits into exactly one (never more) of any other PoT of the same size or larger. My statement is therefore correct.Telophase
@Damon: yes, I agree that any sane system will align an int static global or automatic local to whatever alignment is necessary for a separate load or store to be atomic. Your comment is correct. But your answer still uses the wrong reasoning to explain what happens. The ADD instruction doesn't atomically read-modify-write a cache line (without a LOCK prefix), and conflicting copies of the same cache line can't exist. And my previous comment proves that cache lines aren't transferred atomically in all x86 systems. (Although I think that's a rare special case, and they are in most x86)Humiliating
The ordering part of the answer is correct, but it's unrelated to whether num++ is atomic or not: memory_order_relaxed atomic increments will still get the right total (even on ARM where that allows different asm). x86 atomic RMW always requires a LOCK prefix, which gives mo_seq_cst semantics because it's also a full memory barrier. ARM, OTOH, can use load-linked/store-conditional to do an atomic RMW without even acquire/release semantics, but cache coherency still makes sure that multiple threads incrementing a shared counter this way get the "expected" total.Humiliating
@Damon, the example given in the question does not mention a struct, but it doesn't narrow the question to just the situations where integers are not parts of structs. PODs most definitely can have PoT size and not be PoT aligned. Take a look at this answer for syntax examples: https://mcmap.net/q/14869/-what-is-the-meaning-of-quot-__attribute__-packed-aligned-4-quot. So it's hardly a "sophistry" because PODs declared in such way are used in networking code quite a bit in real-life code.Springlet
N
-2

When your compiler uses only a single instruction for the increment and your machine is single-threaded, your code is safe. ^^

Nel answered 6/5, 2017 at 6:2 Comment(0)
G
-3

Try compiling the same code on a non-x86 machine, and you'll quickly see very different assembly results.

The reason num++ appears to be atomic is that on x86 machines, incrementing a 32-bit integer is, in fact, atomic (assuming no memory retrieval takes place). But this is neither guaranteed by the C++ standard, nor is it likely to be the case on a machine that doesn't use the x86 instruction set. So this code is not cross-platform safe from race conditions.

You also don't have a strong guarantee that this code is safe from race conditions even on an x86 architecture, because x86 doesn't set up loads and stores to memory unless specifically instructed to do so. So if multiple threads tried to update this variable simultaneously, they might end up incrementing cached (outdated) values.

The reason, then, that we have std::atomic<int> and so on is so that when you're working with an architecture where the atomicity of basic computations is not guaranteed, you have a mechanism that will force the compiler to generate atomic code.

Gatecrasher answered 8/9, 2016 at 14:44 Comment(5)
"is because on x86 machines, incrementing a 32-bit integer is, in fact, atomic." can you provide link to documentation that proofs it?Laryngeal
It isn't atomic on x86 either. It's single-core-safe, but if there are multiple cores (and there are) it's not atomic at all.Collegian
Is x86 add actually guaranteed atomic? I wouldn't be surprised if register increments were atomic, but that's hardly useful; to make the register increment visible to another thread it needs to be in memory, which would require additional instructions to load and store it, removing the atomicity. My understanding is that this is why the lock prefix exists for instructions; the only useful atomic add applies to dereferenced memory, and uses the lock prefix to ensure the cache line is locked for the duration of the operation.Caudillo
@Laryngeal @Harold @Caudillo I updated the answer. add is atomic, but I made clear that that doesn't imply that the code is race-condition safe, because changes don't become globally visible right away.Gatecrasher
@Gatecrasher that makes it "not atomic" by definition thoughCollegian
