What's the difference between T, volatile T, and std::atomic<T>?

Given the following sample that intends to wait until another thread stores 42 in a shared variable shared without locks and without waiting for thread termination, why would volatile T or std::atomic<T> be required or recommended to guarantee concurrency correctness?

#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>

int main()
{
  int64_t shared = 0;
  std::thread thread([&shared]() {
    shared = 42;
  });
  while (shared != 42) {
  }
  assert(shared == 42);
  thread.join();
  return 0;
}

With GCC 4.8.5 and default options, the sample works as expected.

Winstead answered 24/3, 2021 at 21:52 Comment(5)
Why I voted to reopen: IMHO, this question and answer add value because they discuss the topic based on a specific example, rather than broadly and without specifics, which would make the question harder to answer and force answers to be more general. Also, because there is a sample, the answers can provide evidence to support claims about how volatile and std::atomic are supposed to work.Winstead
Agreed. Specific should trump general except when the specific essentially repeats general. The former dupe should live on as a good reference, though. Concurrency: Atomic and volatile in C++11 memory modelSansom
GCC's default options don't include optimization, so every variable is treated somewhat like volatile. That part is basically a duplicate of Multithreading program stuck in optimized mode but runs normally in -O0, and also When to use volatile with multi threading?Stane
Only the atomic version is correct; the others have UB due to data-race accesses (even with volatile).Gratt
I don't think the top paragraph of the question is a good place for self-justification. If anything, a comment (which already exists). Even a note at the bottom of the question would be too much, IMO, but maybe. Especially if you keep it much shorter, e.g. "this question is mostly a placeholder and example for my answer". (Where "my answer" can be a link to the help page explaining that self-answers are good, if you feel you need that. But I think at this point with the question having a positive score and close votes currently pending, you don't need further justification.)Stane

The test does not reveal any problems, but that does not mean the sample is correct - it is not. Similar code could easily end up in production and might even run flawlessly for years.

We can start off by compiling the sample with -O3. Now, the sample hangs indefinitely. (The default is -O0: no optimization and debug-consistent code generation, which treats every variable somewhat like volatile. That is why the test didn't reveal the code as unsafe.)

To get to the root cause, we have to inspect the generated assembly. First, the x86_64 assembly produced by GCC 4.8.5 at -O0, corresponding to the un-optimized, working binary:

        // Thread B:
        // shared = 42;
        movq    -8(%rbp), %rax
        movq    (%rax), %rax
        movq    $42, (%rax)

        // Thread A:
        // while (shared != 42) {
        // }
.L11:
        movq    -32(%rbp), %rax     # Check shared every iteration
        cmpq    $42, %rax
        jne     .L11

Thread B executes a simple store of the value 42 in shared. Thread A reads shared for each loop iteration until the comparison indicates equality.

Now, we compare that to the -O3 outcome:

        // Thread B:
        // shared = 42;
        movq    8(%rdi), %rax
        movq    $42, (%rax)

        // Thread A:
        // while (shared != 42) {
        // }
        cmpq    $42, (%rsp)         # check shared once
        je      .L87                # and skip the infinite loop or not
.L88:
        jmp     .L88                # infinite loop
.L87:

The optimizations enabled by -O3 replaced the loop with a single comparison and, if the values are not equal, an infinite loop - which preserves the single-threaded semantics of the original loop. With GCC 10.2, even the infinite loop is optimized out. (Unlike in C, infinite loops with no side effects or volatile accesses are undefined behaviour in C++.)

The problem is that the compiler and its optimizer are not aware that shared is used for inter-thread communication. Consequently, the optimizer concludes that shared cannot change in thread A - the loop is equivalent to dead code. (Or to put it another way, data races are UB, and the optimizer is allowed to assume that the program never encounters UB. If you're reading a non-atomic variable, that must mean nobody else is writing it. This is what allows compilers to hoist loads out of loops, and similarly to sink stores, which are very valuable optimizations for the normal case of non-shared variables.)
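
To make that reasoning concrete, here is a sketch of the transformation the optimizer is effectively allowed to perform (the helper function is hypothetical and purely illustrative; it is not actual compiler output):

#include <cstdint>

// Illustrative only: what the non-atomic wait loop may effectively become.
void wait_for_42(int64_t& shared)
{
  // Original:  while (shared != 42) { }
  // After hoisting the load out of the loop (legal, because a concurrent
  // write would be a data race and therefore UB):
  int64_t tmp = shared;   // single load before the loop
  while (tmp != 42) {     // condition can never change...
  }                       // ...so the compiler may even drop the loop entirely
}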

The solution requires us to communicate to the compiler that shared is involved in inter-thread communication. One way to accomplish that is volatile. While the exact meaning of volatile varies across compilers and any guarantees are compiler-specific, the general consensus is that volatile prevents the compiler from caching volatile objects in registers and from eliding volatile accesses. This is essential for low-level code that interacts with hardware, and it has its place in concurrent programming, albeit with a downward trend since the introduction of std::atomic.
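
For reference, the volatile variant of the sample looks as follows (a sketch: it still contains a data race in ISO C++ terms, so it is not a correct fix, but it shows the source change the assembly below corresponds to):

#include <cassert>
#include <cstdint>
#include <thread>

int main()
{
  volatile int64_t shared = 0;   // the only change: shared is now volatile
  std::thread thread([&shared]() {
    shared = 42;
  });
  while (shared != 42) {         // every iteration must reload shared
  }
  assert(shared == 42);
  thread.join();
  return 0;
}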

With volatile int64_t shared, the generated instructions change as follows:

        // Thread B:
        // shared = 42;
        movq    24(%rdi), %rax
        movq    $42, (%rax)

        // Thread A:
        // while (shared != 42) {
        // }
.L87:
        movq    8(%rsp), %rax
        cmpq    $42, %rax
        jne     .L87

The loop can no longer be eliminated, because the compiler must assume that shared may have changed even though it sees no code that could change it. As a result, the sample now works with -O3.

If volatile fixes the issue, why would you ever need std::atomic? Two aspects relevant for lock-free code make std::atomic essential: atomicity of memory operations and memory order.

To build the case for load/store atomicity, we review the assembly generated by GCC 4.8.5 with -O3 -m32 (the 32-bit version) for volatile int64_t shared:

        // Thread B:
        // shared = 42;
        movl    4(%esp), %eax
        movl    12(%eax), %eax
        movl    $42, (%eax)
        movl    $0, 4(%eax)

        // Thread A:
        // while (shared != 42) {
        // }
.L88:                               # do {
        movl    40(%esp), %eax
        movl    44(%esp), %edx
        xorl    $42, %eax
        movl    %eax, %ecx
        orl     %edx, %ecx
        jne     .L88                # } while(shared ^ 42 != 0);

For 32-bit x86 code generation, 64-bit loads and stores are usually split into two instructions. For single-threaded code, this is not an issue. For multi-threaded code, it means that another thread can observe a partial ("torn") result of the 64-bit memory operation, leaving room for inconsistencies that do not show up every time but occur seemingly at random, with a probability heavily influenced by the surrounding code and software usage patterns. Even if GCC chose to generate instructions that guarantee atomicity by default, that still wouldn't affect other compilers and might not hold true for all supported platforms.
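
To make the tearing hazard concrete, here is a hypothetical demo (not part of the original answer) that deliberately keeps the data race. Built as a 32-bit binary (e.g. with -m32), the reader may observe a value that was never written, because the two 32-bit halves come from different writes:

#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <thread>

int main()
{
  volatile int64_t value = 0;               // 64-bit, intentionally non-atomic
  std::thread writer([&value]() {
    for (;;) {
      value = 0;                            // both halves zero
      value = -1;                           // both halves all-ones
    }
  });
  for (;;) {
    int64_t observed = value;               // racy 64-bit read, may be split
    if (observed != 0 && observed != -1) {  // a mix of the two writes
      std::printf("torn read: 0x%016llx\n",
                  static_cast<unsigned long long>(observed));
      std::exit(1);                         // demo only: exit without joining
    }
  }
}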

To guard against partial loads/stores in all circumstances and across all compilers and supported platforms, std::atomic can be employed. Let's review how std::atomic affects the generated assembly. The updated sample:

#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>

int main()
{
  std::atomic<int64_t> shared{0};  // explicit init: a default-constructed std::atomic is uninitialized before C++20
  std::thread thread([&shared]() {
    shared.store(42, std::memory_order_relaxed);
  });
  while (shared.load(std::memory_order_relaxed) != 42) {
  }
  assert(shared.load(std::memory_order_relaxed) == 42);
  thread.join();
  return 0;
}

The generated 32-bit assembly based on GCC 10.2 (-O3: https://godbolt.org/z/8sPs55nzT):

        // Thread B:
        // shared.store(42, std::memory_order_relaxed);
        movl    $42, %ecx
        xorl    %ebx, %ebx
        subl    $8, %esp
        movl    16(%esp), %eax
        movl    4(%eax), %eax       # function arg: pointer to  shared
        movl    %ecx, (%esp)
        movl    %ebx, 4(%esp)
        movq    (%esp), %xmm0       # 8-byte reload
        movq    %xmm0, (%eax)       # 8-byte store to  shared
        addl    $8, %esp

        // Thread A:
        // while (shared.load(std::memory_order_relaxed) != 42) {
        // }
.L9:                                # do {
        movq    -16(%ebp), %xmm1       # 8-byte load from shared
        movq    %xmm1, -32(%ebp)       # copy to a dummy temporary
        movl    -32(%ebp), %edx
        movl    -28(%ebp), %ecx        # and scalar reload
        movl    %edx, %eax
        movl    %ecx, %edx
        xorl    $42, %eax
        orl     %eax, %edx
        jne     .L9                 # } while(shared.load() ^ 42 != 0);

To guarantee atomicity for loads and stores, the compiler emits an 8-byte SSE2 movq instruction (to/from the bottom half of a 128-bit SSE register). Additionally, the assembly shows that the loop remains intact even though volatile was removed.

By using std::atomic in the sample, it is guaranteed that

  • std::atomic loads and stores are not subject to register-based caching
  • std::atomic loads and stores do not allow partial values to be observed

The C++ standard doesn't talk about registers at all, but it does say:

Implementations should make atomic stores visible to atomic loads within a reasonable amount of time.

While that leaves room for interpretation, caching the result of a std::atomic load across iterations - as the optimizer did in our sample without volatile or atomic - would clearly be a violation: the store might never become visible. Current compilers don't even optimize atomics within one block, e.g. two accesses of the same atomic in the same iteration.

On x86, naturally-aligned loads/stores (where the address is a multiple of the load/store size) are atomic up to 8 bytes without special instructions. That's why GCC is able to use movq.

atomic<T> with a large T may not be supported directly by hardware, in which case the compiler falls back to using a mutex.
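
A quick way to see which case applies (a sketch; the Big struct is a made-up example type): std::atomic reports whether a given T is handled lock-free on the target, and the library gives small atomics natural alignment so that single-instruction accesses are possible:

#include <atomic>
#include <cstdint>
#include <cstdio>

struct Big { char data[64]; };   // hypothetical large payload

static_assert(alignof(std::atomic<int64_t>) >= 8,
              "atomic<int64_t> is naturally aligned");

int main()
{
  std::printf("atomic<int64_t> lock-free: %d\n",
              std::atomic<int64_t>::is_always_lock_free ? 1 : 0);  // C++17
  std::atomic<Big> big{};        // likely falls back to a lock internally
  std::printf("atomic<Big>     lock-free: %d\n", big.is_lock_free() ? 1 : 0);
  return 0;                      // may need to link with -latomic on GCC
}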

A large T (e.g. the size of 2 registers) may on some platforms require an atomic RMW operation (if the compiler doesn't simply fall back to locking); atomic RMWs are sometimes provided for larger sizes than the largest pure load / pure store that's guaranteed atomic (e.g. on x86-64, lock cmpxchg16b, or an ARM ldrexd/strexd retry loop). Single-instruction atomic RMWs (like x86 uses) internally involve a cache-line lock or a bus lock. For example, older versions of clang -m32 for x86 will use lock cmpxchg8b instead of movq for an 8-byte pure load or pure store.

What's the second aspect mentioned above, and what does std::memory_order_relaxed mean? Both the compiler and the CPU can reorder memory operations to improve efficiency. The primary constraint on reordering is that all loads and stores must appear to have been executed in the order given by the code (program order), as seen from the executing thread itself. Therefore, in case of inter-thread communication, the memory order must be taken into account to establish the required ordering despite reordering attempts. The required memory order can be specified for std::atomic loads and stores. std::memory_order_relaxed does not impose any particular order.
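
As a contrast to relaxed ordering, here is a minimal release/acquire sketch (a hypothetical producer/consumer handoff, not part of the original sample): the release store publishes data, and an acquire load that observes the flag as true is guaranteed to also see the preceding plain write:

#include <atomic>
#include <cassert>
#include <thread>

int main()
{
  int data = 0;
  std::atomic<bool> ready{false};

  std::thread producer([&]() {
    data = 42;                                     // plain write
    ready.store(true, std::memory_order_release);  // publish
  });

  while (!ready.load(std::memory_order_acquire)) { // wait for publication
  }
  assert(data == 42);                              // guaranteed to hold

  producer.join();
  return 0;
}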

Mutual exclusion primitives enforce a specific memory order (acquire-release ordering) so that memory operations stay within the lock scope and stores executed by previous lock owners are guaranteed to be visible to subsequent lock owners. Thus, when you use locks, all the aspects raised here are addressed by the locking facility itself. As soon as you step outside the comfort locks provide, you have to be mindful of the consequences and of the factors that affect concurrency correctness.
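
For comparison, a lock-based variant of the sample (a sketch): the mutex and condition variable take care of both visibility and ordering, so neither atomics nor volatile are needed:

#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <thread>

int main()
{
  int64_t shared = 0;
  std::mutex m;
  std::condition_variable cv;

  std::thread thread([&]() {
    {
      std::lock_guard<std::mutex> lock(m);
      shared = 42;
    }
    cv.notify_one();
  });

  std::unique_lock<std::mutex> lock(m);
  cv.wait(lock, [&]() { return shared == 42; });   // wakes once shared == 42

  thread.join();
  return 0;
}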

Being as explicit as possible about inter-thread communication is a good starting point so that the compiler is aware of the load/store context and can generate code accordingly. Whenever possible, prefer std::atomic<T> with std::memory_order_relaxed (unless the scenario calls for a specific memory order) to volatile T (and, of course, T). Also, whenever possible, prefer not to roll your own lock-free code to reduce code complexity and maximize the probability of correctness.

Winstead answered 24/3, 2021 at 21:52 Comment(21)
The test indicates that the sample is correct - I'd suggest phrasing it as "the test doesn't detect any problems". As you say, that doesn't mean there aren't any, but lockless code is notoriously hard to test, especially if you only have an x86 (where runtime reordering is limited, so only compile-time reordering can break some things, unlike on ARM or POWER).Stane
Re: compilers keeping atomic values in registers: the ISO C++ standard's "as-if" rule does apply to atomics, with the language in the standard only talking about possible values that program might or must observe under certain conditions. But it turns out to be more problematic than originally realized, so current compilers treat atomic<T> essentially like volatile atomic<T>: Why don't compilers merge redundant std::atomic writes? links WG21/P0062R1 and N4455.Stane
Thanks, @PeterCordes, I changed the sentence slightly to emphasise that the sample is indeed incorrect.Winstead
Re: when to use volatile (and why exactly cache coherency makes it even work at all in practice, despite not avoiding data-race UB in ISO C++): When to use volatile with multi threading? - pretty much never.Stane
My point about testing was that that's the wrong way to think about a test. A test that doesn't find a problem doesn't show that code is correct. This is fundamental in a language that has undefined behaviour, and especially for multi-threaded code when different test platforms can only exercise some kinds of reordering or stalls / thread sleeping / ordering. Especially if you aren't using a UB-sanitizer and/or race-detection simulator. (If you are using those things, you can be a lot more sure about successful test => no problem.)Stane
What platforms did you have in mind where a "cache-lock or bus-lock" is necessary? The phrasing seems to imply that software could explicitly take a cache-lock, but I'm pretty sure that's not what you meant because that's not how it works. (Something like x86 lock cmpxchg8b [mem] (or 16b in long mode) just asks the HW to do it atomically, and yes that will internally involve a cache-lock and unlock if it's not split across cache lines). So I find it strange that you bring all that up but never mention the fact that atomic RMWs are the main use-case for HW cache locking.Stane
Yes, @PeterCordes, thanks again. I know about that question and the associated discussion. The sentence in the standard saying that "implementations should make atomic stores visible to atomic loads within a reasonable amount of time" indicates to me that the as-if rule cannot apply in this case, because register caching would violate the notion of "making stores visible within a reasonable amount of time". You're definitely right in terms of redundant stores or the famous "progress bar" discussion (deferring the actual store until 100 % was reached), though.Winstead
I fully agree. A test has little value in terms of determining concurrency correctness.Winstead
Keeping a val in a register indefinitely is obviously ruled out by that, but there are lots of things compilers can do. e.g. an atomic store, and then reload of the same value in the same thread, could be assumed to always get the just-stored value. That's one of the possible values allowed by the standard, so the compiler could decide (at compile time) that's what always happens. Or more simply, for a couple loads of the same atomic var within the same loop iteration, with no other loads in between, it's legal for the compiler to decide they always get the same value. Or back 2 back storesStane
@PeterCordes The point I intended to raise is that the implementation of std::atomic might have to resort to locking a cache line by using atomic instructions to perform the load/store. Of course, that's what the CPU handles internally and nothing that could be controlled using the instruction set - at least there's no architecture that comes to mind that would provide that level of control. But yes, it does not sound like a consequence of the atomic instruction in the original text. Maybe I can reword that to clear this up. Thanks.Winstead
... cache locks and so on. IIRC, some ARM versions can only do an 8-byte atomic store when it's part of an LL/SC. (LDREXD/STREXD). Is that what you meant? That does have to commit the whole store to cache at once, even if that takes 2 cycles. But making the whole RMW atomic does not take a cache lock. The store simply fails if the whole RMW could have appeared non-atomic to another observer, by monitoring that cache line to see if we lost ownership of it, for example. Other ARMs can do 8-byte atomic load/store like x86, as a pure load or store (but with a pair of GP regs).Stane
@PeterCordes AFAICR, atomic instructions implied a bus lock before the P6 family of x86 processors. Starting with P6, the cache line is locked instead.Winstead
Yeah, I thought you were talking about locked instructions that do an atomic RMW. Aligned mov is an atomic instruction (Why is integer assignment on a naturally aligned variable atomic on x86?). If we're just talking about x86, Intel's documentation still just says it asserts a LOCK# signal, because the internals only matter for perf. (e.g. scalability of multiple cores each doing an atomic RMW on a different cache line, and latency of not going off-core). Cache-line-split locks are extremely slow, bus lock.Stane
Summing up, I did not mean to imply that a cache line lock is necessary, but might be required to guarantee atomicity depending on sizeof(T) and the target platform. The main point was: The implementation might cause an ordinary load/store to be generated (one end) or an actual high-level mutual exclusion primitive might be required (the other end of the spectrum).Winstead
Yeah, I thought so, but I found the phrasing clunky, and worried it could be confusing to readers who didn't already understand the CPU-architecture details, and how compilers implemented pure-load / pure-store for std::atomic for sizes small enough to be "always_lock_free".Stane
Overall good answer, although perhaps redundant with the existing answers on Multithreading program stuck in optimized mode but runs normally in -O0. Perhaps you'd want to post it there, although that question uses a different example, and you intentionally chose int64_t so you could use a 32-bit ISA to make a point about atomicity. Anyway, I'll edit and try to improve the phrasing of a couple things I brought up, and link other existing Q&As to support some relevant points you make.Stane
Maybe that's the misunderstanding and my mistake in not being more verbose. When I say atomic instruction in respect to x86, I refer to using the LOCK prefix. You're right that an aligned mov on x86 is atomic, but I wouldn't consider it to be an "atomic instruction" in the sense of a category of instructions that involve either a cache or bus lock and come with significant overhead.Winstead
So yes, what my text is missing here is the differentiation between atomic instructions as in an aligned mov (pure load/store) and atomic instructions as in read-modify-write instructions that require a cache line or bus lock - at least in the case of x86.Winstead
I think in the context of answering a question about std::atomic::load, it's best to use locked instruction when that's what you mean. "atomic instruction" is way more ambiguous. I assumed you were using it as a clumsy way to describe atomic RMW instructions, but readers that don't already understand this might not. If you want to include stuff like LL/SC retry loops for non-x86 ISAs, "atomic RMW instructions" would still include that, but "locked instructions" wouldn't. Your phrasing still strangely excludes LL/SC instructions because they don't take a cache or bus lock. :/Stane
Ok, I made a fairly substantial edit. You might want to tighten things back up and / or put them into your own words, if you liked your phrasing or formatting better for any of the parts I changed.Stane
@PeterCordes Thanks for the edit. The added references and the clarification in respect to the atomicity implementation definitely add value to the answer! Considering the minimal complexity of the assembly, the added comments might not have been necessary, but now it's definitely easier for the reader to follow the train of thought. Thanks again for taking the time to review and edit the answer!Winstead

If you do not use an explicit sharing construct, like those you mention, it is undefined when main() will see shared having a value of 42: please see "Optimisations and reordering" below. Even if your test does not see a problem: please check out "About your test" below!

In multi-threading, a test that gives the "right" answer is (almost) never a proof of correctness.

A "successful" test is at most anecdotal evidence There is simply too much to take into account, like:

  • The memory model: what is guaranteed and, more likely: what not!
  • Optimisations by compiler and CPU
  • Scheduling. For example, the thread can terminate anywhere from just before the while loop to inside the thread.join() call.
  • Run-time stuff like how many other threads and programs are running, how heavily the memory is used etc. This is both hardware and operating system dependent.
  • More things I forgot …

The only things you CAN trust are the guarantees that your language's memory model gives.

Fortunately, C++ has a memory model since C++11!

Unfortunately, that model does not give many guarantees. The compiler can generate code that is allowed to do anything, as long as the semantics of the program do not change as seen from a single-threaded perspective. That includes omitting code, postponing code, or changing the order in which things happen. The only exceptions are when you make guaranteed progress, or when you use explicit sharing constructs, like those you mentioned.

Debugging a multi-threaded situation is also extremely hard. Adding "debug code" to debug your program often changes its behaviour. For example, writing something to the standard output does I/O, which ensures progress. That can cause values to be visible by other threads, where that normally would not be the case!

Make sure you find out what the constructs you mention (atomics, volatile and mutexes) actually do. That way, you can build programs that behave perfectly predictably in multi-threaded circumstances.

About your test

For the fun of it, let's explore some interesting cases surrounding your test program.

Thread scheduling

The operating system decides when threads run and terminate.

It is perfectly acceptable that thread has already terminated even before the while loop in main() is executed. Because thread termination guarantees progress, the store to shared might become visible to main() before the while loop starts. In that case, the test seems successful. But if the scheduling is any different, the test might fail. You should never rely on scheduling.

Hence, even if your test does not see a problem, that is at most anecdotal evidence.

Optimisations and reordering

As @horsts excellent answer already indicates, the compiler and the CPU can optimise your code. Anything is allowed, as long as the program semantics do not change from the perspective of a single thread.

Imagine that you assign to a variable that you never read again in that thread (like you do in thread). The compiler can postpone the actual assignment for as long as it wants, because as far as the compiler can see, nothing in that thread depends on the value of shared. You must have guaranteed progress in your thread to ensure the assignment actually happens. In your example, this progress is only guaranteed when thread terminates: likely at the end of the thread function. Then again: you have no idea when the thread is scheduled to call your function.

Using constructs like atomic<> and volatile forces the compiler to generate code that does ensure predictable behaviour. If you know how to use them, you can make programs that can be shown to behave correctly in multi-threaded circumstances.

Dilatant answered 14/9, 2021 at 13:49 Comment(7)
However, even when the assignment does happen, it might never reach main memory. - But that won't stop other threads from seeing it. Systems that we can run multiple threads on have coherent cache. For CPU-architecture details, see When to use volatile with multi threading? (basically never), and Myths Programmers Believe about CPU Caches. It's normal for shared variables to stay hot in shared L3 cache, not actually getting written to/from slow DRAM.Stane
Yes, executing str reg, (mem) doesn't make it visible instantly, all modern CPUs worth putting into an SMP system have a store buffer. But a store buffer isn't like a write-back cache; it always tries to drain itself as quickly as possible, to make room for bursts of cache-miss stores to happen later. If an asm store instruction executes, other cores doing a load from the same address will see the new value within a few microseconds at worst, often more like 40ns. (Still many clock cycles). The question is compiler optimization like dead-store elimination or keeping a value in a register.Stane
(Other than that one misconception, nice answer. Would love to upvote once that's changed, but misinformation about CPU cache coherency is one of my pet peeves.)Stane
@PeterCordes maybe "assignment" is the wrong word, because of other optimisations. But I agree and will remove the section.Dilatant
Yeah, assignment is a high-level concept that might be implemented by a store instruction, for any assignment that isn't optimized away / into registers, i.e. that can even become visible across threads. From context though, I think it was fairly clear you meant "asm store happens", so that word choice wasn't a problem. But still, probably removing that section was the best bet, instead of trying to rewrite and describe how CPUs work.Stane
I do feel, from many questions asked here, that there should be a good answer to a good question that explains the whole story of ordering, happens-before/after relations, memory fences and cache coherence. I see many answers that have snippets, and that makes for a lot of duplicate yet partial answers. Is it naughty to ask a synthetic question to point this out? (synthetic meaning: "a question where the questioner knows the answer but asks anyway to help others").Dilatant
It's actually encouraged to post a self-answered Q&A, and yeah it can be ok if the question would be too broad to expect others to answer, as long as the answer is coherent. (There's even a UI option to post the answer along with the question, so people don't temporarily see a placeholder unanswered question. A couple of my few questions were posted that way, like int 0x80 on 64-bit code, and int->hex ASCII in asm / AVX-512.) Yeah, sounds fully reasonable and a good idea to try to write up a big-picture explanation, and I'd be happy to review it and maybe rewrite a paragraph or two.Stane
