Does an atomic read guarantee reading the latest value?
In C++ we have the keyword volatile and the atomic class template. The difference between them is that volatile does not guarantee thread-safe concurrent reading and writing, but just ensures that the compiler will not store the variable's value in a cache and instead will load the variable directly from memory, while atomic guarantees thread-safe concurrent reading and writing.

As we know, an atomic read operation is indivisible, i.e. no thread can write a new value to the variable while one or more threads are reading its value, so I think that we always read the latest value, but I'm not sure :)

So, my question is: if we declare an atomic variable, do we always get its latest value when calling the load() operation?

Ligan answered 28/10, 2018 at 14:4 Comment(9)
"but just ensures that compiler will not store variable's value in cache and instead will load variable from memory" - That's not it, really. It just ensures the value is accessed "strictly according to the rules of the abstract machine". The C++ standard and its abstract machine know nothing of caches. So you can't assume that there's no cache access.Extol
this is not what atomic affords: it affords that you will read a consistent value (that means you will always read a value that was actually set, never a transient half-written one). But that does not mean the latest value... the race remains, but at least the values are well-formedAsperges
stackoverflow.com/questions/36496692/…Jotunheim
@Asperges I think I got you, but in what cases do we get a stale value rather than the latest one?Ligan
without synchronization, threads can run in any order, so one may be sleeping for a long time while another never stops running; you can simulate this by adding a sleep() in one thread (that is a common way to expose races)Asperges
@Asperges to be precise, it affords "immediate" consistency. read operation will eventually return the latest value no matter if it is atomic or not, therefore non-atomic read operation is referred to as "eventually consistent"Surat
@Surat -- there is no requirement in the C++ language that non-atomic read operations will eventually return the latest value. If two or more threads are accessing the same object and at least one of those threads is modifying the object you have a data race; the behavior of a program that has a data race is undefined. It may well be that in practice you'll eventually see the latest value, but that's outside what the language definition provides.Dunsany
This question is somewhat problematic because the concept of 'latest' is not well-defined for plain stores and loads. If thread 1 changes an atomic variable from 'A' to 'B' and a load in thread 2 returns the value 'A', either the load did not return the latest value or it was scheduled before the store... there is no way you can tell.Baptistery
@Baptistery If some thread changes a variable from 'A' to 'B' and then creates a file, and another thread sees the file and still reads 'A', it's well defined that it wasn't the latest value. But in general "latest" is ill-defined.Preston
When we talk about memory access on modern architectures, we usually ignore the "exact location" the value is read from.

A read operation can fetch data from a cache (L1/L2/...), from RAM, or even from the hard drive (e.g. when the memory is swapped out).

These keywords tell the compiler which assembly operations to use when accessing the data.

volatile

A keyword that tells the compiler to always read the variable's value from memory, and never from a register.

This "memory" can still be the cache, but if that "address" in the cache is considered "dirty", meaning that the value has been changed by a different processor, the value will be reloaded.

This ensures we never read a stale value.

Clarification: According to the standard, if the volatile type is not a primitive whose read/write operations are atomic by nature (with regard to the assembly instructions that read/write it), one might read an intermediate value (the writer managed to write only half of the bytes by the time the reader read it). However, modern implementations do not behave this way.

atomic

When the compiler sees a load (read) operation, it basically does the exact same thing it would have done for a volatile value.

So, what is the difference?

The difference is cross-CPU write operations. When working with a volatile variable, if CPU 1 sets the value, and CPU 2 reads it, the reader might read an old value.

But, how can that be? The volatile keyword promises that we won't read a stale value!

Well, that's because the writer didn't publish the value! And though the reader tries to read it, it reads the old one.

When the compiler stumbles upon a store (write) operation for an atomic variable it:

  • Sets the value atomically in memory
  • Announces that the value has changed

After the announcement, all the CPUs will know that they should re-read the value of the variable because their caches will be marked "dirty".

This mechanism is very similar to operations performed on files. When your application writes to a file on the hard-drive, other applications may or may not see the new information, depending on whether or not your application flushed the data to the hard-drive.

If the data wasn't flushed, it merely resides somewhere in your application's caches and is visible only to itself. Once you flush it, anyone who opens the file will see the new state.

Clarification: Common modern compiler & cache implementations ensure correct publishing of volatile writes as well. However, this is NOT a reason to prefer volatile over std::atomic. For example, just as some comments pointed out, Linux's atomic reads and writes for x86_64 are implemented using volatile.

Extraction answered 28/10, 2018 at 15:15 Comment(13)
"writer didn't publish the value" Does that mean that writer has modified value, but not has written value to variable (like read-modify-write steps)? And so reader will see old value because writer hasn't done write step?Ligan
Yes, the writer will see the modified value, but others won't. A variable is syntactic sugar for programmers, so you can't really say that the value "has not been written to the variable"; you should rather say it "has not been published to other CPUs". It's just like the flush() command for hard-disk operations. When you write something to a file, others don't see it because you didn't flush it to the hard-disk; instead, it sits somewhere in your app's caches, visible only to itself.Extraction
But what if the writer cannot write the value to the variable in a single instruction? What is the behaviour in that case, if there are two threads: the first thread writes a value to the variable, and the second thread reads the value of this variable? Both operations are atomic. So I guess the compiler will use CAS loops: if the second thread cannot atomically read the value (because some thread is writing it), it will retry in a loop, because an atomic read guarantees that half-way values will not be returned from load(). Or maybe the first thread will retry its write later, allowing the second thread to read the value?Ligan
When the CPU performs an atomic operation, it always ensures that the write+publish are performed atomically.Extraction
I mean, "latest" is not simple to understand. If the reader performed its read first, it got the latest value for the moment of reading, even if the writer wrote a new value later. So the value returned to the first thread will be the latest relative to the first thread, but not the latest relative to the second, writer thread. Am I right?Ligan
Latest value is absolute. At the moment of reading, at that atomic time in the space-time-continuum, the value that is the most-up-to-date, is the one that the reader will ultimately see.Extraction
All real-world C++ implementations run threads across cores with coherent caches, so "publishing" happens automatically; MESI cache coherency requires a core to invalidate other copies of a cache line to get exclusive ownership before it can modify (commit a store). This is why volatile worked in the past (and still does in the Linux kernel) as a roll-your-own memory_order_relaxed load or store. When to use volatile with multi threading? explains more.Zippel
In ISO C++, data race on a volatile is undefined behaviour, so it's theoretically possible for a system like you describe to exist, but that isn't how any real systems work. (Not systems that std::thread will run across, anyway: there are systems with non-coherent shared memory between heterogeneous CPUs, for example.)Zippel
@PeterCordes, a long time ago, I wrote real-time embedded C/C++ code with non-state-of-the-art compilers. Relying on features that are described as "undefined behaviour" is not my cup of tea. I've seen too many cases where code relied on "observed" behaviour and paid the price for that. If you have a feature that was built to solve a problem, use it, don't use some semi-defined non-straightforward way of implementing the same thing, just because you're smart.Extraction
Yes, of course you should use atomic in modern C and C++ programs where it's available. Nowhere did I say otherwise. (Except for Linux kernel code, where you should follow the coding standards it uses, and use their macros which happen to be defined using volatile because that's effectively guaranteed to work the way they want on the compilers they care about.) Anyway, your answer does try to define the behaviour of what would happen if you used volatile for multi-threading, but parts of what it says (about stale values) are incorrect for at least 99.999% of real systems, probably 100%.Zippel
TL:DR: Please stop spreading misinformation about how real-world CPUs work. That's a useful thing for people to understand to reason about performance, and while debugging what happened when they did accidentally have a data race or something. (And BTW, my own answer on When to use volatile with multi threading? clearly starts with "never"; I wrote it to explain how CPUs work, and why volatile was usable in the bad old days before C++11 when we didn't have anything well-defined.)Zippel
@PeterCordes, you are right. I was describing the hard definitions and guarantees of the standard, whereas the actual implementation might be more forgiving. I updated my answer to reflect that.Extraction
Thanks, that's an improvement. BTW, an interesting example of GCC choosing to avoid tearing for volatile but not for an equivalent non-volatile is shown in Nate's answer. Without volatile, GCC for AArch64 x = 0xdeadbeefdeadbeef; on a uint64_t uses stp (store-pair) of the same half twice. That's not guaranteed atomic before ARMv8.5, although in practice I'd guess it probably is for a 64-bit store of two halves. But with volatile, we get a single 64-bit str (godbolt.org/z/8vejMTeen), just like relaxed atomic<uint64_t>Zippel
As ComicSansMS's answer says, "latest" requires some definition of simultaneity. Understanding how hardware cache coherence works can give you a better idea of what you're going to get in practice and why the C++ standard doesn't technically guarantee the "latest value" for operations that aren't serialized. Atomic RMW operations on the same atomic variable are necessarily serialized, hence there is a "latest value" guarantee for those, but that doesn't make it better if you just need to read.

e.g. maybe 40 nanoseconds for a store in one core to invalidate (MESI) the cache line before it can commit its store, so no other cores have a cached value they can read. (Of course they could have loaded at some earlier time before the invalidate, with out-of-order exec, but that's a small time window and blocking it would hurt the common fast case a lot.)

There's also a C++ guarantee that a consistent modification order exists for each atomic variable separately. And if you've seen one value for that variable, later reads in the same thread are guaranteed to see that value or a later one. (Read-read coherence and so on; see 6.9.2.2:19, [intro.races], in the standard.)

A load will see a very recent value if there are ongoing stores

If there was only one recent store, it will see it or not

On real systems, if it was longer ago than maybe 100 nanoseconds, or maybe a microsecond or two in really high contention cases, loads in other threads will see it. (Where the time of the store is what an rdtsc would have seen if you'd done one in the same thread as the store. i.e. before it even retires and sends out a request to other cores to invalidate their copies.)

i.e. I'm proposing a definition of simultaneity where the writer and reader both run an rdtsc instruction within a few cycles of when their store and load executes in the out-of-order back end. That's very different from when readers can actually expect to see stores from other threads.

Even a seq_cst atomic RMW doesn't wait for other cores to drain their store buffers (or make it happen any faster) to make executed but not committed stores visible, so it's not fundamentally better.


Re: "latest value" concerns, see the following.


Another answer on this question suggests that stale data would be possible if the compilers didn't emit extra asm to explicitly "publish" stored data (make it globally visible). But all real systems have coherent cache across all the cores that C++ std::thread will start threads across. It's hypothetically possible to have std::thread run across cores with non-coherent shared memory, but would be extremely slow. See When to use volatile with multi threading? - never, obsoleted by C++11, but legacy code (and the Linux kernel) still use volatile to roll their own atomics.

Just a plain store instruction in assembly creates inter-core visibility because hardware is cache-coherent, using MESI. That's what you get from volatile. No "publish" is necessary. If you want this core to wait until the store is globally visible before doing later loads/stores, that's what a memory barrier does, to create ordering between this store and operations on other objects. Nothing to do with guaranteeing or speeding up visibility of this store.

The default std::memory_order is seq_cst; plain volatile is like relaxed on C++ implementations where it works for hand-rolled atomics. In ISO C++ volatile has undefined behaviour on data races, only atomic makes that safe. But real implementations, other than clang -fsanitize=thread or similar, don't do race detection.

Of course don't actually use volatile for threading. I mention this only to help understanding of how CPUs work, for thinking about performance and to help debugging accidental data races. C/C++11 made volatile obsolete for that purpose. Unless you're writing Linux kernel code (and then use their macros which just happen to use volatile under the hood).

Zippel answered 29/7, 2022 at 19:52 Comment(0)
if we declare atomic variable, do we always get the latest value of the variable calling load() operation?

Yes, for some definition of latest.

The problem with concurrency is that it is not possible to reason about the order of events in the usual way. This comes from a fundamental limitation of the hardware: the only way to establish a global order of operations across multiple cores would be to serialize them (eliminating all of the performance benefits of parallel computation in the process).

What modern processors provide instead is an opt-in mechanism to re-establish order between certain operations. Atomics are the language-level abstraction for that mechanism. Imagine a scenario in which two atomic<int>s a and b are shared between threads (and let's further assume they were initialized to 0):

std::atomic<int> a{0}, b{0};

// thread #1
a.store(1);
b.store(1);

// thread #2
while (b.load() == 0) { /* spin */ }
assert(a.load() == 1);

The assertion here is guaranteed to hold. Thread #2 will observe the "latest" value of a.

What the standard does not talk about is when exactly the loop will observe the value of b changing from 0 to 1. We know it will happen some time after the write by thread #1 and we also know it will happen after the write to a. But we don't know how long after.

This kind of reasoning is further complicated by the fact that different threads are allowed to disagree about when certain writes took place. If you switch to a weaker memory ordering, one thread may observe writes to distinct atomic variables happening in a different order than what is observed by another thread.

Campuzano answered 29/10, 2018 at 9:56 Comment(1)
"Yes, for some definition of latest." :)Asperges
