Acquire/release semantics with 4 threads
I am currently reading C++ Concurrency in Action by Anthony Williams. One of his listings shows this code, and he states that the assertion that z != 0 can fire.

#include <atomic>
#include <thread>
#include <assert.h>

std::atomic<bool> x,y;
std::atomic<int> z;

void write_x()
{
    x.store(true,std::memory_order_release);
}

void write_y()
{
    y.store(true,std::memory_order_release);
}

void read_x_then_y()
{
    while(!x.load(std::memory_order_acquire));
    if(y.load(std::memory_order_acquire))
        ++z;
}

void read_y_then_x()
{
    while(!y.load(std::memory_order_acquire));
    if(x.load(std::memory_order_acquire))
        ++z;
}

int main()
{
    x=false;
    y=false;
    z=0;
    std::thread a(write_x);
    std::thread b(write_y);
    std::thread c(read_x_then_y);
    std::thread d(read_y_then_x);
    a.join();
    b.join();
    c.join();
    d.join();
    assert(z.load()!=0);
}

So the different execution paths that I can think of are these:

1)

Thread a (x is now true)
Thread c (fails to increment z)
Thread b (y is now true)
Thread d (increments z) assertion cannot fire

2)

Thread b (y is now true)
Thread d (fails to increment z)
Thread a (x is now true)
Thread c (increments z) assertion cannot fire

3)

Thread a (x is true)
Thread b (y is true)
Thread c (z is incremented) assertion cannot fire
Thread d (z is incremented)

Could someone explain to me how this assertion can fire?

He shows this little graphic: [image omitted]

Shouldn't the store to y also sync with the load in read_x_then_y, and the store to x sync with the load in read_y_then_x? I'm very confused.

EDIT:

Thank you for your responses. I understand how atomics work and how to use acquire/release; I just don't get this specific example. I was trying to figure out: IF the assertion fires, then what did each thread do? And why does the assertion never fire if we use sequential consistency?

The way I am reasoning about this is that if thread a (write_x) stores to x, then all the work it has done so far is synced with any other thread that reads x with acquire ordering. Once read_x_then_y sees this, it breaks out of the loop and reads y. Now, two things could happen. In one option, write_y has already written to y, meaning this release will sync with the if statement's load, so z is incremented and the assertion cannot fire. In the other option, write_y hasn't run yet, so the if condition fails and z isn't incremented; in this scenario, only x is true and y is still false. Once write_y runs, read_y_then_x breaks out of its loop, but by then both x and y are true, so z is incremented and the assertion does not fire. I can't think of any 'run' or memory ordering where z is never incremented. Can someone explain where my reasoning is flawed?

Also, I know the loop read will always come before the if-statement read, because the acquire prevents that reordering.

Fishman answered 22/1, 2018 at 14:31 Comment(4)
The example comes from this page en.cppreference.com/w/cpp/atomic/… and the explanation is there too and has to do with the fact that the compiler is free to reorder instructions when optimizing unless you provide the right semantics - even if it may not be doing so experimentally in the field with any particular implementation. You have to provide memory_order_seq_cst to avoid the assertion.Job
"Has written to y" does not mean this write is visible in the current thread.Kolivas
@Fishman Answer to your EDIT: Look at e.g. thread A. The guarantee you get by release-acquire semantics on x is that any store in thread A done before releasing x is visible to the thread acquiring x. Since the store to y is done in a different thread, you can get the scenario where the load of y in thread C is reordered before the acquire of x.Countermove
Related or duplicate: Will two atomic writes to different locations in different threads always be seen in the same order by other threads? My answer there explains how this can happen on real POWER hardware, and other answers explain that the C++ memory model allows it.Anglophile
You are thinking in terms of sequential consistency, the strongest (and default) memory order. If this memory order is used, all accesses to atomic variables constitute a total order, and the assertion indeed cannot be triggered.

However, in this program, a weaker memory order is used (release stores and acquire loads). This means, by definition, that you cannot assume a total order of operations. In particular, you cannot assume that changes become visible to other threads in the same order. (Only a total order on each individual variable is guaranteed, for any atomic memory order including memory_order_relaxed.)

The stores to x and y occur on different threads, with no synchronization between them. The loads of x and y occur on different threads, with no synchronization between them. This means it is entirely allowed that thread c sees x && !y and thread d sees y && !x. (I'm just abbreviating the acquire-loads here; don't take this syntax to mean sequentially consistent loads.)

Bottom line: Once you use a weaker memory order than sequentially consistent, you can kiss your notion of a global state of all atomics, that is consistent between all threads, goodbye. Which is exactly why so many people recommend sticking with sequential consistency unless you need the performance (BTW, remember to measure if it's even faster!) and are certain of what you are doing. Also, get a second opinion.

Now, whether you will get burned by this, is a different question. The standard simply allows a scenario where the assertion fails, based on the abstract machine that is used to describe the standard requirements. However, your compiler and/or CPU may not exploit this allowance for one reason or another. So it is possible that for a given compiler and CPU, you may never see that the assertion is triggered, in practice. Keep in mind that a compiler or CPU may always use a stricter memory order than the one you asked for, because this can never introduce violations of the minimum requirements from the standard. It may only cost you some performance – but that is not covered by the standard anyway.

UPDATE in response to comment: The standard defines no hard upper limit on how long it takes for one thread to see changes to an atomic by another thread. There is a recommendation to implementers that values should become visible eventually.

There are sequencing guarantees, but the ones pertinent to your example do not prevent the assertion from firing. The basic acquire-release guarantee is that if:

  • Thread e performs a release-store to an atomic variable x
  • Thread f performs an acquire-load from the same atomic variable
  • Then if the value read by f is the one that was stored by e, the store in e synchronizes-with the load in f. This means that any (atomic and non-atomic) store in e that was, in this thread, sequenced before the given store to x, is visible to any operation in f that is, in this thread, sequenced after the given load. [Note that there are no guarantees given regarding threads other than these two!]

So, there is no guarantee that f will read the value stored by e, as opposed to, e.g., some older value of x. If it doesn't read the updated value, then the load also does not synchronize with the store, and there are no sequencing guarantees for any of the dependent operations mentioned above.

I liken atomics with a lesser memory order than sequentially consistent to the Theory of Relativity, where there is no global notion of simultaneity.

PS: That said, an atomic load cannot just read an arbitrary older value. For example, if one thread performs periodic increments (e.g. with release order) of an atomic<unsigned> variable, initialized to 0, and another thread periodically loads from this variable (e.g. with acquire order), then, except for eventual wrapping, the values seen by the latter thread must be monotonically increasing. But this follows from the given sequencing rules: Once the latter thread reads a 5, anything that happened before the increment from 4 to 5 is in the relative past of anything that follows the read of 5. In fact, a decrease other than wrapping is not even allowed for memory_order_relaxed, but this memory order does not make any promises to the relative sequencing (if any) of accesses to other variables.

Xylophone answered 22/1, 2018 at 16:30 Comment(8)
But why do thread c and thread d see that? Doesn't acquire/release make changes to variables visible to other threads?Fishman
I think the last bullet is what fixed my misunderstanding. I thought that if you release a variable, then every thread is forced to see the updated value. It makes a lot more sense now. Even though x was synchronized (in read_x_then_y), and y was written, that thread doesn't have to see the new y value, because it isn't a synced relationship. Thank you so much!Fishman
Even if a thread writes to an atomic using release, it does not guarantee that the write is seen by other threads, but IF it is THEN it will participate in a synchronizes-with relationship. And sequential consistency is the one that makes all the threads see the latest value. Am I correct in assuming this?Fishman
You're welcome! "And sequential consistency is the one that makes all the thread see the latest value." – Not quite… there is still no simultaneous time between threads. What seq. cons. essentially means is that the order in which updates occur or become visible is consistent among threads – as if all atomic accesses were interleaved on one thread. There can still be a measurable delay however, and this delay can even differ between threads. So, x && y on one thread, and x && !y (or vice versa, but not both during the same execution) on another thread is still possible (for some time).Xylophone
@ArneVogel Coming from Java, where this is mainly abstracted away, I simply love this answer! Sometimes I wish release/acquire semantics were available to a wider range of Java devs, besides Unsafe (which is an internal class), even if that would mean a lot more bad code is spreadFactious
An excellent answer. Far too often, I read posts that claim immediate visibility of published values, as if there is some sort of temporal synchronization across processors and threads. This notion often rears its ugly head during interviews. Well done.Matriculation
"Only a total order on each individual variable is guaranteed for any atomic memory order". Can you clarify this statement?Bunkum
@Bunkum I can try. If you have two atomics x and y, then, for any memory order, all accesses to x form a total order, and all accesses to y form a total order, but these orders don't interact with one another (except for sequentially consistent memory order). Silly little analogy: Let's say Sally showered first and then had breakfast. Harry had a cup of coffee and then showered. Each person's schedule would form a total order in this example. But you can't say who showered first, for example, because these individual schedules don't interact.Xylophone
The release-acquire synchronization has (at least) this guarantee: side-effects before a release on a memory location are visible after an acquire on this memory location.

There is no such guarantee if the memory location is not the same. More importantly, there's no total (think global) ordering guarantee.

Looking at the example, thread A makes thread C come out of its loop, and thread B makes thread D come out of its loop.

However, the way a release may "publish" to an acquire (or the way an acquire may "observe" a release) on the same memory location doesn't require total ordering. It's possible for thread C to observe A's release and thread D to observe B's release, and only somewhere in the future for C to observe B's release and for D to observe A's release.


The example has 4 threads because that's the minimal example in which you can force such non-intuitive behavior. If any of the atomic operations were done in the same thread, there would be an ordering you couldn't violate.

For instance, if write_x and write_y happened on the same thread, it would require that whatever thread observed a change in y would have to observe a change in x.

Similarly, if read_x_then_y and read_y_then_x happened on the same thread, you would observe changes in both x and y, at least in read_y_then_x.

Having write_x and read_x_then_y in the same thread would be pointless for the exercise, as it would become obvious it's not synchronizing correctly; so would having write_x and read_y_then_x in the same thread, which would always read the latest x.


EDIT:

The way, I am reasoning about this is that if thread a (write_x) stores to x then all the work it has done so far is synced with any other thread that reads x with acquire ordering.

(...) I can't think of any 'run' or memory ordering where z is never incremented. Can someone explain where my reasoning is flawed?

Also, I know The loop read will always be before the if statement read because the acquire prevents this reordering.

That's sequentially consistent order, which imposes a total order. That is, it imposes that write_x and write_y both be visible to all threads one after the other; either x then y or y then x, but the same order for all threads.

With release-acquire, there is no total order. The effects of a release are only guaranteed to be visible to a corresponding acquire on the same memory location. With release-acquire, the effects of write_x are guaranteed to be visible to whoever notices x has changed.

This noticing something changed is very important. If you don't notice a change, you're not synchronizing. As such, thread C is not synchronizing on y and thread D is not synchronizing on x.

Essentially, it's way easier to think of release-acquire as a change notification system that only works if you synchronize properly. If you don't synchronize, you may or may not observe side-effects.

Hardware architectures with a strong memory model and cache coherence (even across NUMA nodes), and languages/frameworks that synchronize in terms of a total order, make it difficult to think in these terms, because it's practically impossible to observe this effect on them.

Villegas answered 22/1, 2018 at 15:59 Comment(3)
The problem is that if one thread doesn't see the second variable's update, then the other thread will. I have added an edit on my 'thinking' of how the code will run. Could you please tell me what's wrong with it?Fishman
So pretty much what you are saying is: if a thread writes to a variable x with release, and later another thread reads x with acquire (like the if statement), there is no guarantee that the second thread will read the new value. But if it does (e.g. if we put it in a loop to keep checking), then it 'syncs' up, and all the writes from the first thread are visible?Fishman
Yes. However, if all writes are release and all reads are acquire, the value you read will always be correct somewhere in time; that is, you'll always read a value that was written, not a random, undefined or incorrect value.Villegas
Let's walk through the parallel code:

void write_x()
{
    x.store(true,std::memory_order_release);
}

void write_y()
{
    y.store(true,std::memory_order_release);
}

There is nothing before these instructions (they are at the start of parallelism, and everything that happened before them already happens-before the other threads), so they are not meaningfully releasing: they are effectively relaxed operations.

Let's walk through the parallel code again, noting that these two previous operations are not effective releases:

void read_x_then_y()
{
    while(!x.load(std::memory_order_acquire)); // acquire what state?
    if(y.load(std::memory_order_acquire))
        ++z;
}

void read_y_then_x()
{
    while(!y.load(std::memory_order_acquire));
    if(x.load(std::memory_order_acquire))
        ++z;
}

Note that all the loads refer to variables for which nothing is ever effectively released, so nothing is effectively acquired here: we only re-acquire visibility of the previous operations in main, which are visible already.

So you see that all operations are effectively relaxed: they provide no visibility (over what was already visible). It's like doing an acquire fence just after an acquire fence; it's redundant. Nothing new is implied that wasn't already implied.

So now that everything is relaxed, all bets are off.

Another way to view this is to notice that an atomic load is not an RMW operation that happens to leave the value unchanged: an RMW can be a release, and a load cannot.

Just like all atomic stores are part of the modification order of an atomic variable even if the variable is effectively a constant (that is, a non-const variable whose value is always the same), an atomic RMW operation sits somewhere in the modification order of an atomic variable even if there was no change of value (and here there cannot be a change of value, because the code always compares and copies the exact same bit pattern).

In the modification order you can have release semantic (even if there was no modification).

If you protect a variable with a mutex you get release semantic (even if you just read the variable).

If you make all your loads (at least in functions that do more than one operation) release-modification-loads with:

  • either a mutex protecting the atomic object (then drop the atomic as it's now redundant!)
  • or a RMW with acq_rel order,

the previous proof that all operations are effectively relaxed doesn't work anymore, and some atomic operation in at least one of the read_A_then_B functions will have to be ordered before some operation in the other, as they operate on the same objects. If they are in the modification order of a variable and you use acq_rel, then you have a happens-before relation between one of them and the other (obviously, which one happens before which is non-deterministic).

Either way, execution is now sequential, as all operations are effectively acquire and release operations (even those that would otherwise be effectively relaxed!).

Janka answered 18/11, 2019 at 15:21 Comment(10)
So now that everything is relaxed, all bets are off. Not quite: in standardese 6.9.2.1 12: "An evaluation A strongly happens before an evaluation D if, either ... A is sequenced before D, or ..." In this case, two loads in the same thread in separate statements are sequenced. That means " Note: Informally, if A strongly happens before B, then A appears to be evaluated before B in all contexts." For ordered acquire loads, that rules out LoadLoad reordering. Although perhaps the as-if rule can still allow the compiler to do whole-program optimization and decide what they "load".Anglophile
quoted from eel.is/c++draft/intro.multithread#intro.races-12. But anyway, the C++ standard doesn't talk about any "global state" that loads read from. Anyway, we don't need LoadLoad reordering for this, just IRIW reordering which can happen in practice on PowerPC. Without seq-cst, it's allowed that different threads can't agree on a global total order of stores to different objects, even if they control for load ordering. Stores are allowed to become visible to some threads before all threads, although the majority of ISAs only do all threadsAnglophile
@PeterCordes "then A appears to be evaluated before B in all contexts" yeah I have no idea what that means so...Janka
That's in an "informal" note. It seems pretty clear that the implication is that any possible way of observing the ordering would see the guaranteed order. But beyond that note, we'd have to look in the rest of the standard for any normative language that refers back to that definition. e.g. eel.is/c++draft/atomics.order#4 uses "A strongly happens before B" as part of a requirement it's describing.Anglophile
Anyway, we can probably weasel a justification for almost anything out of this example, but the clear purpose of the example is to demonstrate IRIW reordering: non-seq_cst readers don't have to agree on the order of two independent stores. (In theory the stores probably have to be seq_cst as well for that guarantee to hold). Even if there was enough other code to make the acquire operations meaningful, like we'd expect from preshing.com/20120913/acquire-and-release-semantics which show them in terms of all actions by a thread happening in some order.Anglophile
@PeterCordes Do you claim that IRIW reordering is possible in all programs that don't use seq_cst ops?Janka
In all programs that have independent atomic stores by separate threads, yes. Of mainstream C++ ISAs, it's apparently only observable in practice on some POWER hardware (via store-forwarding between SMT threads). Many programs would never notice it, and much delay between the two loads would probably give time for a store to become globally visible. But yes, in pure ISO C++ acq_rel isn't enough to stop IRIW reordering; only seq-cst guarantees a global total order that all threads can agree on for operations on multiple objects regardless of target-ISA effects.Anglophile
@PeterCordes Does the "strongly happens before" thing apply to anything non seq_cst?Janka
I searched for the term in C++17 n4659. Other uses: All static initialization strongly happens before (4.7.1) any dynamic initialization. and other initialization / termination clauses. I think the quote I found using the term to say something about seq_cst was new language introduced since C++17, in the working draft which eel.is/c++draft serves up. But the definition applies to anything sequenced-before or synchronized-with, not just seq-cst. (i.e. not the producers in this IRIW case.)Anglophile
@PeterCordes This was too much for a comment section so I posted a new questionJanka
If we change the two if statements to while statements, the code becomes correct and z is guaranteed to equal 2.

void read_x_then_y()
{
    while(!x.load(std::memory_order_acquire));
    while(!y.load(std::memory_order_acquire));
    ++z;
}

void read_y_then_x()
{
    while(!y.load(std::memory_order_acquire));
    while(!x.load(std::memory_order_acquire));
    ++z;
}
Jackboot answered 27/2, 2022 at 18:18 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.