Are seq-cst fences exactly the same as acq-rel fences in absence of seq-cst loads?
I'm trying to understand the purpose of std::atomic_thread_fence(std::memory_order_seq_cst); fences, and how they're different from acq_rel fences.

So far my understanding is that the only difference is that seq-cst fences affect the global order of seq-cst operations ([atomics.order]/4). And said order can only be observed if you actually perform seq-cst loads.

So I'm thinking that if I have no seq-cst loads, then I can replace all my seq-cst fences with acq-rel fences without changing the behavior. Is that correct?

And if that's correct, why am I seeing code like this "implementation of Dekker's algorithm with fences", which uses seq-cst fences while keeping all atomic reads/writes relaxed? Here's the code from that blog post:

#include <atomic>

std::atomic<bool> flag0(false),flag1(false);
std::atomic<int> turn(0);

void p0()
{
    flag0.store(true,std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst);

    while (flag1.load(std::memory_order_relaxed))
    {
        if (turn.load(std::memory_order_relaxed) != 0)
        {
            flag0.store(false,std::memory_order_relaxed);
            while (turn.load(std::memory_order_relaxed) != 0)
            {
            }
            flag0.store(true,std::memory_order_relaxed);
            std::atomic_thread_fence(std::memory_order_seq_cst);
        }
    }
    std::atomic_thread_fence(std::memory_order_acquire);
 
    // critical section


    turn.store(1,std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_release);
    flag0.store(false,std::memory_order_relaxed);
}

void p1()
{
    flag1.store(true,std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst);

    while (flag0.load(std::memory_order_relaxed))
    {
        if (turn.load(std::memory_order_relaxed) != 1)
        {
            flag1.store(false,std::memory_order_relaxed);
            while (turn.load(std::memory_order_relaxed) != 1)
            {
            }
            flag1.store(true,std::memory_order_relaxed);
            std::atomic_thread_fence(std::memory_order_seq_cst);
        }
    }
    std::atomic_thread_fence(std::memory_order_acquire);
 
    // critical section


    turn.store(0,std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_release);
    flag1.store(false,std::memory_order_relaxed);
}
Eugeneeugenia answered 4/1, 2022 at 10:46 Comment(15)
It looks like it's merely combining thread fences for store (release) and load (acquire) into a single full memory barrier, thus ensuring that the flag0 change is committed to memory before loading flag1 from memory. I don't imagine there would be a problem with replacing a seq-cst barrier with a rel-acq barrier pair.Unthrone
@Unthrone Looking at the first fence in each function, it seems neither seq-cst nor acq-rel fence would do the right thing, and the only option is to (remove the fence and) use seq-cst for the first store and the first load.Eugeneeugenia
But I'm not sure if it's a problem in the real world, or just a weakness of C++ memory model formalism.Eugeneeugenia
Just for the record godbolt.org/z/odjaorGcq shows that GCC compiles std::atomic_thread_fence(std::memory_order_acq_rel); to zero instructions for x86, rather than an mfence or equivalent, so it's not preventing StoreLoad reordering. So if your hypothesis is right, that would mean nothing (?) in a program would need to block StoreLoad reordering across acq_rel fences to satisfy any of the ISO C++ requirements that apply to programs without SC loads.Heise
Are you counting the read part of SC RMWs as part of "SC loads"? A failed compare_exchange_weak with seq_cst fail order, might be relevant, or might just be equivalent to fetch_add(0), I forget.Heise
@PeterCordes I'm very new to this, so I don't have a definitive answer. I think yes, RMWs might count.Eugeneeugenia
IIRC, Dekker's algorithm relies on blocking StoreLoad reordering so threads can read back what's actually in memory, not store-forward their own view. If that code is truly safe and guaranteed by ISO C++'s formal rules, then the fences aren't equivalent. The alternative explanation is that it just happens to work when compiled to asm for real ISAs (by compilers that don't optimize atomics, and thus will emit a fence instruction for a seq_cst C++ fence). I forget the details of how ISO C++'s rules connect SC fences to the "happens-before" relations that govern what other threads are allowed to observe :/Heise
@PeterCordes I think it's the latter, it just happens to work. But I wonder if I'm approaching this from the wrong side, trying to learn the standard's abstraction first, and not what it actually compiles to.Eugeneeugenia
From my understanding, for the purposes of "happens before", acq-rel and seq-cst operations are equivalent. Seq-cst only adds extra constraints on what seq-cst loads are allowed to see and in what order.Eugeneeugenia
I definitely found it helpful to understand HW memory models like x86's that are defined in terms of accesses to a shared coherency domain that truly exists whether any readers are looking, with reordering allowed or not. e.g. preshing.com/20120913/acquire-and-release-semantics (And the fact that some ISAs, like PowerPC, allow private forwarding of data between cores, creating IRIW reordering). Then from there you can see how the C++ formalism is put together to give equivalent guarantees without reference to such details.Heise
(Also see preshing.com/20120710/…) One pitfall of that approach is that it took me a couple years to realize that a seq_cst store operation (not fence) doesn't have to be a full memory barrier on the spot, or to prevent reorder with later relaxed ops; ARM64 STLR's special interaction to not reorder with LDAR makes its asm seq_cst only just as strong as ISO C++ requires, not a lot stronger like most other ISAs, thus more efficient. See The strong-ness of x86 store instruction wrt. SC-DRF?Heise
Thanks, I'll take a look.Eugeneeugenia
So it appears the difference between seq-cst and acq-rel is simply that seq-cst is "global" when working with atomic variables. When it comes to using thread fences, surely acq-rel would effectively be global, since it's not concerned with any specific variables? Or is the compiler free to look at which variables are used around the fence and optimise accordingly?Unthrone
@Unthrone At least from the standard's point of view, an acq-rel fence effectively elevates any preceding atomic reads (most notably relaxed ones) to acquires, and any following atomic writes to releases. So at least theoretically, it's not global.Eugeneeugenia
@Den-Jason: A fence separates all possible future accesses before / after it in the same thread, not just in this function, so being non-global and instead promoting certain ops to acquire or release is only a very theoretical possibility for any real code that involves branches or loops.Heise
As I understand it, they're not the same, and a counterexample is below. I believe the error in your logic is here:

And said order can only be observed if you actually perform seq-cst loads.

I don't think that's true. In [atomics.order] p4, which defines the axioms of the sequential consistency total order S, items 2-4 may all involve operations that are not seq_cst. You can observe the coherence ordering between such operations, and this can let you infer how the seq_cst operations have been ordered.

As an example, consider the following version of the StoreLoad litmus test, akin to Peterson's algorithm:

std::atomic<bool> a{false}, b{false};  // explicitly initialized; the default constructor does not zero-initialize before C++20

void thr1() {
    a.store(true, std::memory_order_seq_cst);
    std::atomic_thread_fence(std::memory_order_seq_cst);
    if (b.load(std::memory_order_relaxed) == false)
        std::cout << "thr1 wins";
}

void thr2() {
    b.store(true, std::memory_order_seq_cst);
    std::atomic_thread_fence(std::memory_order_seq_cst);
    if (a.load(std::memory_order_relaxed) == false)
        std::cout << "thr2 wins";
}

Note all the loads are relaxed.

I claim that if thr1 prints "thr1 wins", then we deduce that a.store(true) preceded b.store(true) in the sequential consistency order S.

To see this, let A be b.load() and B be b.store(true). If b.load() == false then we have that A is coherence-ordered before B. (Apply atomics.order p3.3 with A,B as above, and X the initialization of b to false.)

Now let X be the fence in thr1. Then X happens before A by sequencing, so X precedes B in S by atomics.order p4.3. That is, the thr1 fence precedes b.store(true). And a.store(true), which is also seq_cst, precedes the thr1 fence, because the store strongly happens before the fence by sequencing. So by transitivity, a.store(true) precedes b.store(true), as claimed.

Similarly, if thr2 prints, then b.store(true) precedes a.store(true). They can't both precede each other, so we have therefore proved that the program cannot print both messages.

If you downgrade the fences to acq_rel, the proof breaks down. In that case, as far as I can see, nothing prevents the program from printing thr1 wins even if b.store(true) precedes a.store(true) in the order S. As such, with acq_rel fences, I believe it is allowed for both threads to print. Though I'm not sure whether there is any real implementation where it could actually happen.


We can get an even simpler example if we make all the loads and stores relaxed, so that the only seq_cst operations are the fences. Then we can use (4.4) instead to show that if b.load(relaxed) returns false, the thr1 fence precedes the thr2 fence, and vice versa if a.load() returns false. We therefore conclude, as before, that the program cannot print both messages.

However, if we keep the loads and stores relaxed and weaken the fences to acq_rel, it is then more clear that we have lost this guarantee. Indeed, with a little prodding, a similar example actually fails on x86, where the acq_rel fence is a no-op because the loads/stores are already acquire/release. So that's a clearer case where a seq_cst fence really achieves something that acq_rel does not.

Erelong answered 5/1, 2022 at 15:49 Comment(4)
Ok, I agree that if T1 prints thr1 wins, then the seq-cst order is a = true, T1 fence, b = true, T2 fence. But how does that order stop a.load(relaxed) in T2 from returning the old value of a before the assignment?Eugeneeugenia
@HolyBlackCat: Because if a.load(relaxed) in T2 returned false, the same logic would tell us that the seq_cst order is instead b = true, T2 fence, a = true, T1 fence. It can't be both.Erelong
@HolyBlackCat: Or to say it another way, if the seq-cst order is a=true, T1 fence, b=true, T2 fence, then the contrapositive of (4.3) says that a.load(relaxed) is not coherence-ordered before a=true. So by (3.3), a.load cannot read the value stored by the initialization X. The only other value it can read is the one stored by a=true, namely true.Erelong
An acq_rel fence would definitely not be sufficient here; look at how it compiles for a real ISA like x86, where fence(acq_rel) compiles to zero asm instructions, not preventing StoreLoad reordering. The only question that doesn't answer is whether the seq_cst fence preventing StoreLoad reordering on x86 is an implementation detail of it compiling to mfence, or something the ISO standard actually guarantees. (I think the latter, although I didn't double-check your reasoning based on the standard. That sounds familiar from last I looked at the rules of how fences connect.)Heise
Before C++20, the C++ standard contained the following paragraphs to define visibility in terms of the single total order S (see [atomics.order]):

For an atomic operation B that reads the value of an atomic object M, if there is a memory_order_seq_cst fence X sequenced before B, then B observes either the last memory_order_seq_cst modification of M preceding X in the total order S or a later modification of M in its modification order.

For atomic operations A and B on an atomic object M, where A modifies M and B takes its value, if there is a memory_order_seq_cst fence X such that A is sequenced before X and B follows X in S, then B observes either the effects of A or a later modification of M in its modification order.

For atomic operations A and B on an atomic object M, where A modifies M and B takes its value, if there are memory_order_seq_cst fences X and Y such that A is sequenced before X, Y is sequenced before B, and X precedes Y in S, then B observes either the effects of A or a later modification of M in its modification order.

However, WG21/P0668R5 proposed two changes that were accepted for C++20: first, it weakens some guarantees of the single total order (it no longer needs to be consistent with "happens-before"), and second (and more importantly for this question), it strengthens the guarantees provided by seq-cst fences. For that they replaced Section 32.4 [atomics.order] paragraphs 3-7 (the definition of SC ordering) with the wording referenced in your question. P0668R5 states:

This new wording takes a significantly different approach with respect to the sequential consistency ordering S: Instead of defining visibility in terms of S and the other orders in the standard, this essentially defines constraints on S in terms of visibility in a particular execution, as expressed by the coherence order. If these constraints are not satisfiable by any total order S, then the candidate execution which gave rise to the coherence order is not valid.

To be honest I find this new definition to be less clear, but I suppose I just need to read through it a few more times and try to wrap my head around it. :) However, since this is supposed to strengthen the guarantees, the visibility guarantees described in previous standards should still hold.

Seq-cst fences are also part of the single total order S, so any two seq-cst fences are always totally ordered. Suppose we have a write operation A that is sequenced before a seq-cst fence X, and a read operation B that is sequenced after a seq-cst fence Y. If X is ordered before Y in S, then B is guaranteed to observe the value written by A, or some later value (this is based on the last paragraph from the first quote). An acq-rel fence does not provide any such guarantees.

Acq-rel fences basically convert surrounding relaxed loads/stores into acquire/release operations, but acquire/release only provides visibility guarantees for surrounding operations once an acquire-load has synchronized with a release-store, i.e., only once the load has already observed the stored value. It provides no guarantee about when the store becomes visible.

Betweenwhiles answered 6/1, 2022 at 18:26 Comment(7)
I've seen this proposal when dealing with this question, but apparently I stopped reading before they got to fences. :) As for the new definition, try applying it to the example in Nate's answer (and check out the comments).Eugeneeugenia
So.... presumably an acq-rel fence is equivalent to an acquire-fence immediately followed by a release-fence? But a release-fence immediately followed by an acquire-fence would behave differently; equivalent to a seq-cst fence?Unthrone
@Unthrone No, by combining acquire- and release-fences you can only get the same guarantees as an acq-rel-fence. Why do you think a release-fence immediately followed by an acquire-fence would behave differently?Betweenwhiles
@Betweenwhiles because you'd be ensuring the stores are committed to memory before ensuring the loads are read through. Which is presumably a better scenario than the other way round.Unthrone
I think there needs to be some guide written somewhere which identifies intended scenarios for the different load/store/fence types, since the existing documentation is clearly not lucid enough.Unthrone
@Unthrone basically an acquire-fence converts all atomic reads that are sequenced after the fence into acquire operations. Likewise, the release-fence converts all atomic writes that are sequenced before the fence into release operations. See [atomic.fences].Betweenwhiles
It can be helpful to experiment a bit and see how these things are translated to specific assembly code to get a better understanding. However, in general I recommend avoiding trying to reason about such translations, and instead focusing on the definitions from the C++ standard. (Also see my answer here #64383173)Betweenwhiles
To summarize the argument made by Nate Eldredge in their answer:


The global seq-cst order by itself doesn't seem to affect anything other than seq-cst loads, but the order must exist even if there are no loads.

Certain operation reorderings (like, in Nate's example, both threads entering the critical section) would impose contradicting constraints on the seq-cst order, and are therefore not allowed.

I'm not entirely sure whether this effect can be observed without seq-cst fences (i.e. with only seq-cst stores); I think not.

Eugeneeugenia answered 5/1, 2022 at 21:15 Comment(1)
Interesting to note that without seq_cst fences or seq_cst loads, ARMv8.3 compiler output would only involve stlr release stores, but nothing that would ever stop it from reordering with later loads. (An ldar seq_cst load, or a full barrier instruction). ARMv8 seq_cst is generally described as only being as strong as ISO C++ seq_cst requires, not stronger / more expensive, and with ARMv8.3 ldapr, acq_rel is similarly just enough. (Although unlike PowerPC, it's guaranteed not to do IRIW reordering: different readers will never disagree on the order of stores done by two other cores.)Heise

© 2022 - 2024 — McMap. All rights reserved.