For purposes of ordering, is atomic read-modify-write one operation or two?

Consider an atomic read-modify-write operation such as x.exchange(..., std::memory_order_acq_rel). For purposes of ordering with respect to loads and stores to other objects, is this treated as:

  1. a single operation with acquire-release semantics?

  2. an acquire load followed by a release store, with the added guarantee that other loads and stores to x will observe both of them or neither?

If it's #2, then although no other operations in the same thread could be reordered before the load or after the store, it leaves open the possibility that they could be reordered in between the two.

As a concrete example, consider:

#include <atomic>
#include <iostream>

std::atomic<int> x{0}, y{0};  // zero initial values, as the 0/1 discussion assumes

void thread_A() {
    x.exchange(1, std::memory_order_acq_rel);
    y.store(1, std::memory_order_relaxed);
}

void thread_B() {
    // These two loads cannot be reordered
    int yy = y.load(std::memory_order_acquire);
    int xx = x.load(std::memory_order_acquire);
    std::cout << xx << ", " << yy << std::endl;
}

Is it possible for thread_B to output 0, 1?

If the x.exchange() were replaced by x.store(1, std::memory_order_release); then thread_B could certainly output 0, 1. Should the extra implicit load in exchange() rule that out?
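
For comparison, here is a minimal sketch (hypothetical function name, same globals as above) of that store-only variant of thread_A; nothing orders the relaxed store to y after the release store to x, so thread_B observing 0, 1 is clearly allowed. The question is whether adding the load half back in, as exchange() does, removes that freedom.

void thread_A_store_only() {
    x.store(1, std::memory_order_release);  // release only orders *earlier* accesses before it
    y.store(1, std::memory_order_relaxed);  // may become visible before the store to x
}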

cppreference makes it sound like #1 is the case and 0, 1 is forbidden:

A read-modify-write operation with this memory order is both an acquire operation and a release operation. No memory reads or writes in the current thread can be reordered before or after this store.

But I can't find anything explicit in the standard to support this. Actually, the standard says very little about atomic read-modify-write operations at all, except 31.4 (10) in N4860, which is just the obvious property that the read has to read the last value written before the write. So although I hate to question cppreference, I'm wondering if this is actually correct.

I'm also looking at how it's implemented on ARM64. Both gcc and clang compile thread_A as essentially

ldaxr [x]      // acquire load (exclusive)
stlxr #1, [x]  // release store (exclusive) of 1
str #1, [y]    // plain store of 1

(See on godbolt.) Based on my understanding of ARM64 semantics, and some tests (with a load of y instead of a store), I think that the str [y] can become visible before the stlxr [x] (though of course not before the ldaxr). This would make it possible for thread_B to observe 0, 1. So if #1 is true then it would seem that gcc and clang are both wrong, which I hesitate to believe.

Finally, as far as I can tell, replacing memory_order_acq_rel with seq_cst wouldn't change anything about this analysis, since it only adds semantics with respect to other seq_cst operations, and we don't have any here.


I found What exact rules in the C++ memory model prevent reordering before acquire operations? which, if I understand it correctly, seems to agree that #2 is correct, and that 0, 1 could be observed. I'd still appreciate confirmation, as well as a check on whether the cppreference quote is actually wrong or if I'm misunderstanding it.

Disentomb answered 4/1, 2021 at 18:50 Comment(16)
thread_B performs a load on both x and y, but those are separate operations and as such, do not reflect the current state in thread_A. Regardless of any ordering, if x is loaded when thread_A has not done anything yet and y is loaded when thread_A has finished, you can get the 0,1 output – Hexahedron
@LWimsey: Thanks for the correction, I think it is fixed now? – Disentomb
Without a synchronize-with relationship, the C++ standard does not guarantee that 0,1 is forbidden, but based on how acquire operations enforce ordering, I don't believe 0,1 is possible in this case (edit #2). – Hexahedron
@LWimsey: So the issue is that the y.store() in A doesn't synchronize with the y.load() in B, because 31.4 (2) only guarantees that if the y.store() were release? And of course if the y.store() were release then there would be no problem and we could definitely not get 0, 1. – Disentomb
Yes, if the y.store in A is a release operation and B loads y=1 (acquire), then there would be a sync-with relationship (based on y) and B could never see x=0. The C++ memory model would then forbid the 0,1 outcome, but it does not because y.store is relaxed. However, based on how acquire operations work, you will not see 0,1 – Hexahedron
@LWimsey: That helps, thanks! But I'm confused what you mean about the last part, because as far as I understand the ARM64 implementation, I could see 0, 1, and I don't see what aspect of acquire operations would rule it out. – Disentomb
A load acquire operation acts as a one-way barrier; operations sequenced after it cannot be reordered with it. Based on that, 0,1 should not be possible if the first operation in each thread is an acquire. It's tricky though because this is not how the standard defines acquire operations, it's just how they work. – Hexahedron
@LWimsey: But that's my point: if #2 is correct, then only the load of x.exchange() has acquire semantics. I know we can't reorder it with anything that comes after it, but that's irrelevant to the behavior of the program. The question is whether we can reorder y.store() with the store of x.exchange(), and if the latter only has release semantics then its one-way barrier goes the wrong way to forbid such reordering. – Disentomb
In thread A, y.store and x.exchange can get reordered IF the store does not use release AND the exchange does not use acquire. In your scenario, the exchange includes acquire, so the reordering won't happen. If the exchange uses only release semantics, reordering is possible – Hexahedron
Yeah, I'm not suggesting that y.store() can be reordered before the entire x.exchange(). I'm asking whether y.store() can be reordered before the store half of the x.exchange(), so that y.store() happens "in the middle of" x.exchange() as it were. You seem to be suggesting that the acquire semantics apply to the entire x.exchange(), not only to the load half of it - if so, why? That's my question. – Disentomb
I ignored those details on purpose and that's why this is a comment, not an answer. But in the C++ memory model, atomic operations have no in-between state, observable states are only before or after. – Hexahedron
Regardless of C++ allowing it, hardware may not, IDK. In HW, I thought the store side of an LL/SC was essentially "glued" to the load in the order of operations that this core makes to its coherent L1d cache. But I'm not sure about a mechanism for that. Having stlxr access L1d cache directly, instead of just writing a store-buffer entry, would make it non-speculative, but making it drain the OoO back-end seems implausible. Maybe "invalidate queues" as discussed here? Or maybe HW can split. – Docent
@PeterCordes: I have a test case that I'm pretty sure demonstrates ldaxr / stlxr / ldr being reordered. I can try to clean it up and post it later. Haven't been able to do the same with ldaxr / stlxr / str however. – Disentomb
Ok, I guess my mental model was too simplistic, and some of my previous answers have unfortunately spread misinformation >.< Thanks for confirming with experimental test results, but now that I think carefully about it, it does make some sense. (Although less-than-seq_cst atomics can be pretty mind-bending. Probably a mistake to ever think one fully understands them across ISAs.) – Docent
@PeterCordes: I have posted my test case as an answer, in case you are interested. – Disentomb
Does this answer your question? Partial reordering of C++11 atomics on Aarch64 – Antipersonnel

From the standard's perspective, an RMW operation is a single operation. I'm not sure whether that is stated explicitly anywhere, but it seems to be implied by its name (which is singular) and some related wording.

Purely from the standard's perspective, your code can print 0, 1.

Firstly, the standard isn't worded in terms of operation reordering, but in terms of the synchronizes-with relationship between release and acquire operations.

Since y.load(acquire) doesn't have a matching release-or-stronger store (has nothing to synchronize with), it's as good as y.load(relaxed).

Since the only other operation on x is a load (there is no release store on x for the exchange's acquire side to synchronize with), the "acquire" part of x.exchange(1, acq_rel) doesn't do anything useful (it is effectively relaxed), so the exchange can be replaced with x.store(1, release).

Then the only remaining potential synchronization is between the store and the load on x, but since nothing is sequenced before that store in thread_A (in particular, no write to y whose visibility it could guarantee), this synchronization doesn't constrain the outcome either.

So both loads can return either 0 or 1.
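
To make that chain of substitutions concrete, here is a minimal sketch (my own, with hypothetical names) of the reduced program this answer describes; for the purposes of this analysis it admits the same outcomes as the original:

#include <atomic>

std::atomic<int> x{0}, y{0};

void thread_A_reduced() {
    // The acq_rel exchange has no release store on x to acquire from,
    // so for this analysis it behaves like a plain release store.
    x.store(1, std::memory_order_release);
    y.store(1, std::memory_order_relaxed);
}

void thread_B_reduced() {
    // The acquire load of y has no release store to pair with,
    // so it is effectively relaxed and constrains nothing.
    int yy = y.load(std::memory_order_relaxed);
    int xx = x.load(std::memory_order_acquire);
    // (xx, yy) may be any of (0,0), (1,0), (1,1) or (0,1).
    (void)xx; (void)yy;
}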

Amadus answered 6/10, 2022 at 14:58 Comment(7)
In the general case, replacing x.exchange with x.store also requires that the return value is unused. Otherwise it could potentially matter that it observes the latest value in the modification order. So depending what you do with the return value, and what values this and other writers set, that might limit what you observe more than a separate load and then a store. Hmm, maybe that's a less interesting or more obvious point than I thought: if the return value was used, you'd have to invent a separate load as well as using .store for .exchange, which introduces obvious non-atomicity. – Docent
I like your analysis and it's quite easy to understand. However I'd like to know if you are able to actually output (0, 1) on any machine. – Infancy
@Infancy "on any machine" No idea. You can ask a new question, and hope that Peter Cordes sees it. :) – Amadus
@TSK: Almost certainly not on an x86-like (with strong memory ordering semantics). But on ARM or other weakly ordered memory models? Probably. – Coterminous
@HolyBlackCat, your analysis does not consider the description on cppreference for memory_order_acquire that "no reads or writes in the current thread can be reordered before this load". Assuming that cppreference is correct, it implies that even though there is no release-store operation to synchronise with, the acquire-load on y is not the same as a relaxed-load on y. Because of this, the output 0,1 should not be possible. – Razor
@Razor As I said, the standard isn't worded in terms of reordering, so cppreference's "what can be reordered with what" is a layman explanation at best. This may or may not be good enough to reason about real hardware, but understanding whether it's the case is above my paygrade. – Amadus
@HolyBlackCat, aside from the actual hardware behaviour, wouldn't the guarantee of "what can be reordered with what" be necessary for acquire-load and release-store? Example: Consider an implementation of a lock; it'll use acquire-load semantics for a successful lock and release-store semantics for the unlock. We must guarantee that instructions after the acquire-load are not reordered before it; otherwise, we would have a situation where the code's critical section is reordered before the acquire-load of the lock, which can cause a data race. – Razor
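
To illustrate the lock pattern described in the last comment, here is a minimal spin-lock sketch (my own illustration, not code from this thread): the acquire on the successful exchange keeps the critical section from being hoisted above the point where the lock is taken, and the release on unlock keeps it from sinking below the point where it is dropped.

#include <atomic>

class spinlock {
    std::atomic<bool> locked{false};
public:
    void lock() {
        // Acquire: memory operations in the critical section cannot be
        // reordered before this successful exchange.
        while (locked.exchange(true, std::memory_order_acquire)) {
            // spin until we are the one to flip false -> true
        }
    }
    void unlock() {
        // Release: memory operations in the critical section cannot be
        // reordered after this store.
        locked.store(false, std::memory_order_release);
    }
};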

Not an answer at the level of the language standard, but some evidence that in practice, the answer can be "two". And as I guessed in the question, this can happen even if the RMW is seq_cst.

I haven't been able to observe stores being reordered as in the original question, but here is an example that shows the store of an atomic seq_cst RMW being reordered with a following relaxed load.

The program below is an implementation of Peterson's algorithm adapted from LWimsey's example in What's are practical example where acquire release memory order differs from sequential consistency?. As explained there, the correct version of the algorithm involves

me.store(true, std::memory_order_seq_cst);
if (other.load(std::memory_order_seq_cst) == false) 
    // lock taken

where it is essential that the load become visible after the store.

If RMW were a single operation for the purposes of ordering semantics, we would expect that it would be safe to do

me.exchange(true, std::memory_order_seq_cst);
if (other.load(std::memory_order_relaxed) == false) {
    // Ensure critical section doesn't start until we know we have the lock
    std::atomic_thread_fence(std::memory_order_seq_cst);
    // lock taken
}

on the theory that since the exchange operation has acquire semantics, the load must become visible after the exchange has completed, and in particular after the store of true to me has become visible.

But in fact on ARMv8-a, using either gcc or clang, such code frequently fails. It appears that in fact, exchange does consist of an acquire-load and a release-store, and that other.load may become visible before the release-store. (Though not before the acquire-load of the exchange, but that is irrelevant here.)

clang generates code like the following:

mov w11, #1
retry:
ldaxrb wzr, [me]       // acquire load (exclusive), value discarded
stlxrb w12, w11, [me]  // release store (exclusive) of 1
cbnz w12, retry        // retry if the exclusive store failed
ldrb w11, [other]      // plain (relaxed) load of the other flag

See https://godbolt.org/z/fhjjn7, lines 116-120 of the assembly output. (gcc is the same but buried inside a library function.) By ARM64 memory ordering semantics, the release-store stlxrb can be reordered with following loads and stores. The fact that it's exclusive doesn't change that.

To make the reordering happen more often, we arrange for the data being stored to depend on a previous load that missed cache, which we ensure by evicting that line with dc civac. We also need to put the two flags me and other on separate cache lines. Otherwise, as I understand it, even if thread A does its load before the store, then thread B has to wait to begin its RMW until after A's store completes, and in particular won't do its load until A's store is visible.

On a multi-core Cortex A72 (Raspberry Pi 4B), the assertion typically fails after a few thousand iterations, which is nearly instantaneous.

The code needs to be built with -O2. I suspect it will not work if built for ARMv8.2 or higher, where swpalb is available.

// Based on https://mcmap.net/q/20576/-what-39-s-are-practical-example-where-acquire-release-memory-order-differs-from-sequential-consistency by LWimsey
#include <thread>
#include <atomic>
#include <cassert>
#include <cstddef>   // std::size_t

// size that's at least as big as a cache line
constexpr std::size_t cache_line_size = 256;

static void take_lock(std::atomic<bool> &me, std::atomic<bool> &other) {
    alignas(cache_line_size) bool uncached_true = true;
    for (;;) {
        // Evict uncached_true from cache.
        asm volatile("dc civac, %0" : : "r" (&uncached_true) : "memory");
        
        // So the release store to `me` may be delayed while
        // `uncached_true` is loaded.  This should give the machine
        // time to proceed with the load of `other`, which is not
        // forbidden by the release semantics of the store to `me`.
        
        me.exchange(uncached_true, std::memory_order_seq_cst);
        if (other.load(std::memory_order_relaxed) == false) {
            // taken!
            std::atomic_thread_fence(std::memory_order_seq_cst);
            return;
        }
        // start over
        me.store(false, std::memory_order_seq_cst);
    }
}

static void drop_lock(std::atomic<bool> &me) {
    me.store(false, std::memory_order_seq_cst);
}

alignas(cache_line_size) std::atomic<int> counter{0};

static void critical_section(void) {
    // We should be the only thread inside here.
    int tmp = counter.fetch_add(1, std::memory_order_seq_cst);
    assert(tmp == 0);
    
    // Delay to give the other thread a chance to try the lock
    for (int i = 0; i < 100; i++)
        asm volatile("");
    
    tmp = counter.fetch_sub(1, std::memory_order_seq_cst);
    assert(tmp == 1);
}    

static void busy(std::atomic<bool> *me, std::atomic<bool> *other)
{
    for (;;) {  
        take_lock(*me, *other);
        std::atomic_thread_fence(std::memory_order_seq_cst); // paranoia
        critical_section();
        std::atomic_thread_fence(std::memory_order_seq_cst); // paranoia
        drop_lock(*me);
    }
}


// The two flags need to be on separate cache lines.
alignas(cache_line_size) std::atomic<bool> flag1{false}, flag2{false};

int main()
{
    std::thread t1(busy, &flag1, &flag2);
    std::thread t2(busy, &flag2, &flag1);
    
    t1.join(); // will never happen
    t2.join();
    return 0;
}
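
As a follow-up (my own sketch, not part of the test above): one way to restore the ordering the broken take_lock assumed is to place a seq_cst fence between the exchange and the load of the other flag (or to make that load seq_cst as well), so that the store half of the RMW cannot be reordered past the load. Something like:

me.exchange(true, std::memory_order_seq_cst);
// Full barrier between the RMW and the following load: the store half of the
// exchange can no longer become visible after the load of `other`.
std::atomic_thread_fence(std::memory_order_seq_cst);
if (other.load(std::memory_order_relaxed) == false) {
    // taken!
    return;
}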
Disentomb answered 18/2, 2021 at 18:11 Comment(0)
