Consider an atomic read-modify-write operation such as x.exchange(..., std::memory_order_acq_rel). For purposes of ordering with respect to loads and stores to other objects, is this treated as:
1. a single operation with acquire-release semantics?
2. Or, as an acquire load followed by a release store, with the added guarantee that other loads and stores to x will observe both of them or neither?
If it's #2, then although no other operations in the same thread could be reordered before the load or after the store, it leaves open the possibility that they could be reordered in between the two.
As a concrete example, consider:
#include <atomic>
#include <iostream>

std::atomic<int> x, y;

void thread_A() {
    x.exchange(1, std::memory_order_acq_rel);
    y.store(1, std::memory_order_relaxed);
}

void thread_B() {
    // These two loads cannot be reordered
    int yy = y.load(std::memory_order_acquire);
    int xx = x.load(std::memory_order_acquire);
    std::cout << xx << ", " << yy << std::endl;
}
Is it possible for thread_B to output 0, 1?
If the x.exchange() were replaced by x.store(1, std::memory_order_release); then thread_B could certainly output 0, 1. Should the extra implicit load in exchange() rule that out?
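To compare the two variants empirically, a small litmus harness along the following lines could be used. This is only a sketch: run_litmus, writer_exchange and writer_store are hypothetical names that do not appear in the question, and the outcome encoding (index xx * 2 + yy) is an arbitrary choice. A nonzero count for 0, 1 would show that the outcome occurs on that machine and compiler; a zero count proves nothing, since the race window is tiny and some architectures (x86, for instance) do not reorder stores this way in hardware.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>

// Hypothetical helper (not from the question): runs the two-thread test
// `iterations` times and tallies the (xx, yy) pairs the reader observed.
// The tally index is xx * 2 + yy, so index 1 counts the "0, 1" outcome.
template <typename Writer>
std::array<long, 4> run_litmus(Writer writer, int iterations) {
    std::array<long, 4> counts{};
    for (int i = 0; i < iterations; ++i) {
        std::atomic<int> x{0}, y{0};
        int xx = 0, yy = 0;
        std::thread a([&] { writer(x, y); });
        std::thread b([&] {
            // Same loads, in the same order, as thread_B in the question.
            yy = y.load(std::memory_order_acquire);
            xx = x.load(std::memory_order_acquire);
        });
        a.join();
        b.join();
        // Only 0 and 1 are ever stored, so the index stays in range.
        assert((xx == 0 || xx == 1) && (yy == 0 || yy == 1));
        ++counts[static_cast<std::size_t>(xx * 2 + yy)];
    }
    return counts;
}

// thread_A as written in the question.
inline void writer_exchange(std::atomic<int>& x, std::atomic<int>& y) {
    x.exchange(1, std::memory_order_acq_rel);
    y.store(1, std::memory_order_relaxed);
}

// thread_A with the exchange replaced by a plain release store.
inline void writer_store(std::atomic<int>& x, std::atomic<int>& y) {
    x.store(1, std::memory_order_release);
    y.store(1, std::memory_order_relaxed);
}
```

Calling run_litmus(writer_exchange, N) and run_litmus(writer_store, N) for some large N and comparing element 1 of each result gives the number of 0, 1 observations per variant. Starting fresh threads on every iteration keeps the sketch short but makes the interesting interleavings rare; a serious test would reuse long-lived threads and synchronize their start.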
cppreference makes it sound like #1 is the case and 0, 1 is forbidden:
A read-modify-write operation with this memory order is both an acquire operation and a release operation. No memory reads or writes in the current thread can be reordered before or after this store.
But I can't find anything explicit in the standard to support this. Actually the standard says very little about atomic read-modify-write operations at all, except 31.4 (10) in N4860 which is just the obvious property that the read has to read the last value written before the write. So although I hate to question cppreference, I'm wondering if this is actually correct.
I'm also looking at how it's implemented on ARM64. Both gcc and clang compile thread_A as essentially
ldaxr [x]
stlxr #1, [x]
str #1, [y]
(See on godbolt.) Based on my understanding of ARM64 semantics, and some tests (with a load of y instead of a store), I think that the str [y] can become visible before the stlxr [x] (though of course not before the ldaxr). This would make it possible for thread_B to observe 0, 1. So if #1 is true then it would seem that gcc and clang are both wrong, which I hesitate to believe.
Finally, as far as I can tell, replacing memory_order_acq_rel with seq_cst wouldn't change anything about this analysis, since it only adds semantics with respect to other seq_cst operations, and we don't have any here.
I found "What exact rules in the C++ memory model prevent reordering before acquire operations?", which, if I understand it correctly, seems to agree that #2 is correct, and that 0, 1 could be observed. I'd still appreciate confirmation, as well as a check on whether the cppreference quote is actually wrong or if I'm misunderstanding it.
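For contrast, here is the neighbouring case where the standard's answer is not in doubt. If y.store() is made a release store, then an acquire load of y that reads 1 synchronizes with it, the exchange on x happens before the subsequent load of x, and 0, 1 is forbidden. The sketch below encodes that as an assertion; check_release_variant is a hypothetical name, and the function is an illustration of the guarantee, not a test of the relaxed case asked about above.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical helper (not from the question): same two threads, except
// that the store to y uses memory_order_release. Returns how many runs
// observed yy == 1; in each of those runs xx must be 1.
inline long check_release_variant(int iterations) {
    long saw_y = 0;
    for (int i = 0; i < iterations; ++i) {
        std::atomic<int> x{0}, y{0};
        int xx = 0, yy = 0;
        std::thread a([&] {
            x.exchange(1, std::memory_order_acq_rel);
            y.store(1, std::memory_order_release);  // the only change
        });
        std::thread b([&] {
            yy = y.load(std::memory_order_acquire);
            xx = x.load(std::memory_order_acquire);
        });
        a.join();
        b.join();
        if (yy == 1) {
            // The acquire load read from the release store, so the
            // exchange on x happens before the load of x.
            assert(xx == 1);
            ++saw_y;
        }
    }
    return saw_y;
}
```

With the relaxed store from the question there is no such synchronizes-with edge through y, which is why the question comes down to how the exchange itself is ordered against the later store.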
thread_B performs a load on both x and y, but those are separate operations and as such, do not reflect the current state in thread_A. Regardless of any ordering, if x is loaded when thread_A has not done anything yet and y is loaded when thread_A has finished, you can get the 0,1 output – Hexahedron
0,1 is forbidden, but based on how acquire operations enforce ordering, I don't believe 0,1 is possible in this case (edit #2). – Hexahedron
y.store() in A doesn't synchronize with the y.load() in B, because 31.4 (2) only guarantees that if the y.store() were release? And of course if the y.store() were release then there would be no problem and we could definitely not get 0, 1. – Disentomb
y.store in A is a release operation and B loads y=1 (acquire), then there would be a sync-with relationship (based on y) and B could never see x=0. The C++ memory model would then forbid the 0,1 outcome, but it does not because y.store is relaxed. However, based on how acquire operations work, you will not see 0,1 – Hexahedron
0, 1, and I don't see what aspect of acquire operations would rule it out. – Disentomb
0,1 should not be possible if the first operation in each thread is an acquire. It's tricky though because this is not how the standard defines acquire operations, it's just how they work. – Hexahedron
x.exchange() has acquire semantics. I know we can't reorder it with anything that comes after it, but that's irrelevant to the behavior of the program. The question is whether we can reorder y.store() with the store of x.exchange(), and if the latter only has release semantics then its one-way barrier goes the wrong way to forbid such reordering. – Disentomb
y.store and x.exchange can get reordered IF the store does not use release AND the exchange does not use acquire. In your scenario, the exchange includes acquire, so the reordering won't happen. If the exchange uses only release semantics, reordering is possible – Hexahedron
y.store() can be reordered before the entire x.exchange(). I'm asking whether y.store() can be reordered before the store half of the x.exchange(), so that y.store() happens "in the middle of" x.exchange() as it were. You seem to be suggesting that the acquire semantics apply to the entire x.exchange(), not only to the load half of it - if so, why? That's my question. – Disentomb
stlxr access L1d cache directly, instead of just writing a store-buffer entry, would make it non-speculative, but making it drain the OoO back-end seems implausible. Maybe "invalidate queues" as discussed here? Or maybe HW can split. – Docent
ldaxr / stlxr / ldr being reordered. I can try to clean it up and post it later. Haven't been able to do the same with ldaxr / stlxr / str however. – Disentomb