Is the transformation of fetch_add(0, memory_order_relaxed/release) to mfence + mov legal?
The paper N4455 No Sane Compiler Would Optimize Atomics talks about various optimizations compilers can apply to atomics. Under the section Optimization Around Atomics, for the seqlock example, it mentions a transformation implemented in LLVM, where a fetch_add(0, std::memory_order_release) is turned into an mfence followed by a plain load, rather than the usual lock add or xadd. The idea is that this avoids taking exclusive access of the cache line and is relatively cheaper. The mfence is still required, regardless of the ordering constraint supplied, to prevent StoreLoad reordering around the plain mov instruction that is generated.

This transformation is performed for such read-don't-modify-write operations regardless of the ordering, and equivalent assembly is produced for fetch_add(0, memory_order_relaxed).
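As a sketch of the operation in question (the function name is mine, not from the paper), this is the C++ source that LLVM may lower to either `lock xadd` or `mfence; mov` on x86-64:

```cpp
#include <atomic>

// Read-don't-modify-write: adds 0, so the stored value never changes,
// but it still counts as an atomic RMW in the C++ abstract machine.
// On x86-64, LLVM may emit `mfence` + plain `mov` for this instead of
// `lock xadd`, avoiding exclusive ownership of the cache line.
int read_dont_modify_write(std::atomic<int>& counter) {
    return counter.fetch_add(0, std::memory_order_release);
}
```

Whichever instruction sequence is emitted, the returned value is the current value in the modification order and the stored value is unchanged.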

However, I am wondering if this is legal. The C++ standard explicitly notes under [atomic.order] that:

Atomic read-modify-write operations shall always read the last value (in the modification order) written before the write associated with the read-modify-write operation.

This fact about RMW operations seeing the 'latest' value has also been noted previously by Anthony Williams.

My question is: Is there a difference in the values the thread could observe from the modification order of the atomic variable, depending on whether the compiler emits a lock add vs. an mfence followed by a plain load? Is it possible for this transformation to cause the thread performing the RMW operation to load values older than the latest one? Does this violate the guarantees of the C++ memory model?

Lentz answered 23/11, 2020 at 21:3 Comment(11)
The 'load release' in the form of fetch_add(0, std::memory_order_release) is correct, and so is the LLVM optimization to 'mfence + load'. For a seqlock reader, it's probably cheaper to avoid the fetch_add overhead entirely and instead use a standalone acquire fence followed by a relaxed load (not a single 'load acquire' operation). This guarantees the desired ordering and is cheaper because it avoids the mfence and/or cache-line locking. It is not equivalent to 'mfence + load', but it works because in that part of the seqlock the reader only performs loads.Ciao
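The reader pattern suggested in this comment can be sketched like this (a minimal, illustrative seqlock; the struct and member names are mine, and a real seqlock would guard a non-atomic payload):

```cpp
#include <atomic>
#include <cstdint>

// Minimal seqlock sketch, single writer assumed (illustrative only).
struct Seqlock {
    std::atomic<uint32_t> seq{0};
    std::atomic<int> data{0};

    void write(int value) {
        uint32_t s = seq.load(std::memory_order_relaxed);
        seq.store(s + 1, std::memory_order_relaxed);   // odd: write in progress
        std::atomic_thread_fence(std::memory_order_release);
        data.store(value, std::memory_order_relaxed);
        seq.store(s + 2, std::memory_order_release);   // even: write done
    }

    int read() {
        for (;;) {
            uint32_t s1 = seq.load(std::memory_order_acquire);
            int v = data.load(std::memory_order_relaxed);
            // Standalone acquire fence instead of fetch_add(0, release):
            // orders the data load before the re-check of seq, with no
            // locked instruction and no mfence needed on x86.
            std::atomic_thread_fence(std::memory_order_acquire);
            uint32_t s2 = seq.load(std::memory_order_relaxed);
            if (s1 == s2 && (s1 & 1) == 0) return v;   // stable, even: success
        }
    }
};
```

The reader performs only loads and a fence, so on x86 it compiles to plain movs; the fence only restrains the compiler.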
Thanks. I'm confused whether RMW is required to read the latest value in the modification order of the variable (and if e.g. buffered stores on x86 constitute the modification order or not). With lock add, the last store made from some other thread would become visible due to the invalidation of the line caused by changing its status to exclusive, but now the plain load could still see an older value not yet visible. Is this discrepancy violating that clause in the standard? The question is about what constitutes the modification order, whose 'latest' value is to be fetched.Lentz
Latest in the modification order is only meaningful in a particular context, which is RMW operations in multiple threads working on the same atomic variable. The load and store parts of an RMW operate on the latest value in the modification order, which guarantees that you won't 'miss' updates. For example, let 10 threads each perform 1 million fetch_add(1) increments on the same variable, and the result is 10 million. Replace that with separate loads and stores and the result is probably less than 10 million (while still being data-race-free).....Ciao
.... For plain loads and stores, the concept of 'latest in the modification order' is not well defined. If thread A performs a store, it does not become visible to a fetch_add in thread B as long as the store buffer in A has not been flushed. The load part of an RMW is very similar to a plain load (both read from the L1 cache), except that the RMW locks the cache line first whereas the plain load does not. If an RMW is used only to load a value, as in fetch_add(0), 'mfence+load' is a correct transformation since it does not change the order of preceding operations wrt this load (including #StoreLoad).Ciao
The comment became too long, but the transformation to 'mfence+load relaxed' only applies to an RMW release operation, i.e. fetch_add(0, mo_release).Ciao
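The lost-update scenario described in the comments above can be made runnable (a sketch; the function name and thread counts are mine):

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Demonstrates why RMWs must read the latest value in the modification
// order: concurrent fetch_add(1) increments are never lost, so the
// final count is exactly num_threads * increments_per_thread.
int count_with_rmw(int num_threads, int increments_per_thread) {
    std::atomic<int> counter{0};
    std::vector<std::thread> threads;
    for (int t = 0; t < num_threads; ++t) {
        threads.emplace_back([&] {
            for (int i = 0; i < increments_per_thread; ++i)
                counter.fetch_add(1, std::memory_order_relaxed);
        });
    }
    for (auto& th : threads) th.join();
    // With a separate load + store instead of fetch_add, two threads
    // could read the same value and both store value+1, losing a count.
    return counter.load();
}
```

Even with memory_order_relaxed, the RMW atomicity guarantee alone makes the result deterministic.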
I see your point, but I still wonder when a normal store participates in the modification order of an object. Is it when the store becomes visible? If so, the standard then mentions that there is a single total order for every object, but on x86 each thread can see its own writes before they are visible to others (sitting in the store buffer), so each thread could observe a different total modification order for the same object when normal stores are involved (due to store-to-load forwarding), and think it was first to modify the object. This would probably answer the original question ...Lentz
... too, as the answer rests on whether the values sitting in the store buffer are already participating in the modification order (as the respective threads can observe them), because then a line-invalidating lock add is still needed for this case so that they become visible to it and the 'latest' value in the order can be fetched, even if it's a fetch_add(0, ...). Otherwise, the only constraint for participation is visibility, and individual RMW operations already ensure it so that future ones in the sequence read the correct value, and hence the transformation would be legal.Lentz
The store buffer does not change the modification order of a single variable. Let's say you have threads A, B and C and one atomic integer foo. For the sake of argument, the store buffer in A takes 10 min to flush, B takes 20 min and C 30 min. At time t=0, A performs store(1), B store(2) and C store(3). After that, they continue to load values from foo. When t<10, all threads load foo from their own store buffer (A=1, B=2, C=3) .....Ciao
.... At t=10, store buffer A gets flushed and now A loads foo from the L1 cache, but still gets 1. Threads B and C continue to perform a store-buffer bypass and load values 2 and 3 respectively. At t=20, store buffer B is flushed, which causes A and B to load 2, while C still loads 3. At t=30, store buffer C is flushed and all threads load 3. From the beginning, A has observed 1,2,3; B has observed 2,3; and C has only observed 3. This is fully compliant with the concept of a single modification order.Ciao
Thanks, I think this answers my question. Just to confirm my understanding, when thread A executes an RMW operation where the visible sequence in the modification order is 1,2 for it at that point in time (2 for B is flushed, and still 3 in store buffer for C), I take it that the RMW operation takes the 'latest' value from the subsequence of the total modification order that is visible currently to A to update the value?Lentz
The platform defines 'latest' as the value currently in memory (= L1 cache, thanks to MESI cache coherency). The RMW in A updates 2->N and eventually, C stores 3. The modification order is now 1, 2, N, 3. You can say that C delays updating the latest in the modification order until its store buffer is flushed.Ciao
(I started writing this a while ago but got stalled; I'm not sure it adds up to a full answer, but thought some of this might be worth posting. I think @LWimsey's comments do a better job of getting to the heart of an answer than what I wrote.)

Yes, it's safe.

Keep in mind that the way the as-if rule applies is that execution on the real machine has to always produce a result that matches one possible execution on the C++ abstract machine. It's legal for optimizations to make some executions that the C++ abstract machine allows impossible on the target. Even compiling for x86 at all makes all IRIW reordering impossible, for example, whether the compiler likes it or not. (See below; some PowerPC hardware is the only mainstream hardware that can do it in practice.)


I think the reason that wording is there for RMWs specifically is that it ties the load to the "modification order" which ISO C++ requires to exist for each atomic object separately. (Maybe.)

Remember that the way C++ formally defines its ordering model is in terms of synchronizes-with, and existence of a modification order for each object (that all threads can agree on). Not like hardware where there is a notion of coherent caches1 creating a single coherent view of memory that each core accesses. The existence of coherent shared memory (typically using MESI to maintain coherence at all times) makes a bunch of things implicit, like the impossibility of reading "stale" values. (Although HW memory models do typically document it explicitly like C++ does).
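As a concrete illustration of synchronizes-with (a sketch; the function name and values are mine): a release store that is observed by an acquire load makes everything sequenced before the store visible to the observing thread, with no appeal to a hardware cache model.

```cpp
#include <atomic>
#include <thread>

// The release store to `flag` synchronizes-with the acquire load that
// reads true, so the plain write to `data` is guaranteed visible to
// the consumer: no data race, deterministic result.
int release_acquire_handoff() {
    int data = 0;                  // plain, non-atomic payload
    std::atomic<bool> flag{false};
    int seen = 0;

    std::thread producer([&] {
        data = 123;                // sequenced before the release store
        flag.store(true, std::memory_order_release);
    });
    std::thread consumer([&] {
        // Spin until the release store is observed.
        while (!flag.load(std::memory_order_acquire)) {}
        seen = data;               // guaranteed to read 123
    });
    producer.join();
    consumer.join();
    return seen;
}
```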

Thus the transformation is safe.

ISO C++ does mention the concept of coherency in a note in another section: http://eel.is/c++draft/intro.races#14

The value of an atomic object M, as determined by evaluation B, shall be the value stored by some side effect A that modifies M, where B does not happen before A.
[Note 14: The set of such side effects is also restricted by the rest of the rules described here, and in particular, by the coherence requirements below. — end note]

...

[Note 19: The four preceding coherence requirements effectively disallow compiler reordering of atomic operations to a single object, even if both operations are relaxed loads. This effectively makes the cache coherence guarantee provided by most hardware available to C++ atomic operations. — end note]

[Note 20: The value observed by a load of an atomic depends on the “happens before” relation, which depends on the values observed by loads of atomics. The intended reading is that there must exist an association of atomic loads with modifications they observe that, together with suitably chosen modification orders and the “happens before” relation derived as described above, satisfy the resulting constraints as imposed here. — end note]

So ISO C++ itself notes that cache coherence gives some ordering, and x86 has coherent caches. (I'm not making a complete argument that this is enough ordering, sorry. LWimsey's comments about what it even means to be the latest in a modification order are relevant.)

(On many ISAs, but not all, the memory model also rules out IRIW reordering when you have stores to 2 separate objects; e.g. on PowerPC, 2 reader threads can disagree about the order of 2 stores to 2 separate objects. Very few implementations can create such reordering: if shared cache is the only way data can get between cores, as on most CPUs, that creates an order for stores.)

Is it possible for this transformation to cause the thread performing the RMW operation to instead load values older than the latest one?

On x86 specifically, it's very easy to reason about. x86 has a strongly-ordered memory model (TSO = Total Store Order = program order + a store buffer with store-forwarding).

Footnote 1: All cores that std::thread can run across have coherent caches. True on all real-world C++ implementations across all ISAs, not just x86-64. There are some heterogeneous boards with separate CPUs sharing memory without cache coherency, but ordinary C++ threads of the same process won't be running across those different cores. See this answer for more details about that.

Dart answered 16/12, 2020 at 4:16 Comment(0)
