Firstly...
Things to ignore:

`memory_order_consume` - apparently no major compiler implements it; they all silently replace it with the stronger `memory_order_acquire`. Even the standard itself says to avoid it. A big part of the cppreference article on memory orders deals with 'consume', so dropping it simplifies things a lot. It also lets you ignore related features like `[[carries_dependency]]` and `std::kill_dependency`.
Data race: Writing to a non-atomic variable from one thread and reading/writing to it from a different thread, if nothing prevents the two actions from happening at the same time, is called a data race, and causes undefined behavior.
'Race condition' is not a synonym; it's just a loose term for an intermittent bug that depends on how fast the threads happen to execute. (The buggy program may or may not also have a data race.)
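To make this concrete, here's a minimal sketch (my own illustration, not from the standard) of a data race and of atomic accesses that avoid it:

```cpp
#include <atomic>
#include <thread>

int plain_counter = 0;               // non-atomic
std::atomic<int> atomic_counter{0};  // atomic

int main()
{
    auto work = []{
        plain_counter++;              // data race: unsynchronized writes from two threads, UB
        atomic_counter.fetch_add(1);  // fine: atomic accesses never cause data races
    };
    std::thread t1(work), t2(work);
    t1.join();
    t2.join();
}
```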
Memory orders:
`memory_order_relaxed` is the weakest and supposedly the fastest memory order.

Any reads/writes to atomics can't cause data races (and the resulting UB). `relaxed` provides just this minimal guarantee, for a single variable. It doesn't provide any guarantees for other variables (atomic or not).
All threads agree on the order of operations on every particular atomic variable. But this holds only for individual variables. If other variables (atomic or not) are involved, threads might disagree on how exactly the operations on the different variables are interleaved.

It's as if relaxed operations propagate between threads with slight unpredictable delays.

This means that you can't use relaxed atomic operations to judge when it's safe to access other non-atomic memory (i.e. you can't synchronize access to it with them).
By "threads agree on the order" I mean that:
- Each thread will access each separate variable in the exact order you tell it to. E.g. `a.store(1, relaxed); a.store(2, relaxed);` will write `1`, then `2`, never in the opposite order. But accesses to different variables in the same thread can still be reordered relative to each other (see the litmus test below).
- If thread A writes to a variable several times, and thread B then reads it several times, B will receive the values in the same order (but of course it can read some values several times, or skip some, if you don't synchronize the threads in other ways).
- No other guarantees are given.
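For instance, here's the classic store-reordering litmus test (a sketch of my own; the surprising outcome is rarely observed in practice, but the standard allows it):

```cpp
#include <atomic>
#include <thread>

std::atomic<int> x{0}, y{0};

int main()
{
    std::thread t1([]{
        x.store(1, std::memory_order_relaxed);
        y.store(1, std::memory_order_relaxed);
    });
    std::thread t2([]{
        int r1 = y.load(std::memory_order_relaxed);
        int r2 = x.load(std::memory_order_relaxed);
        // r1 == 1 && r2 == 0 is allowed: t2 may observe the second store
        // without the first, since relaxed promises no cross-variable order.
        (void)r1; (void)r2;
    });
    t1.join();
    t2.join();
}
```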
Example uses: Anything that doesn't try to use an atomic variable to synchronize access to non-atomic data: various counters (that exist for informational purposes only), or 'stop flags' to signal other threads to stop. Another example: operations on `shared_ptr`s that increment the reference count internally use `relaxed`.
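A minimal stop-flag sketch (my own illustration):

```cpp
#include <atomic>
#include <chrono>
#include <thread>

std::atomic<bool> stop_flag{false};

void worker()
{
    // The flag guards no other data, so relaxed is enough.
    while (!stop_flag.load(std::memory_order_relaxed))
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
}

int main()
{
    std::thread t(worker);
    stop_flag.store(true, std::memory_order_relaxed);
    t.join();
}
```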
Fences: `atomic_thread_fence(relaxed);` does nothing.
`memory_order_release`, `memory_order_acquire` do everything `relaxed` does, and more (so they're supposedly slower or equivalent).

Only stores (writes) can use `release`. Only loads (reads) can use `acquire`. Read-modify-write operations such as `fetch_add` can be both (`memory_order_acq_rel`), but they can also be just `release` or just `acquire`.
Those let you synchronize threads:
Let's say thread 1 reads/writes to some memory M (any non-atomic or atomic variables, doesn't matter).
Then thread 1 performs a release store to an atomic variable A, and stops touching memory M.

If thread 2 then performs an acquire load of the same variable A, and the load sees the value stored by thread 1, then the store in thread 1 is said to synchronize with this load in thread 2.

Now thread 2 can safely read/write that memory M (without the data race UB you would have otherwise).
You only synchronize with the latest writer, not preceding writers.
You can chain synchronizations across multiple threads.
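Here's that pattern as code (a sketch of my own; the spin loop guarantees the acquire load eventually sees the release store):

```cpp
#include <atomic>
#include <thread>

int payload = 0;                 // the non-atomic memory M
std::atomic<bool> ready{false};  // the atomic variable A

int main()
{
    std::thread t1([]{
        payload = 42;                                  // touch M...
        ready.store(true, std::memory_order_release);  // ...then release-store to A
    });
    std::thread t2([]{
        while (!ready.load(std::memory_order_acquire)) {}  // acquire-load A until the store is seen
        int x = payload;  // safe: the release store synchronized with our acquire load, x == 42
        (void)x;
    });
    t1.join();
    t2.join();
}
```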
There's a special rule that synchronization propagates across any number of read-modify-write operations, regardless of their memory order. E.g. if thread 1 does `a.store(1, release);`, then thread 2 does `a.fetch_add(2, relaxed);`, then thread 3 does `a.load(acquire)`, then thread 1 successfully synchronizes with thread 3, even though there's a relaxed operation in the middle.

In the above rule, a release operation X, and any subsequent read-modify-write operations on the same variable (stopping at the next non-read-modify-write operation), are called a release sequence headed by X. (So if an acquire reads from any operation in a release sequence, it synchronizes with the head of the sequence.)
If read-modify-write operations are involved, nothing stops you from synchronizing with more than one operation. In the example above, if `fetch_add` was using `acquire` or `acq_rel`, thread 1 would additionally synchronize with it, and conversely, if it used `release` or `acq_rel`, it would additionally synchronize with thread 3.
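That three-thread chain looks like this as code (a sketch of my own; the spin loops pin down which operation reads which value):

```cpp
#include <atomic>
#include <thread>

int data = 0;
std::atomic<int> a{0};

int main()
{
    std::thread t1([]{
        data = 42;
        a.store(1, std::memory_order_release);  // head of the release sequence
    });
    std::thread t2([]{
        while (a.load(std::memory_order_relaxed) != 1) {}  // wait for thread 1's store
        a.fetch_add(2, std::memory_order_relaxed);  // RMW: continues the release sequence
    });
    std::thread t3([]{
        while (a.load(std::memory_order_acquire) != 3) {}  // reads from the fetch_add...
        int x = data;  // ...so it synchronizes with thread 1's store; safe, x == 42
        (void)x;
    });
    t1.join();
    t2.join();
    t3.join();
}
```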
Example use: `shared_ptr` decrements its reference counter using something like `fetch_sub(1, acq_rel)`.
Here's why: imagine that thread 1 reads/writes to `*ptr`, then destroys its copy of `ptr`, decrementing the ref count. Then thread 2 destroys the last remaining pointer, also decrementing the ref count, and then runs the destructor.

Since the destructor in thread 2 is going to access the memory previously accessed by thread 1, the `acq_rel` synchronization in `fetch_sub` is necessary. Otherwise you'd have a data race and UB.
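Here's a sketch of that refcount logic (a hypothetical, heavily simplified control block of my own, not the real `shared_ptr` implementation):

```cpp
#include <atomic>

struct control_block
{
    std::atomic<int> refcount{1};
    // ...the managed object would live here...

    void add_ref()
    {
        // Incrementing publishes nothing to other threads, so relaxed suffices.
        refcount.fetch_add(1, std::memory_order_relaxed);
    }

    void release_ref()
    {
        // acq_rel: each decrement *releases* this thread's accesses to the object,
        // and the final decrement *acquires* them all before running the destructor.
        if (refcount.fetch_sub(1, std::memory_order_acq_rel) == 1)
            delete this;  // we were the last owner; destroying is now safe
    }
};

int main()
{
    auto *cb = new control_block;  // refcount == 1
    cb->add_ref();                 // refcount == 2
    cb->release_ref();             // refcount == 1
    cb->release_ref();             // refcount == 0, deletes the block
}
```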
Fences: Using `atomic_thread_fence`, you can essentially turn relaxed atomic operations into release/acquire operations. A single fence can apply to more than one operation, and/or can be performed conditionally.
If you do a relaxed read (or a read with any other order) from one or more variables, then do `atomic_thread_fence(acquire)` in the same thread, then all those reads count as acquire operations.

Conversely, if you do `atomic_thread_fence(release)`, followed by any number of (possibly relaxed) writes, those writes count as release operations.
An `acq_rel` fence combines the effects of `acquire` and `release` fences.
Of course, code placed between the fence and the affected atomic operation doesn't benefit from the fence (i.e. code after a relaxed read but before the acquire fence, or conversely, after the release fence but before a relaxed write). Without this rule we would have time travel: you could, say, call a release fence at the beginning of a thread and an acquire fence at the end, and thus bless all operations in between, which doesn't make any sense.
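For example, the release/acquire example above could be rewritten with fences (a sketch of my own, equivalent to the earlier version):

```cpp
#include <atomic>
#include <thread>

int payload = 0;
std::atomic<bool> ready{false};

int main()
{
    std::thread t1([]{
        payload = 42;
        std::atomic_thread_fence(std::memory_order_release);  // fence first...
        ready.store(true, std::memory_order_relaxed);         // ...makes this relaxed write act as a release
    });
    std::thread t2([]{
        while (!ready.load(std::memory_order_relaxed)) {}     // relaxed read...
        std::atomic_thread_fence(std::memory_order_acquire);  // ...the fence makes it act as an acquire
        int x = payload;                                      // safe, x == 42
        (void)x;
    });
    t1.join();
    t2.join();
}
```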
How to choose between a relaxed order plus a fence vs a stronger order without a fence? Either works; as noted above, a fence can cover several operations or be applied conditionally.
Similarity with other standard library features:
Several standard library features also cause a similar synchronizes with relationship. E.g. locking a mutex synchronizes with the latest unlock, as if locking was an acquire operation, and unlocking was a release operation.
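E.g. a mutex hand-off behaves like the release/acquire pattern above (a minimal sketch of my own):

```cpp
#include <mutex>
#include <thread>

int shared_value = 0;
std::mutex m;

int main()
{
    std::thread t1([]{
        std::lock_guard<std::mutex> lock(m);
        shared_value = 42;  // writes made while holding the lock...
    });
    std::thread t2([]{
        std::lock_guard<std::mutex> lock(m);
        // ...are visible here if t1 unlocked before we locked:
        // t1's unlock synchronizes with our lock.
        int x = shared_value;
        (void)x;
    });
    t1.join();
    t2.join();
}
```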
`memory_order_seq_cst` does everything `acquire`/`release` do, and more. This is supposedly the slowest order, but also the safest.
`seq_cst` reads count as acquire operations. `seq_cst` writes count as release operations. `seq_cst` read-modify-write operations count as both.
`seq_cst` operations can synchronize with each other, and with acquire/release operations. Beware of the special effects of mixing them (see below).
`seq_cst` is the default order, e.g. given `atomic_int x;`, `x = 1;` does `x.store(1, seq_cst);`.
`seq_cst` has an extra property compared to acquire/release: all threads agree on the order in which all `seq_cst` operations happen. This is unlike the weaker orders, where threads agree only on the order of operations on each individual atomic variable, but not on how the operations are interleaved; see `relaxed` above.
The presence of this global operation order seems to only affect which values you can get from `seq_cst` loads. It doesn't in any way affect non-atomic variables and atomic operations with weaker orders (unless `seq_cst` fences are involved, see below), and by itself it doesn't prevent any data race UB that acquire/release operations wouldn't.
Among other things, this order respects the synchronizes with relationship described for acquire/release above, unless (and this is weird) that synchronization comes from mixing a seq-cst operation with an acquire/release operation (a release write syncing with a seq-cst load, or a seq-cst write syncing with an acquire load). Such a mix essentially demotes the affected seq-cst operation to acquire/release (it may retain some of the seq-cst properties, but you'd better not count on it).
Example use:
```cpp
atomic_bool x = true;
atomic_bool y = true;

// Thread 1:
x.store(false, seq_cst);
if (y.load(seq_cst)) {...}

// Thread 2:
y.store(false, seq_cst);
if (x.load(seq_cst)) {...}
```
Let's say you want at most one thread to be able to enter the `if` body. `seq_cst` allows you to do it. Acquire/release or weaker orders wouldn't be enough here.
Fences: `atomic_thread_fence(seq_cst);` does everything an `acq_rel` fence does, and more.
As you would expect, they bring some seq-cst properties to atomic operations done with weaker orders.
All threads agree on the order of `seq_cst` fences, relative to one another and to any `seq_cst` operations (i.e. `seq_cst` fences participate in the global order of `seq_cst` operations, which was described above).
They essentially prevent atomic operations from being reordered across the fence.
E.g. we can transform the above example to:

```cpp
atomic_bool x = true;
atomic_bool y = true;

// Thread 1:
x.store(false, relaxed);
atomic_thread_fence(seq_cst);
if (y.load(relaxed)) {...}

// Thread 2:
y.store(false, relaxed);
atomic_thread_fence(seq_cst);
if (x.load(relaxed)) {...}
```
Both threads can't enter the `if` at the same time, because that would require reordering a load to before the store, across the fence.
But formally, the standard doesn't describe the fences in terms of reordering. Instead, it just explains how the `seq_cst` fences are placed in the global order of `seq_cst` operations. Let's say:

Thread 1 performs operation A on an atomic variable X, using `seq_cst` order OR a weaker order preceded by a `seq_cst` fence.

Then:

Thread 2 performs operation B on the same atomic variable X, using `seq_cst` order OR a weaker order followed by a `seq_cst` fence.

(Here A and B can be any operations, except they can't both be reads, since then it's impossible to determine which one came first.)

Then the first `seq_cst` operation/fence is ordered before the second `seq_cst` operation/fence.
Then, if you imagine a scenario (e.g. in the example above, both threads entering the `if`) that imposes contradicting requirements on this order, that scenario is impossible.

E.g. in the example above, if the first thread enters the `if`, then the first fence must be ordered before the second one, and vice versa. This means that both threads entering the `if` would lead to a contradiction, and hence isn't allowed.
Interoperation between different memory orders
Summarizing the above:
|              | relaxed write | release write      | seq-cst write      |
|--------------|---------------|--------------------|--------------------|
| relaxed load | -             | -                  | -                  |
| acquire load | -             | synchronizes with  | synchronizes with* |
| seq-cst load | -             | synchronizes with* | synchronizes with  |

\* = The participating seq-cst operation gets a messed-up seq-cst order, effectively being demoted to an acquire/release operation. This is explained above.
Does using a stronger memory order make data transfer between threads faster?
No, it seems not.
Sequential consistency for data-race-free programs

The standard explains that if your program only uses `seq_cst` accesses (and mutexes), and has no data races (which cause UB), then you don't need to think about all the fancy operation reorderings. The program will behave as if only one thread executed at a time, with the threads being unpredictably interleaved.
Various orders of operations
As explained above, there's no single timeline in which things happen in a multithreaded program. The standard defines a bunch of "orders" (or relations) between operations.
Those are just a more formal way to look at what has already been explained above.
A is sequenced before B — both happen in the same thread, and B comes after A.
Side note: you might remember that `i++ + i++` is UB. It's UB because neither `i++` is sequenced before the other (though it's not a data race, because everything happens in the same thread).
A synchronizes with B — This was explained in the "acquire/release" section above. Some library operations are said to synchronize with others: a release (or stronger) write synchronizes with an acquire (or stronger) read of the same variable (if the value the read gets comes from this write and not another one); unlocking a mutex synchronizes with the next lock of it; etc.
A happens before B — This is a combination of 'sequenced before' and 'synchronizes with', spanning across threads. A happens before B if and only if: A is sequenced before B, or A synchronizes with B, or transitively (there's a C such that A happens before C and C happens before B). (This is a slightly simplified definition, see below.)
'Happens before' is used in the definition of 'data race'. If neither A nor B happens before the other, and they are operations on the same non-atomic variable from different threads (at least one of them being a write), then they create a data race and UB.
Variable modification orders — Each individual atomic variable has an associated 'modification order', i.e. the order in which all writes to it happen. (Reads also respect this order, though they're not formally a part of it.)
Variable modification orders are consistent with 'happens before' (implying they are also consistent with the 'sequenced before' order of each individual thread).
Global seq-cst order — one single order in which all seq-cst operations (including fences) happen.
It's consistent with individual variable modification orders, and with sequenced-before.
It's usually consistent with 'happens before', but not when you mix seq-cst and acq/rel operations (see below).
It's also consistent with something called coherence-ordered before, which just looks like an extension of variable modification order that includes reads.
There are in fact several flavors of 'happens before':
Simply happens before — uses the simple definition of 'happens before' I gave above.
Inter-thread happens before — used in the definition of `memory_order_consume`, hence completely useless.
Happens before — a combination of 'simply happens before' and 'inter-thread happens before', hence for all practical purposes is equivalent to 'simply happens before'.
This is what's used in the definition of a data race, as stated above.
Strongly happens before — like 'simply happens before', but excludes mixed synchronization of seq-cst and acq-rel.
Therefore the global seq-cst order is consistent with 'strongly happens before', but not always with '(simply) happens before'.