Acquire/Release versus Sequentially Consistent memory order

For any std::atomic<T> where T is a primitive type:

If I blindly use std::memory_order_acq_rel for fetch_xxx operations, std::memory_order_acquire for load operations, and std::memory_order_release for store operations (that is, simply replacing the default memory ordering of those functions - see the sketch below):

  • Will the results be the same as if I had used std::memory_order_seq_cst (the default) for all of those operations?
  • If the results are the same, is this usage any different from using std::memory_order_seq_cst in terms of efficiency?
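
For illustration, the substitution I mean looks something like this (counter and flag are just placeholder variables):

#include <atomic>

std::atomic<int> counter{0};
std::atomic<bool> flag{false};

void defaults() {
    counter.fetch_add(1);   // defaults to std::memory_order_seq_cst
    flag.store(true);       // seq_cst
    bool f = flag.load();   // seq_cst
    (void)f;
}

void replaced() {
    counter.fetch_add(1, std::memory_order_acq_rel);
    flag.store(true, std::memory_order_release);
    bool f = flag.load(std::memory_order_acquire);
    (void)f;
}
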
Maltese answered 13/2, 2013 at 19:50 Comment(3)
It depends on what the underlying hardware has to offer. If you don't know specifically how that works, and aren't prepared to optimize for it, the defaults are probably ok. On common x86 systems there will be very little difference, if any.Hackney
@Bo Persson on x86, gcc inserts a full MFENCE after a seq_cst store. That makes it significantly slower.Cyanocobalamin
seq_cst pure-stores are slower on some ISAs, notably x86, because it has to prevent StoreLoad reordering, which doesn't matter for most synchronization. Separately, see also Will two atomic writes to different locations in different threads always be seen in the same order by other threads? - anything weaker than seq_cst allows IRIW reordering, which is something POWER hardware can actually do (but not much if any other mainstream CPUs). Most other ISAs always agree on store order.Harvester

The C++11 memory ordering parameters for atomic operations specify constraints on the ordering. If you do a store with std::memory_order_release, and a load from another thread reads that value with std::memory_order_acquire, then subsequent read operations from the second thread will see any values stored to any memory location by the first thread prior to the store-release, or a later store to any of those memory locations.
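
For example, a minimal two-thread sketch of that guarantee (data and ready are illustrative names):

#include <atomic>
#include <cassert>
#include <thread>

int data = 0;                      // plain, non-atomic data
std::atomic<bool> ready{false};

void writer() {
    data = 42;                                     // written before the release store
    ready.store(true, std::memory_order_release);
}

void reader() {
    while (!ready.load(std::memory_order_acquire))
        ;                                          // spin until the store is visible
    assert(data == 42);  // guaranteed: the acquire load synchronized with the release store
}

int main() {
    std::thread t1(writer), t2(reader);
    t1.join();
    t2.join();
}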

If both the store and subsequent load are std::memory_order_seq_cst then the relationship between these two threads is the same. You need more threads to see the difference.

e.g. std::atomic<int> variables x and y, both initially 0.

Thread 1:

x.store(1,std::memory_order_release);

Thread 2:

y.store(1,std::memory_order_release);

Thread 3:

int a=x.load(std::memory_order_acquire); // x before y
int b=y.load(std::memory_order_acquire); 

Thread 4:

int c=y.load(std::memory_order_acquire); // y before x
int d=x.load(std::memory_order_acquire);

As written, there is no relationship between the stores to x and y, so it is quite possible to see a==1, b==0 in thread 3, and c==1 and d==0 in thread 4.

If all the memory orderings are changed to std::memory_order_seq_cst then this enforces a single total order on the stores to x and y. Consequently, if thread 3 sees a==1 and b==0, the store to x must come before the store to y. So if thread 4 sees c==1, meaning the store to y has completed, the store to x must also have completed, and we must have d==1.
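
Putting the four threads above into a runnable sketch (the outcome a==1 && b==0 && c==1 && d==0 is permitted by acquire/release, though on strongly ordered hardware such as x86 you may never actually observe it, and many iterations are usually needed either way):

#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> x{0}, y{0};
int a, b, c, d;

int main() {
    std::thread t1([] { x.store(1, std::memory_order_release); });
    std::thread t2([] { y.store(1, std::memory_order_release); });
    std::thread t3([] {
        a = x.load(std::memory_order_acquire); // x before y
        b = y.load(std::memory_order_acquire);
    });
    std::thread t4([] {
        c = y.load(std::memory_order_acquire); // y before x
        d = x.load(std::memory_order_acquire);
    });
    t1.join(); t2.join(); t3.join(); t4.join();
    if (a == 1 && b == 0 && c == 1 && d == 0)            // allowed with acquire/release,
        std::puts("threads 3 and 4 disagreed on order"); // forbidden with seq_cst everywhere
}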

In practice, using std::memory_order_seq_cst everywhere will add additional overhead to either loads or stores or both, depending on your compiler and processor architecture. e.g. a common technique for x86 processors is to use XCHG instructions rather than MOV instructions for std::memory_order_seq_cst stores, in order to provide the necessary ordering guarantees, whereas for std::memory_order_release a plain MOV will suffice. On systems with more relaxed memory architectures the overhead may be greater, since plain loads and stores have fewer guarantees.
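
As a sketch of that difference (which instruction the compiler actually emits varies; e.g. GCC has historically used MOV plus MFENCE for the seq_cst store, while Clang and MSVC use XCHG):

#include <atomic>

std::atomic<int> v{0};

void store_seq_cst() {
    v.store(1, std::memory_order_seq_cst); // x86: typically XCHG, or MOV followed by MFENCE
}

void store_release() {
    v.store(1, std::memory_order_release); // x86: a plain MOV is enough
}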

Memory ordering is hard. I devoted almost an entire chapter to it in my book.

Nadbus answered 13/2, 2013 at 22:29 Comment(18)
I was looking forward to an example of a failure, thanks for the answer.Maltese
"If all the memory orderings are changed to std::memory_order_seq_cst then this enforces an ordering between the stores to x and y" - is it possible to achieve same effect by setting only some of orderings to seq_cst? like y.storeCorr
No. The "single total order" constraint only applies to memory_order_seq_cst operations. Operations with other memory orderings are not included, and can therefore appear in different orders in different threads, provided any other constraints are satisfied.Nadbus
@Anthony Williams Do you know why x86 systems use XCHG instead of MOV with std::memory_order_seq_cst - to lock the ring bus (modified QPI) that distributes changes between the different segments of the L3 cache? Such changes are propagated Core0<->Core1<->Core2<->Core3<->Core0, so adjacent cores obtain changes faster than distant ones. XCHG locks the ring bus (modified QPI) until the data from a single core has propagated to all segments of the L3 cache for the other cores.Zigzag
@Anthony Williams Only after that can the data from the next core propagate to all segments of L3. lostcircuits.com/mambo//… The same applies to different CPUs (connected by QPI) in a NUMA system - XCHG stalls all cores in all CPUs that try to access L3/RAM. In contrast, an x86 MOV locks only one cache line through the MOESI/MESIF cache coherence protocol. This is a very big difference in performance for large multiprocessor systems.Zigzag
Can there be an example of a non-sequentially-consistent execution involving only acquire and release and only two threads?Terri
Anthony, and that's exactly the problem; I am here because the memory ordering enum values in your book are used way before (if at all) they are properly explained... And let me tell you - it's no fun to study anything when the code uses stuff that will (probably) be explained much later, without clearly stating that it will.Shebat
@Shebat I totally agree with your point. Anthony did a great job in his book, but it's hard to understand due to some bad organization.Refrigeration
Can I just say that, in seq_cst tagging, every atomic operation will 'virtually' go to a global table to register its operation, and every other atomic operation will see the same ordering listed in that table. But in acquire/release tagging, a release atomic operation on 'x' (virtually) registers its operation in a table, which is only seen by a thread using an acquire atomic operation on 'x'. Am I right?Interpenetrate
It's not that simple, sadly. For seq_cst, yes you can imagine a single global table. For acquire/release it's far more complicated. If an acquire load sees the value written by a release store, then everything visible to the thread that did the store, at the point of the store, is now visible to the thread doing the load. If the acquire load did not see the value written by the release (and there is no guarantee that it will), then no ordering or visibility is guaranteed.Nadbus
@AnthonyWilliams With std::memory_order_seq_cst I need a clarification. When you say "so if thread 4 sees c==1, meaning the store to y has completed, then the store to x must also have completed, so we must have d==1": what I see in the given example is that two threads are doing stores to x & y. So although the memory order is seq_cst, there could be an interleaving between threads. In that case, is it guaranteed that when c==1 then d must == 1? What happens if the interleaving is T2:y.store(1) T4:y.load() T4:x.load() T1:x.store(1), ending in c==1, d==0?Safelight
"then subsequent read operations from the second thread will see any values stored to any memory location by the first thread that were prior to the store-release, or a later store to any of those memory locations" - sorry, what does this part refer to: "or a later store to any of those memory locations"?Foredoom
@Foredoom You can view it as history. Values seen in the past are in the past; the state might have changed since. What can't happen is going from the future to the past. It's like reading the time: you can see a more recent time when you read it twice, but not an older time. If you read it fast you often see the same time.Justice
As far as I can see, this talk by Herb Sutter gives an example at 57:25 which directly contradicts this answer.Orleans
I've created a simple version of the example given in this answer using store-release and load-acquire. When I run it I never encounter the situation where a==1, b==0 and c==1, d==0, which suggests that there actually is a relationship between the stores to X and Y. I know I cannot prove anything by example, only disprove, but I have a feeling this answer is wrong.Orleans
Compilers and processors can strengthen the guarantees they provide: it is entirely legal for a compiler to ignore the memory ordering parameters and always provide memory_order_seq_cst.Nadbus
In Herb's talk he is referring to Sequentially Consistent operations (everything with memory_order_seq_cst). His example is correct, but so is my answer here: the slackening of the memory ordering to memory_order_acquire is what allows the reordering.Nadbus
Yes, my example is not valid. I just ran the same code but with memory_order_relaxed and I still observed no reordering, so it seems my CPU has a strong memory model and I cannot test this. Also, you are correct that Herb is talking about sequential consistency in the video; I totally misinterpreted that part. Sorry for the confusion and thanks for the clarification!Orleans

Memory ordering can be quite tricky, and the effects of getting it wrong are often very subtle.

The key point with all memory ordering is that it guarantees what "HAS HAPPENED", not what is going to happen. For example, if you store something to a couple of variables (e.g. x = 7; y = 11;), then another processor may be able to see y as 11 before it sees the value 7 in x. By using a memory ordering operation between setting x and setting y, the processor you are using will guarantee that x = 7; has been written to memory before it continues to store something in y.

Most of the time, it's not REALLY important in which order your writes happen, as long as the value is updated eventually. But if we, say, have a circular buffer of integers, and we do something like:

buffer[index] = 32;
index = (index + 1) % buffersize;

and some other thread is using index to determine that the new value has been written, then we NEED to have 32 written FIRST, then index updated AFTER. Otherwise, the other thread may get old data.
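
A sketch of that circular-buffer handoff with explicit orderings (single producer, single consumer assumed; full/empty handling omitted):

#include <atomic>

constexpr int buffersize = 16;
int buffer[buffersize];
std::atomic<int> index{0};   // next slot the producer will write

void produce(int value) {
    int i = index.load(std::memory_order_relaxed);  // only the producer writes index
    buffer[i] = value;                              // write the data FIRST...
    index.store((i + 1) % buffersize,               // ...then publish: the release keeps
                std::memory_order_release);         //    the data write ordered before it
}

bool try_consume(int last_seen, int* out) {
    // The acquire pairs with the producer's release: if we see the updated
    // index we are guaranteed to also see the data written before it.
    if (index.load(std::memory_order_acquire) == last_seen)
        return false;                               // nothing new yet
    *out = buffer[last_seen];                       // slot last_seen has been published
    return true;
}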

The same applies to making semaphores, mutexes and such things work - this is why the terms release and acquire are used for the memory barrier types.

Now, seq_cst is the strictest ordering rule - it enforces that both reads and writes of the data you've touched go out to memory before the processor can continue with more operations. This will be slower than using the specific acquire or release barriers, since it forces the processor to make sure stores AND loads have been completed, as opposed to just stores or just loads.

How much difference does that make? It is highly dependent on the system architecture. On some systems, the cache needs to be flushed [partially] and interrupts sent from one core to another to say "Please do this cache-flushing work before you continue" - this can take several hundred cycles. On other processors, it's only a small percentage slower than a regular memory write. X86 is pretty good at doing this fast. Some types of embedded processors (some ARM models, for example) require a bit more work in the processor to ensure everything works.

Ambrose answered 13/2, 2013 at 20:07 Comment(2)
When you say "goes out to memory", should we understand that as from the CPU caches to main memory?Syllogism
Yes @Guillaume ParisIngram
