Acquire/Release versus Sequentially Consistent memory order

For any std::atomic<T> where T is a primitive type:

If I blindly use std::memory_order_acq_rel for fetch_xxx operations, std::memory_order_acquire for load operations, and std::memory_order_release for store operations (that is, simply replacing the default memory ordering of those functions - see the sketch below):

  • Will the results be the same as if I had used std::memory_order_seq_cst (the default) for all of those operations?
  • If the results are the same, is this usage any different from using std::memory_order_seq_cst in terms of efficiency?
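
For illustration, the substitution I mean looks something like this (counter and flag are just placeholder variables):

#include <atomic>

std::atomic<int> counter{0};
std::atomic<bool> flag{false};

void defaults() {
    counter.fetch_add(1);   // defaults to std::memory_order_seq_cst
    flag.store(true);       // seq_cst
    bool f = flag.load();   // seq_cst
    (void)f;
}

void replaced() {
    counter.fetch_add(1, std::memory_order_acq_rel);
    flag.store(true, std::memory_order_release);
    bool f = flag.load(std::memory_order_acquire);
    (void)f;
}
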
Maltese answered 13/2, 2013 at 19:50 Comment(3)
It depends on what the underlying hardware has to offer. If you don't know specifically how that works, and aren't prepared to optimize for it, the defaults are probably ok. On common x86 systems there will be very little difference, if any.Hackney
@Bo Persson on x86, gcc inserts a full MFENCE after a seq_cst store. That makes it significantly slower.Cyanocobalamin
seq_cst pure-stores are slower on some ISAs, notably x86, because it has to prevent StoreLoad reordering, which doesn't matter for most synchronization. Separately, see also Will two atomic writes to different locations in different threads always be seen in the same order by other threads? - anything weaker than seq_cst allows IRIW reordering, which is something POWER hardware can actually do (but not much if any other mainstream CPUs). Most other ISAs always agree on store order.Harvester

The C++11 memory ordering parameters for atomic operations specify constraints on the ordering. If you do a store with std::memory_order_release, and a load from another thread reads that value with std::memory_order_acquire, then subsequent read operations from the second thread will see any values stored to any memory location by the first thread prior to the store-release, or a later store to any of those memory locations.
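
For example, a minimal two-thread sketch of that guarantee (data and ready are illustrative names):

#include <atomic>
#include <cassert>
#include <thread>

int data = 0;                      // plain, non-atomic data
std::atomic<bool> ready{false};

void writer() {
    data = 42;                                     // written before the release store
    ready.store(true, std::memory_order_release);
}

void reader() {
    while (!ready.load(std::memory_order_acquire))
        ;                                          // spin until the store is visible
    assert(data == 42);  // guaranteed: the acquire load synchronized with the release store
}

int main() {
    std::thread t1(writer), t2(reader);
    t1.join();
    t2.join();
}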

If both the store and subsequent load are std::memory_order_seq_cst then the relationship between these two threads is the same. You need more threads to see the difference.

e.g. std::atomic<int> variables x and y, both initially 0.

Thread 1:

x.store(1,std::memory_order_release);

Thread 2:

y.store(1,std::memory_order_release);

Thread 3:

int a=x.load(std::memory_order_acquire); // x before y
int b=y.load(std::memory_order_acquire); 

Thread 4:

int c=y.load(std::memory_order_acquire); // y before x
int d=x.load(std::memory_order_acquire);

As written, there is no relationship between the stores to x and y, so it is quite possible to see a==1, b==0 in thread 3, and c==1 and d==0 in thread 4.

If all the memory orderings are changed to std::memory_order_seq_cst then this enforces a single total order on the stores to x and y. Consequently, if thread 3 sees a==1 and b==0, the store to x must come before the store to y. So if thread 4 sees c==1, meaning the store to y has completed, the store to x must also have completed, and we must have d==1.
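
Putting the four threads above into a runnable sketch (the outcome a==1 && b==0 && c==1 && d==0 is permitted by acquire/release, though on strongly ordered hardware such as x86 you may never actually observe it, and many iterations are usually needed either way):

#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> x{0}, y{0};
int a, b, c, d;

int main() {
    std::thread t1([] { x.store(1, std::memory_order_release); });
    std::thread t2([] { y.store(1, std::memory_order_release); });
    std::thread t3([] {
        a = x.load(std::memory_order_acquire); // x before y
        b = y.load(std::memory_order_acquire);
    });
    std::thread t4([] {
        c = y.load(std::memory_order_acquire); // y before x
        d = x.load(std::memory_order_acquire);
    });
    t1.join(); t2.join(); t3.join(); t4.join();
    if (a == 1 && b == 0 && c == 1 && d == 0)            // allowed with acquire/release,
        std::puts("threads 3 and 4 disagreed on order"); // forbidden with seq_cst everywhere
}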

In practice, using std::memory_order_seq_cst everywhere will add additional overhead to either loads or stores or both, depending on your compiler and processor architecture. e.g. a common technique for x86 processors is to use XCHG instructions rather than MOV instructions for std::memory_order_seq_cst stores, in order to provide the necessary ordering guarantees, whereas for std::memory_order_release a plain MOV will suffice. On systems with more relaxed memory architectures the overhead may be greater, since plain loads and stores have fewer guarantees.
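
As a sketch of that difference (which instruction the compiler actually emits varies; e.g. GCC has historically used MOV plus MFENCE for the seq_cst store, while Clang and MSVC use XCHG):

#include <atomic>

std::atomic<int> v{0};

void store_seq_cst() {
    v.store(1, std::memory_order_seq_cst); // x86: typically XCHG, or MOV followed by MFENCE
}

void store_release() {
    v.store(1, std::memory_order_release); // x86: a plain MOV is enough
}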

Memory ordering is hard. I devoted almost an entire chapter to it in my book.

Nadbus answered 13/2, 2013 at 22:29 Comment(18)
I was looking forward to an example of a failure, thanks for the answer.Maltese
"If all the memory orderings are changed to std::memory_order_seq_cst then this enforces an ordering between the stores to x and y" - is it possible to achieve same effect by setting only some of orderings to seq_cst? like y.storeCorr
No. The "single total order" constraint only applies to memory_order_seq_cst operations. Operations with other memory orderings are not included, and can therefore appear in different orders in different threads, provided any other constraints are satisfied.Nadbus
@Anthony Williams Do you know why x86 systems use XCHG instead of MOV with std::memory_order_seq_cst - to lock the ring bus (modified QPI) that distributes changes between the different segments of the L3 cache? Such changes are propagated Core0<->Core1<->Core2<->Core3<->Core0, so adjacent cores obtain changes faster than distant ones. XCHG locks the ring bus (modified QPI) until the data from a single core has propagated to all segments of the L3 cache for the other cores.Zigzag
@Anthony Williams Only after that can the data from the next core propagate to all segments of L3. lostcircuits.com/mambo//… The same applies to different CPUs (connected by QPI) in a NUMA system - XCHG stalls all cores in all CPUs that try to access L3/RAM. In contrast, an x86 MOV locks only one cache line through the MOESI/MESIF cache coherence protocol. This is a very big difference in performance for large multiprocessor systems.Zigzag
Can there be an example of a non-sequentially-consistent execution involving only acquire and release and only two threads?Terri
Anthony, and that's exactly the problem; I am here because the memory ordering enum values in your book are used way before (if at all) they are properly explained... And let me tell you - it's no fun to study anything when the code uses stuff that will (probably) be explained much later, without clearly stating that it will.Shebat
@Shebat I totally agree with your point. Anthony did a great job in his book, but it's hard to understand due to some bad organization.Refrigeration
Can I just say that, in seq_cst tagging, every atomic operation will 'virtually' go to a global table to register its operation, and every other atomic operation will see the same ordering listed in that table. But in acquire/release tagging, a release atomic operation on 'x' (virtually) registers its operation in a table, which is only seen by a thread using an acquire atomic operation on 'x'. Am I right?Interpenetrate
It's not that simple, sadly. For seq_cst, yes you can imagine a single global table. For acquire/release it's far more complicated. If an acquire load sees the value written by a release store, then everything visible to the thread that did the store, at the point of the store, is now visible to the thread doing the load. If the acquire load did not see the value written by the release (and there is no guarantee that it will), then no ordering or visibility is guaranteed.Nadbus
@AnthonyWilliams With std::memory_order_seq_cst I need a clarification. When you say "so if thread 4 sees c==1, meaning the store to y has completed, then the store to x must also have completed, so we must have d==1": what I see in the given example is that two threads are doing stores to x & y. So although the memory order is seq_cst, there could be an interleaving between threads. In that case, is it guaranteed that when c==1 then d must == 1? What happens if the interleaving is T2:y.store(1) T4:y.load() T4:x.load() T1:x.store(1), ending in c==1, d==0?Safelight
"then subsequent read operations from the second thread will see any values stored to any memory location by the first thread that were prior to the store-release, or a later store to any of those memory locations" - sorry, what does this part refer to: "or a later store to any of those memory locations"?Foredoom
@Foredoom You can view it as history. Values seen in the past are in the past; the state might have changed since. What can't happen is going from the future to the past. It's like reading the time: you can see a more recent time when you read it twice, but not an older time. If you read it fast you often see the same time.Justice
As far as I can see, this talk by Herb Sutter gives an example at 57:25 which directly contradicts this answer.Orleans
I've created a simple version of the example given in this answer using store-release and load-acquire. When I run it I never encounter the situation where a==1, b==0 and c==1, d==0, which suggests that there actually is a relationship between the stores to X and Y. I know I cannot prove anything by example, only disprove, but I have a feeling this answer is wrong.Orleans
Compilers and processors can strengthen the guarantees they provide: it is entirely legal for a compiler to ignore the memory ordering parameters and always provide memory_order_seq_cst.Nadbus
In Herb's talk he is referring to Sequentially Consistent operations (everything with memory_order_seq_cst). His example is correct, but so is my answer here: the slackening of the memory ordering to memory_order_acquire is what allows the reordering.Nadbus
Yes, my example is not valid. I just ran the same code but with memory_order_relaxed and I still observed no reordering, so it seems my CPU has a strong memory model and I cannot test this. Also, you are correct that Herb is talking about sequential consistency in the video; I totally misinterpreted that part. Sorry for the confusion and thanks for the clarification!Orleans

Memory ordering can be quite tricky, and the effects of getting it wrong are often very subtle.

The key point with all memory ordering is that it guarantees what "HAS HAPPENED", not what is going to happen. For example, if you store something to a couple of variables (e.g. x = 7; y = 11;), then another processor may be able to see y as 11 before it sees the value 7 in x. By using a memory ordering operation between setting x and setting y, the processor you are using will guarantee that x = 7; has been written to memory before it continues to store something in y.

Most of the time, it's not REALLY important in which order your writes happen, as long as the value is updated eventually. But if we, say, have a circular buffer of integers, and we do something like:

buffer[index] = 32;
index = (index + 1) % buffersize;

and some other thread is using index to determine that the new value has been written, then we NEED to have 32 written FIRST, then index updated AFTER. Otherwise, the other thread may get old data.
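
A sketch of that circular-buffer handoff with explicit orderings (single producer, single consumer assumed; full/empty handling omitted):

#include <atomic>

constexpr int buffersize = 16;
int buffer[buffersize];
std::atomic<int> index{0};   // next slot the producer will write

void produce(int value) {
    int i = index.load(std::memory_order_relaxed);  // only the producer writes index
    buffer[i] = value;                              // write the data FIRST...
    index.store((i + 1) % buffersize,               // ...then publish: the release keeps
                std::memory_order_release);         //    the data write ordered before it
}

bool try_consume(int last_seen, int* out) {
    // The acquire pairs with the producer's release: if we see the updated
    // index we are guaranteed to also see the data written before it.
    if (index.load(std::memory_order_acquire) == last_seen)
        return false;                               // nothing new yet
    *out = buffer[last_seen];                       // slot last_seen has been published
    return true;
}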

The same applies to making semaphores, mutexes and such things work - this is why the terms release and acquire are used for the memory barrier types.

Now, seq_cst is the strictest ordering rule - it enforces that both reads and writes of the data you've touched go out to memory before the processor can continue with more operations. This will be slower than using the specific acquire or release barriers, since it forces the processor to make sure stores AND loads have been completed, as opposed to just stores or just loads.

How much difference does that make? It is highly dependent on the system architecture. On some systems, the cache needs to be flushed [partially] and interrupts sent from one core to another to say "Please do this cache-flushing work before you continue" - this can take several hundred cycles. On other processors, it's only a small percentage slower than a regular memory write. X86 is pretty good at doing this fast. Some types of embedded processors (some ARM models, for example) require a bit more work in the processor to ensure everything works.

Ambrose answered 13/2, 2013 at 20:07 Comment(2)
When you say "goes out to memory", should we understand that as from the CPU caches to main memory?Syllogism
Yes @Guillaume ParisIngram
