memory_order_relaxed and Atomic RMW operations
The C++ Standard says that RMW (Read-Modify-Write) operations on atomics will operate on the latest value of the atomic variable (i.e. the last value in its modification order). Consequently, using memory_order_relaxed with these operations won't weaken the RMW operation itself when it is executed concurrently from multiple threads.

I am assuming that this behavior is possible only if there is some memory barrier or fence in place for RMW operations, even when the memory order specified is "relaxed". Please correct me if my understanding is wrong, and explain how these operations work on the latest value if no such memory barrier is used. If my understanding is correct, can I further assume that using Acquire-Release or Seq-CST memory order should not have an additional performance hit for RMW operations on, say, a weakly ordered architecture like ARM or Alpha? Thanks in advance.

Multivocal answered 6/11, 2017 at 17:29 Comment(17)
Surely the implementation depends on the platform and OS.Ordinarily
Yes it is implementation dependent but how about the second part of my question - would using seq CST or acq-rel with these operations have any additional performance impact as compared to relaxed ordering given the requirements in the standard?Multivocal
I do not know. Have you measured it? Is this a performance problem in your system?Ordinarily
Regardless of memory ordering, RMW's are guaranteed to operate on the latest in the modification order. Memory ordering stronger than 'relaxed' defines how other operations are ordered inter-thread with respect to the RMW itselfSelfsatisfaction
@Ed Heal - Am asking it for general knowledge as we will be porting our code base to ARM v7 processor and I would like to know about the performance impact if any. I haven't done any profiling yet.Multivocal
@Selfsatisfaction - wouldn't the condition that "operate on the latest value" require something like a memory barrier? With a memory barrier in place it would automatically mean that other inter-thread operations would then be automatically ordered w.r.t. the memory barrier, and hence in effect they would be ordered w.r.t. the RMW operation itself. That is why I am curious if Seq-CST or acq-rel will have an additional performance impactMultivocal
@Madhusudhan It's a common mistake, but it would not. Barriers (or fences) preserve program order and make it visible to other CPU's.Selfsatisfaction
@EdHeal You're not going to be able to measure this on a typical PC since x86 has a full memory barrier for all RMW (lock-prefixed) instructions. So there's no difference between sequential consistency and acquire-release. OTOH, relaxed semantics will allow the compiler to move other things across the RMW itself.Brasier
@Selfsatisfaction Barriers in general don't make stuff visible to other CPUs, they prevent some stuff from being visible until other stuff is visible.Bellybutton
@Brasier "wouldn't the condition that "operate on the latest value" require something like a memory barrier" No, as operating on anything else is simply not possible. You can only read the latest value in real time and write the latest value. The CPU doesn't time travel (neither does memory). "Latest" is only defined for one memory location. The "latest value" stuff is meaningless and confusing; it just says that the operation is atomic: it reads the latest value before the modification.Bellybutton
@Bellybutton hmm yes, you can say so.. I don't recall my train of thought back then, but it also depends on whether the mentioned fences were CPU fence instructions or fences in the C++ memory modelSelfsatisfaction
@Selfsatisfaction f.ex. Intel mfence: it can't make writes visible or push them to memory any faster, as the core is always trying to push writes to its cache, from where they are globally visible. But that instruction prevents following reads from getting an older value. (That can be accomplished either by waiting for writes to be in cache or by doing reads ASAP and then double-checking their validity later.)Bellybutton
@Bellybutton That's one way of putting it. mfence flushes the store buffer, so following reads will not be performed until after the stores have become globally visible...Selfsatisfaction
@Selfsatisfaction "flushes the store buffer" seems to suggest that like cout the store buffer needs to be flushed to have a visible effect. That's highly misleading: unlike a file buffer, the CPU store buffer is being flushed right now.Bellybutton
@Bellybutton mfence causes the store buffer to be flushed before following instructions (loads) are performed. That is to prevent stores and loads from being reordered (i.e. a #StoreLoad barrier). If you don't flush the store buffer, the stores will still be committed to memory, but possibly after the loads have been performed, and that is a problem for some algorithms. The Intel manual calls it 'serializing' load and store operationsSelfsatisfaction
@Selfsatisfaction It waits for the buffer to be empty (stores committed) to commit the following load: no loaded value is allowed that isn't up to date, compared to these stores. (But the optimistic core could still do loads early in a speculative execution as long as the values are checked again at the end.) The documentation may suggest that mfence stops execution until stores are "flushed" but I don't think that's what is actually guaranteed in future processors.Bellybutton
You can say it flushes the sb, or it waits until the sb is flushed, but the effect is the same. Here's what the Intel manual says about mfence: It guarantees that all loads and stores specified before the fence are globally observable prior to any loads or stores being carried out after the fence.Selfsatisfaction
This is an unfortunately common misconception about the atomic memory orders. See, those do not (entirely) apply to the actual atomic operation. They apply mainly to other operations around them.

For example:

//accessible from anywhere
std::atomic<bool> flag;
int value = 0;

//code in thread 1:
value = 1;
flag.store(true, <order_write>);

//code in thread 2:
bool true_val = true;
while(!flag.compare_exchange_weak(true_val, false, <order_read>))
    true_val = true; //compare_exchange_weak overwrites true_val with the observed value on failure
int my_val = value;

So, what is this doing? Thread 2 is waiting on thread 1 to signal that value has been updated, then thread 2 reads value.

<order_write> and <order_read> do not govern how the specific atomic variable itself is seen. They govern how other values that were set before/after that atomic operation are seen.

In order for this code to work, <order_write> must use a memory order that is at least as strong as memory_order_release. And <order_read> must use a memory order that is at least as strong as memory_order_acquire.

These memory orders affect how value is transferred (or more specifically, the stuff set before the atomic write).

wouldn't the condition that "operate on the latest value" require something like a memory barrier?

It is unlikely that most architectures implement the actual atomic modification using a global memory barrier. It takes the non-relaxed memory orders to do that: they impose a general memory barrier on the writers and readers.

Atomic operations, if they need a memory barrier to work at all, will typically use a local memory barrier. That is, a barrier specific to the address of the atomic variable.

So it is reasonable to assume that non-relaxed memory orders will hurt performance more than a relaxed memory order. That's not a guarantee of course, but it's a pretty good first-order approximation.

Is it possible for atomic implementations to use a full global memory barrier on any atomic operation? Yes. But if an implementation resorts to that for fundamental atomic types, then the architecture probably has no other choice. So if your algorithm requires atomic operations, you don't really have any other choice.

Golconda answered 6/11, 2017 at 18:1 Comment(4)
I am already aware of the first part of your answer. Perhaps in my question I should have clarified that I meant "the use of a barrier would make relaxed ordering behave similarly to acq-rel". I am however not familiar with a local memory barrier. I will read up more on it and check back here following that.Multivocal
@Madhusudhan In C++ there is no such thing as a local or global memory barrierSelfsatisfaction
@LWimsey: If you want to get technical, there's no such thing as a "memory barrier" of any kind in C++. My terminology here is just to represent the distinction between "the stuff the CPU/compiler has to do to make a specific memory address available" and "the stuff the CPU/compiler has to do to make all memory addresses available." The actual atomic RMW is the former; the memory ordering operations are the latter.Golconda
The C++ standard calls them fencesSelfsatisfaction

© 2022 - 2024 — McMap. All rights reserved.