This is an unfortunately common misconception about the atomic memory orders. They do not (entirely) apply to the atomic operation itself; they apply mainly to other operations around it.
For example:
```cpp
//accessible from anywhere
std::atomic<bool> flag;
int value = 0;

//code in thread 1:
value = 1;
flag.store(true, <order_write>);

//code in thread 2:
bool true_val = true;
while(!flag.compare_exchange_weak(true_val, false, <order_read>))
    true_val = true; //compare_exchange_weak overwrites true_val on failure
int my_val = value;
```
So, what is this doing? Thread 2 is waiting on thread 1 to signal that `value` has been updated; then thread 2 reads `value`.

`<order_write>` and `<order_read>` do not govern how the specific atomic variable itself is seen. They govern how other values that were set before/after that atomic operation are seen.
In order for this code to work, `<order_write>` must use a memory order that is at least as strong as `memory_order_release`, and `<order_read>` must use a memory order that is at least as strong as `memory_order_acquire`.

These memory orders affect how `value` is transferred (or more specifically, everything written before the atomic write).
> wouldn't the condition that we "operate on the latest value" require something like a memory barrier?
Most architectures are unlikely to implement the actual atomic modification using a global memory barrier. It is the non-relaxed memory orders that require that: they impose a general memory barrier on the writers and readers.
Atomic operations, if they need a memory barrier to work at all, will typically use a local memory barrier. That is, a barrier specific to the address of the atomic variable.
So it is reasonable to assume that non-relaxed memory orders will hurt performance more than a relaxed memory order. That's not a guarantee of course, but it's a pretty good first-order approximation.
Is it possible for atomic implementations to use a full global memory barrier on any atomic operation? Yes. But if an implementation resorts to that for fundamental atomic types, then the architecture probably has no other choice. So if your algorithm requires atomic operations, you don't really have any other choice.
mfence: it can't make writes visible or push them to memory any faster, as the core is always trying to push writes to its cache, from where they are globally visible. But that instruction prevents following reads from getting an older value. (That can be accomplished either by waiting for writes to reach the cache, or by doing reads ASAP and then double-checking their validity later.) – Bellybutton

mfence flushes the store buffer, so following reads will not be performed until after the stores have become globally visible... – Selfsatisfaction

"The store buffer needs to be flushed to have a visible effect" is highly misleading: unlike a file buffer, the CPU store buffer is being flushed right now. – Bellybutton

mfence causes the store buffer to be flushed before following instructions (loads) are performed. That is to prevent stores and loads from being reordered (i.e. a #StoreLoad barrier). If you don't flush the store buffer, the stores will still be committed to memory, but possibly after the loads have been performed, and that is a problem for some algorithms. The Intel manual calls it "serializing" load and store operations. – Selfsatisfaction

mfence stops execution until stores are "flushed", but I don't think that's what is actually guaranteed in future processors. – Bellybutton

mfence: it guarantees that all loads and stores specified before the fence are globally observable prior to any loads or stores being carried out after the fence. – Selfsatisfaction