This is an unfortunately common misconception about the atomic memory orders. They do not (entirely) apply to the atomic operation itself; they apply mainly to other operations around it.
For example:
```cpp
//accessible from anywhere
std::atomic<bool> flag;
int value = 0;

//code in thread 1:
value = 1;
flag.store(true, <order_write>);

//code in thread 2:
bool true_val = true;
while(!flag.compare_exchange_weak(true_val, false, <order_read>))
    true_val = true; //compare_exchange_weak overwrites true_val on failure
int my_val = value;
```
So, what is this doing? Thread 2 is waiting on thread 1 to signal that `value` has been updated; then thread 2 reads `value`.

`<order_write>` and `<order_read>` do not govern how the specific atomic variable itself is seen. They govern how other values that were set before/after that atomic operation are seen.
In order for this code to work, `<order_write>` must use a memory order that is at least as strong as `memory_order_release`, and `<order_read>` must use a memory order that is at least as strong as `memory_order_acquire`.

These memory orders affect how `value` is transferred (or more specifically, everything written before the atomic write).
> wouldn't the condition that we "operate on the latest value" require something like a memory barrier?
Most architectures are unlikely to implement the actual atomic modification using a global memory barrier. It is the non-relaxed memory orders that require that: they impose a general memory barrier on the writers and readers.
Atomic operations, if they need a memory barrier to work at all, will typically use a local memory barrier. That is, a barrier specific to the address of the atomic variable.
So it is reasonable to assume that non-relaxed memory orders will hurt performance more than a relaxed memory order. That's not a guarantee of course, but it's a pretty good first-order approximation.
Is it possible for atomic implementations to use a full global memory barrier on any atomic operation? Yes. But if an implementation resorts to that for fundamental atomic types, then the architecture probably has no other choice. So if your algorithm requires atomic operations, you don't really have any other choice.
mfence: it can't make writes visible or push them to memory any faster, as the core is always trying to push writes to its cache, from where they are globally visible. But that instruction prevents following reads from getting an older value. (That can be accomplished either by waiting for writes to reach the cache, or by doing reads ASAP and then double-checking their validity later.) – Bellybutton

mfence flushes the store buffer, so following reads will not be performed until after the stores have become globally visible... – Selfsatisfaction

"The store buffer needs to be flushed to have a visible effect" is highly misleading: unlike a file buffer, the CPU store buffer is being flushed right now. – Bellybutton

mfence causes the store buffer to be flushed before following instructions (loads) are performed. That is to prevent stores and loads from being reordered (i.e. a #StoreLoad barrier). If you don't flush the store buffer, the stores will still be committed to memory, but possibly after the loads have been performed, and that is a problem for some algorithms. The Intel manual calls it "serializing" load and store operations. – Selfsatisfaction

mfence stops execution until stores are "flushed", but I don't think that's what is actually guaranteed in future processors. – Bellybutton

mfence: it guarantees that all loads and stores specified before the fence are globally observable prior to any loads or stores being carried out after the fence. – Selfsatisfaction