Does a hardware memory barrier make visibility of atomic operations faster, in addition to providing the necessary guarantees?

TL;DR: In a producer-consumer queue, does it ever make sense to put an unnecessary (from the C++ memory model's viewpoint) memory fence, or to use an unnecessarily strong memory order, to get better latency at the expense of possibly worse throughput?


The C++ memory model is implemented on hardware by emitting some sort of memory fence (or fence-like instruction) for stronger memory orders and omitting it for weaker ones.

In particular, if the producer does store(memory_order_release) and the consumer observes the stored value with load(memory_order_acquire), there are no fences between the load and the store. On x86 there are no fences at all; on ARM, fences are placed before the store and after the load.
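For concreteness, here is a minimal release/acquire pair outside the queue context (a sketch, not part of the original scenario; the codegen notes are my understanding of typical compiler output and are worth checking on a compiler explorer):

#include <atomic>

std::atomic<int> ready{0};
int payload;

void producer_side(int v)
{
    payload = v;                                // plain store
    ready.store(1, std::memory_order_release);  // x86-64: plain mov, no fence instruction
                                                // ARMv7: dmb before the store; AArch64: stlr
}

int consumer_side()
{
    while (ready.load(std::memory_order_acquire) == 0)  // x86-64: plain mov in a loop
    {                                                    // ARMv7: dmb after the load; AArch64: ldar
    }
    return payload;
}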

The value stored without a fence will eventually be observed by a load without a fence (possibly after a few unsuccessful attempts).

I'm wondering whether putting a fence on either side of the queue can make the value be observed faster. If so, what is the latency with and without the fence?

I expect that just having a loop with load(memory_order_acquire) and pause / yield, limited to thousands of iterations, is the best option, as it is used everywhere, but I want to understand why.

Since this question is about hardware behavior, I expect there's no generic answer. If so, I'm wondering mostly about x86 (x64 flavor), and secondarily about ARM.


Example:

#include <atomic>
#include <cstddef>
#include <emmintrin.h>  // _mm_pause

T queue[MAX_SIZE];  // element type T and MAX_SIZE defined elsewhere

std::atomic<std::size_t> shared_producer_index;

void producer()
{
    std::size_t private_producer_index = 0;

    for (;;)
    {
        private_producer_index++;  // Handling rollover and queue full omitted

        /* fill data */;

        shared_producer_index.store(
            private_producer_index, std::memory_order_release);
        // Maybe barrier here or stronger order above?
    }
}

void consumer()
{
    std::size_t private_consumer_index = 0;

    for (;;)
    {
        std::size_t observed_producer_index = shared_producer_index.load(
            std::memory_order_acquire);

        while (private_consumer_index == observed_producer_index)
        {
            // Maybe barrier here or stronger order below?
            _mm_pause();
            observed_producer_index = shared_producer_index.load(
                std::memory_order_acquire);
            // Switching from busy wait to kernel wait after some iterations omitted
            // (see the sketch after this example)
        }

        /* consume as much data as the index difference specifies */;

        private_consumer_index = observed_producer_index;
    }
}
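A hedged sketch of the busy-wait-to-kernel-wait switch the comment above omits (the spin_limit value, the wait_for_progress helper name, and the fallback to std::this_thread::yield are illustrative assumptions, not part of the original code; a real queue might block on a futex or condition variable instead):

#include <atomic>
#include <cstddef>
#include <thread>
#include <emmintrin.h>  // _mm_pause (x86-specific, as in the example above)

std::size_t wait_for_progress(const std::atomic<std::size_t>& shared_index,
                              std::size_t last_seen)
{
    constexpr int spin_limit = 4000;  // illustrative bound; tune by profiling

    for (int i = 0; i < spin_limit; ++i)
    {
        std::size_t observed = shared_index.load(std::memory_order_acquire);
        if (observed != last_seen)
            return observed;
        _mm_pause();  // be polite to the sibling hyperthread and save a little power
    }

    for (;;)  // spin limit exceeded: give up the time slice between polls
    {
        std::size_t observed = shared_index.load(std::memory_order_acquire);
        if (observed != last_seen)
            return observed;
        std::this_thread::yield();
    }
}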
Ember answered 4/5, 2020 at 11:33. Comments (3):
Providing some code example would be helpful here. I am not fully sure what you are asking. – Ant
I provided an example, though the question is answered. – Ember
@bartop: Just my 2 cents: it seemed clear to me without an example. It might be one of those cases where it's clear from the question to people that know the answer. It's probably not a bad thing to have one, perhaps helping more readers understand the point of my answer. (It's about attempting to minimize inter-core latency.) – Operon

Basically no significant effect on inter-core latency, and definitely never worth using "blindly" without careful profiling, if you suspect there might be any contention from later loads missing in cache.

It's a common misconception that asm barriers are needed to make the store buffer commit to cache. In fact, barriers just make this core wait for something that was already going to happen on its own before doing later loads and/or stores; a full barrier blocks later loads and stores until the store buffer is drained. (See: Size of store buffers on Intel hardware? What exactly is a store buffer?)
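To make the question's "maybe barrier here" concrete, the two producer-side variants being compared look like this (a sketch with an extern declaration for the question's variable; the codegen comments assume typical x86-64 compilers):

#include <atomic>
#include <cstddef>

extern std::atomic<std::size_t> shared_producer_index;  // from the question's example

void publish_release(std::size_t private_producer_index)
{
    // Variant A: what the question's example already does.
    // x86-64: a plain mov store; the store buffer drains on its own afterwards.
    shared_producer_index.store(private_producer_index, std::memory_order_release);
}

void publish_seq_cst(std::size_t private_producer_index)
{
    // Variant B: the "unnecessarily strong" alternative the question asks about.
    // x86-64: typically xchg (or mov + mfence), which stalls *this* core's later
    // loads/stores until the store buffer drains; it does not make the store
    // visible to other cores any sooner.
    shared_producer_index.store(private_producer_index, std::memory_order_seq_cst);
}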

In the bad old days before std::atomic, compiler barriers were one way to stop the compiler from keeping values in registers (private to a CPU core / thread, not coherent), but that's a compilation issue, not asm. CPUs with non-coherent caches are possible in theory (where std::atomic would need to do explicit flushing to make a store visible), but in practice no implementation runs std::thread across cores with non-coherent caches.
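For reference, the classic compiler-only barrier from that era emits no machine instructions; it just stops the compiler from caching values in registers across it (a sketch using GNU C syntax):

#include <atomic>

void compiler_barrier_examples()
{
    // GNU C compiler barrier: tells the compiler that memory may have changed,
    // forcing it to reload/spill memory operands. No fence instruction is emitted.
    asm volatile("" ::: "memory");

    // Portable C++11 equivalent that is likewise compile-time only:
    std::atomic_signal_fence(std::memory_order_seq_cst);
}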


The question If I don't use fences, how long could it take a core to see another core's writes? is highly related; I've written basically this answer at least a few times before. (But this looks like a good place for an answer specifically about this, without getting into the weeds of which barriers do what.)


There might be some very minor secondary effects of blocking later loads that could maybe compete with RFOs (for this core to get exclusive access to a cache line to commit a store). The CPU always tries to drain the store buffer as fast as possible (by committing to L1d cache). As soon as a store commits to L1d cache, it becomes globally visible to all other cores. (Because they're coherent; they'd still have to make a share request...)

Getting the current core to write back some store data to L3 cache (especially in shared state) could reduce the miss penalty if the load on another core happens somewhat after this store commits. But there are no good ways to do that, other than maybe creating a conflict miss in L1d and L2, if producer performance is unimportant apart from achieving low latency for the next read.

On x86, Intel Tremont (the low-power Silvermont series) will introduce cldemote (_mm_cldemote), which writes back a line as far as an outer cache, but not all the way to DRAM. (clwb could possibly help, but it forces the store to go all the way to DRAM. Also, the Skylake implementation is just a placeholder and works like clflushopt.)
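Where the intrinsic is available, usage would just be a hint after the store (a sketch; _mm_cldemote needs compiler support such as GCC/Clang's -mcldemote, and the instruction is reportedly treated as a NOP on CPUs that don't implement it):

#include <atomic>
#include <cstddef>
#include <immintrin.h>  // _mm_cldemote

extern std::atomic<std::size_t> shared_producer_index;  // from the question's example

void publish_and_demote(std::size_t private_producer_index)
{
    shared_producer_index.store(private_producer_index, std::memory_order_release);
    _mm_cldemote(&shared_producer_index);  // hint: push the line out toward a shared outer cache
}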

Fun fact: non-seq_cst stores/loads on PowerPC can store-forward between logical cores on the same physical core, making stores visible to some other cores before they become globally visible to all other cores. This is AFAIK the only real hardware mechanism for threads to not agree on a global order of stores to all objects. (See: Will two atomic writes to different locations in different threads always be seen in the same order by other threads?) On other ISAs, including ARMv8 and x86, it's guaranteed that stores become visible to all other cores at the same time (via commit to L1d cache).


For loads, CPUs already prioritize demand loads over any other memory accesses (because of course execution has to wait for them.) A barrier before a load could only delay it.

That might happen to be optimal by coincidence of timing, if that makes it see the store it was waiting for instead of going "too soon" and seeing the old cached boring value. But there's generally no reason to assume or ever predict that a pause or barrier could be a good idea ahead of a load.

A barrier after a load shouldn't help either. Later loads or stores might be able to start, but out-of-order CPUs generally do stuff in oldest-first priority so later loads probably can't fill up all the outstanding load buffers before this load gets a chance to get its load request sent off-core (assuming a cache miss because another core stored recently.)

I guess I could imagine a benefit to a later barrier if this load address wasn't ready for a while (pointer-chasing situation) and the max number of off-core requests were already in-flight when the address did become known.

Any possible benefit is almost certainly not worth it; if there was that much useful work independent of this load that it could fill up all the off-core request buffers (LFBs on Intel) then it might well not be on the critical path and it's probably a good thing to have those loads in flight.

Operon answered 4/5, 2020 at 12:08. Comments (3):
I see the question is a duplicate of those, but I asked it in C++ terms, not in hardware terms, so I could not find the other questions. The latency was my worry; knowing that the CPU already tries to deliver these stores soon, and that the value will be there within at most 1 microsecond, is enough for me. – Ember
@AlexGuteniev: If I thought it was a real duplicate I would have just closed it instead of answering. Non-trivial subject matter like this feels more worth answering even if it's almost a duplicate, unlike beginner homework questions. And this one didn't start with a bunch of misconceptions to correct, so it seemed like a good place to go over my current understanding as a canonical answer that could usefully be linked to later. – Operon
See also: Is a memory barrier required to read a value that is atomically modified? That question also worried about correctness, like the "latest value" thing, as if time across threads worked like that. – Operon
