If I don't use fences, how long could it take a core to see another core's writes?

I have been trying to Google my question but I honestly don't know how to succinctly state the question.

Suppose I have two threads in a multi-core Intel system. These threads are running on the same NUMA node. Suppose thread 1 writes to X once, then only reads it occasionally moving forward. Suppose further that, among other things, thread 2 reads X continuously. If I don't use a memory fence, how long could it be between thread 1 writing X and thread 2 seeing the updated value?
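In code terms, roughly this (a made-up minimal sketch, not my real application):

    #include <atomic>
    #include <thread>

    std::atomic<int> X{0};   // the shared variable

    void thread1() {
        X.store(42, std::memory_order_relaxed);   // one write, no fence
        // ... only reads X occasionally from here on ...
    }

    void thread2() {
        while (X.load(std::memory_order_relaxed) != 42) {
            // spin: how long can this loop keep seeing the old value?
        }
    }

    int main() {
        std::thread t2(thread2);
        std::thread t1(thread1);
        t1.join();
        t2.join();
    }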

I understand that the write of X will go to the store buffer and from there to the cache, at which point MESIF will kick in and thread 2 will see the updated value via QPI. (Or at least this is what I've gleaned.) I presume that the store buffer gets written to the cache either on a store fence or when a store-buffer entry needs to be reused, but I don't know how store-buffer entries get allocated to writes.

Ultimately the question I'm trying to answer for myself is if it is possible for thread 2 to not see thread 1's write for several seconds in a fairly complicated application that is doing other work.

Headache answered 11/7, 2018 at 19:12 Comment(1)
If the two threads are running on the same NUMA node, then QPI would not be involved. – Retrocede

Memory barriers don't make other threads see your stores any faster. (Except that blocking later loads could slightly reduce contention for committing buffered stores.)

The store buffer always tries to commit retired (known non-speculative) stores to L1d cache as fast as possible. Cache is coherent¹, so that makes them globally visible because of MESI/MESIF/MOESI. The store buffer is not designed as a proper cache or write-combining buffer (although it can combine back-to-back stores to the same cache line), so it needs to empty itself to make room for new stores. Unlike a cache, it wants to keep itself empty, not full.

Note 1: not just x86; all multi-core systems of any ISA where we can run a single instance of Linux across the cores are necessarily cache coherent; Linux relies on volatile for its hand-rolled atomics to make data visible. Similarly, C++ std::atomic load/store operations with mo_relaxed are just plain asm loads and stores on all normal CPUs, relying on hardware for visibility between cores, not on manual flushing. When to use volatile with multi threading? explains this in more detail. There are some clusters, or hybrid microcontroller+DSP ARM boards, with non-coherent shared memory, but we don't run threads of the same process across separate coherency domains; instead, you run a separate OS instance on each cluster node. I'm not aware of any C++ implementation where atomic<T> loads/stores include manual flush instructions. (Please let me know if there are any.)
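To illustrate that footnote with a sketch of my own (the asm in the comments is what mainstream gcc/clang typically emit for x86; check your own compiler's output to be sure): a relaxed atomic store/load pair compiles to plain moves, with no flush or fence instruction anywhere.

    #include <atomic>

    std::atomic<int> flag{0};

    void publish() {
        flag.store(1, std::memory_order_relaxed);    // x86: mov dword ptr [flag], 1   (no fence)
    }

    int observe() {
        return flag.load(std::memory_order_relaxed); // x86: mov eax, dword ptr [flag]
    }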


Fences/barriers work by making the current thread wait

... until whatever visibility is required has happened via the normal mechanisms.

A simple implementation of a full barrier (mfence or a locked operation) is to stall the pipeline until the store buffer drains, but high-performance implementations can do better and allow out-of-order execution separately from the memory-order restriction.

(Unfortunately Skylake's mfence does fully block out-of-order execution, to fix the obscure SKL079 erratum involving NT loads from WC memory. But lock add or xchg or whatever only block later loads from reading L1d or the store buffer until the barrier reaches the end of the store buffer. And mfence on earlier CPUs presumably also doesn't have that problem.)
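For a concrete example of what that waiting buys you (my sketch, the classic StoreLoad litmus test, not something from the question): without the fences, x86 allows each load to execute before the thread's own buffered store has committed, so both threads can read 0. The fence doesn't make either store visible any sooner; it just makes each thread wait for its own store before loading.

    #include <atomic>

    std::atomic<int> x{0}, y{0};

    int observer1() {
        x.store(1, std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_seq_cst); // mfence / locked op: wait for own store to commit
        return y.load(std::memory_order_relaxed);
    }

    int observer2() {
        y.store(1, std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_seq_cst);
        return x.load(std::memory_order_relaxed);
    }
    // Run concurrently: with the fences, at least one function returns 1.
    // Without them, both can return 0 (StoreLoad reordering), even though
    // the stores still become globally visible just as quickly.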


In general, on non-x86 architectures (which have explicit asm instructions for weaker memory barriers, e.g. StoreStore-only fences that don't care about loads), the principle is the same: the barrier blocks whichever operations it needs to block until this core has completed earlier operations of the relevant type.
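For example, the standard publication pattern only needs StoreStore ordering on the writer side and LoadLoad on the reader side. A sketch (on AArch64 the release store can compile to stlr; on x86 both sides are plain loads/stores):

    #include <atomic>

    int payload;                         // plain, non-atomic data
    std::atomic<bool> ready{false};

    void producer() {
        payload = 42;
        ready.store(true, std::memory_order_release); // StoreStore: payload committed before the flag
    }

    int consumer() {
        if (ready.load(std::memory_order_acquire))    // LoadLoad: pairs with the release store
            return payload;                           // guaranteed to read 42
        return -1;                                    // flag not visible yet
    }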


"Ultimately the question I'm trying to answer for myself is if it is possible for thread 2 to not see thread 1's write for several seconds"

No, the worst-case latency is maybe something like store-buffer length (56 entries on Skylake, up from 42 in BDW) times cache-miss latency, because x86's strong memory model (no StoreStore reordering) requires stores to commit in-order. But RFOs for multiple cache lines can be in flight at once, so the max delay is maybe 1/5th of that (conservative estimate: there are 10 Line Fill Buffers). There can also be contention from loads also in flight (or from other cores), but we just want an order of magnitude back-of-the-envelope number.

Let's say RFO latency (from DRAM or from another core) is 300 clock cycles (basically made up) on a 3GHz CPU. Then a worst-case delay for a store to become globally visible is maybe something like 300 * 56 / 5 = 3360 core clock cycles. So, within an order of magnitude, the worst case is about ~1 microsecond on the 3GHz CPU we're assuming. (The CPU frequency cancels out, so an estimate of RFO latency in nanoseconds would have been more useful.)
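Spelled out as code (every input is one of the made-up numbers above, so treat the output the same way):

    // Back-of-the-envelope worst case, using the assumed numbers above.
    constexpr double rfo_cycles  = 300.0;  // guessed RFO (cache miss) latency, in core clocks
    constexpr double sb_entries  = 56.0;   // Skylake store-buffer size
    constexpr double parallelism = 5.0;    // conservative memory-level parallelism (~10 LFBs)
    constexpr double ghz         = 3.0;

    constexpr double worst_cycles = rfo_cycles * sb_entries / parallelism; // = 3360 cycles
    constexpr double worst_us     = worst_cycles / (ghz * 1000.0);         // ~1.12 microseconds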

That's when all your stores need to wait a long time for RFOs, because they're all to locations that are uncached or owned by other cores. And none of them are to the same cache line back-to-back so none can merge in the store buffer. So normally you'd expect it to be significantly faster.

I don't think there's any plausible mechanism for it to take even a hundred microseconds, let alone a whole second.

If all your stores are to cache lines where other cores are all contending for access to the same line, your RFOs could take longer than normal, so maybe tens of microseconds, maybe even a hundred. But that kind of absolute worst case wouldn't happen by accident.
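If you'd rather measure than estimate, here's a rough probe (my sketch; crude methodology: a single sample, no core pinning, clock overhead included). I use release/acquire only so that reading the timestamp is data-race-free; on x86 those still compile to plain loads and stores, with no fence instruction.

    #include <atomic>
    #include <chrono>
    #include <cstdio>
    #include <thread>

    int main() {
        using clock = std::chrono::steady_clock;
        std::atomic<int> value{0};
        clock::time_point t_store;

        std::thread reader([&] {
            while (value.load(std::memory_order_acquire) == 0) {
                // spin until the writer's store becomes visible
            }
            auto t_seen = clock::now();
            auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(t_seen - t_store).count();
            std::printf("store became visible after ~%lld ns\n", (long long)ns);
        });

        std::this_thread::sleep_for(std::chrono::milliseconds(100)); // let the reader start spinning
        t_store = clock::now();
        value.store(1, std::memory_order_release); // plain mov on x86, no fence
        reader.join();
    }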

Concuss answered 11/7, 2018 at 23:52 Comment(13)
I think that we can model the time it takes for a store to reach another thread as follows: the time it takes the store to be retired (which means leaving the store buffer to either the L1D or an LFB) + the time it takes to copy the line from the private cache of the core to the private cache of the other core (or the target register). This may require an L2-L2 transfer for inclusive L2 and for different physical cores. But both of these time components can vary wildly. It's very hard to put an upper limit on that. – Retrocede
@HadiBrais: Retirement from the out-of-order core (ROB) is separate from reaching the end of the store buffer and committing to L1d (to a line in M state). The store buffer decouples OoO execution/retirement from L1d commit. Retirement is a pre-condition for commit, but that's all. (A large store buffer can impact interrupt latency, because there's no way to discard it or roll it back; stores that have left the ROB need to happen for correctness.) I left OoO exec out of my calculation, which could maybe be significant for timing w.r.t. loads in low-IPC code. – Concuss
But yeah, things can vary wildly, which is why I can confidently rule out a whole second, but I'm only claiming an order of magnitude for my 1 us quick estimate of worst-case latency. – Concuss
Consider a shared-memory system with multiple sockets where the two threads are on two different sockets. How much time does it take for one thread to get the data of the other? If all the cores are very busy doing stuff, can it take more than a second in the worst case? – Retrocede
Maybe if there is some kind of prioritization system on the interconnect between the sockets, the core waiting for the cache line might starve. But I don't know that much. Is there? – Retrocede
@HadiBrais: good question. Off-core latency can be high. Or with dozens of cores contending for access to the same line, especially with locked operations that hold onto it for a few clock cycles, that could be pretty slow. I thought about multi-socket systems but I guess didn't really account for them in my worst-case estimate. Maybe 100 us is plausible, maybe even 1 ms if the store is stuck behind multiple other stores that all have to wait out heavy contention. I'm not aware of any priority system in the HW contention manager on Intel or AMD CPUs. (And this was tagged [intel].) – Concuss
Note that the OP is interested in transferring the line over QPI, as mentioned in the question. – Retrocede
@HadiBrais: they do say "These threads are running on the same NUMA node", but yeah, other contention with threads on other sockets could delay the store you care about. – Concuss
Yes, this answer looks good for the case where the two threads are running on the same NUMA node. – Retrocede
If you're talking about worst-case scenarios, I wouldn't count on getting an MLP of 5, I would use 1. You might not get the store MLP because the LFBs are busy doing something else (e.g., handling loads), or because the MLP is defeated by lines being stolen away by other cores before commit. There are probably other things that make it slower than usual, e.g., split cache-line stores, etc. – Cardamom
@PeterCordes - I'm interested in this SKL079 thing. Your link goes to a comment that was deleted, I think. Do you know anything more about it? Is the claim that there is a microcode update that makes mfence slower on Skylake but avoids reordering with NT loads from WC memory? In Haswell there was HSD162, which is that NT loads from WC memory can pass locked instructions, but without a "fix" except a recommendation to use mfence instead. That erratum still exists in Skylake. However, Haswell doesn't have the mfence erratum for the same scenario. – Cardamom
So, purely speculatively, maybe what happened is that in Skylake, mfence was improved to use the same kind of faster mechanism that locked instructions use (which was always confusing: if the ordering guarantees were the same, wtf is mfence slower?), which caused mfence to suffer from the same erratum as locked instructions in HSD162, so a microcode update was created to fix mfence, which slowed it down in Skylake (by how much I don't know). Agner shows mfence as having the same latency and uop distribution on SKL and HSW, though... – Cardamom
@BeeOnRope: I linked to my answer on another question. The last section of that answer has more details about SKL079 and my conclusions. If anyone could test mfence on HSW to see if it blocks OoO exec, that would be cool. Nice idea comparing the HSW numbers; maybe they attempted to make mfence more efficient in SKL but ended up reverting it. I had been assuming that earlier uarches were more efficient, but maybe not. – Concuss
