Loading an entire cache line at once to avoid contention for multiple elements of it
Assuming that there are three pieces of data that I need from a heavily contended cache line, is there a way to load all three things "atomically" so as to avoid more than one roundtrip to any other core?

I don't actually need a correctness guarantee of atomicity for a snapshot of all 3 members, just that in the common case all three items are read in the same clock cycle. I want to avoid the case where the cache line arrives, but then an invalidate request comes in before all 3 objects have been read. That would result in the 3rd access needing to send another request to share the line, making contention even worse.

For example,

#include <atomic>
#include <cstdint>
#include <new>  // std::hardware_destructive_interference_size

struct alignas(std::hardware_destructive_interference_size) Something {
    std::atomic<std::uint64_t> one;
    std::uint64_t two;
    std::uint64_t three;
};

void bar(std::uint64_t, std::uint64_t, std::uint64_t);

void f1(Something& something) {
    auto one = something.one.load(std::memory_order_relaxed);
    auto two = something.two;
    if (one == 0) {
        bar(one, two, something.three);
    } else {
        bar(one, two, 0);
    }

}

void baz(std::uint64_t);

void f2(Something& something) {
    while (true) {
        baz(something.one.exchange(...));  // the writer side: takes exclusive ownership of the line each iteration
    }
}

Can I somehow ensure that one, two and three all get loaded together without multiple RFOs under heavy contention (assume f1 and f2 are running concurrently)?

The target architecture / platform for the purposes of this question is Intel x86 Broadwell, but if there is a technique or compiler intrinsic that allows doing something best-effort like this somewhat portably, that would be great as well.

Heartstricken answered 30/5, 2019 at 21:21 Comment(8)
Forgive me if I misunderstand, but isn't a single cache line always loaded atomically? Do you want to load multiple, contiguous cache lines for exclusive access, atomically?Monosyllable
@alterigel Sorry. In this context, by atomic I meant something that gets served in a single roundtrip/RFO. I’ll update the question to clarifyHeartstricken
@alterigel updated, thanksHeartstricken
@Heartstricken Did you check the x86 specs? As far as I know there are some CPU instructions to prefetch memory into cache before you want to use it.Nielsen
@Raxvan, I am not familiar with these. Are you talking about __builtin_prefetch by any chance?Heartstricken
@Raxvan: Software prefetch is probably the opposite of helpful here. The line is highly contended so it won't just sit around in our L1d cache until actual loads read it. If there's any time after the prefetch arrives before the load, an RFO from another core will probably invalidate it. If you prefetch a couple instructions before a normal load, that's also useless (unless it's a write-prefetch with prefetchw); a demand load also requests the whole cache line.Washy
SW PF could help if you can prefetch a little ahead of when you read so the demand load(s) will be a load_hit_pre.sw_pf (perf event) and hide some of inter-core latency. But tuning this is hard and depends on what other stalls usually happen between the PF and the actual load; the earlier you put it, the more possibility of it being "too early"; much worse than the usual problem with tuning SW PF where there's a decent-size sweet-spot window between early-enough (to get L1d hits) and too early (evicted before use by cache conflicts, not invalidations). Here it's a slope with a cliff.Washy
But anyway, SW PF does nothing to help get all 3 loads done simultaneously; it's just as helpful regardless of whether you use Hadi's vector-load idea or not.Washy

terminology: A load won't generate an RFO, it doesn't need ownership. It only sends a request to share the data. Multiple cores can be reading from the same physical address in parallel, each with a copy of it hot in their L1d cache.

Other cores writing the line will send RFOs which invalidate the shared copy in our cache, though, and yes such an RFO can arrive after you've read one or two elements of the line but before you've read them all. (I updated your question with a description of the problem in those terms.)


Hadi's SIMD load is a good idea to grab all the data with one instruction.

As far as we know, _mm_load_si128() is in practice atomic for its 8-byte chunks, so it can safely replace the .load(mo_relaxed) of the atomic. But see Per-element atomicity of vector load/store and gather/scatter? - there's no clear written guarantee of this.

If you used _mm256_loadu_si256(), beware of GCC's default tuning -mavx256-split-unaligned-load: Why doesn't gcc resolve _mm256_loadu_pd as single vmovupd? So that's another good reason to use an aligned load, besides needing to avoid a cache-line split.

But we're writing in C++, not asm, so we need to worry about some of the other things that std::atomic with mo_relaxed does: specifically that repeated loads from the same address might not give the same value. You probably need to dereference a volatile __m256i* to roughly simulate what load(mo_relaxed) does.
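
A minimal sketch of that volatile-dereference idea (Snapshot and snapshot() are illustrative names, not from the question; it assumes the Something layout shown above with sizeof(std::atomic<std::uint64_t>) == 8 and no padding between the members, which mainstream x86-64 ABIs satisfy but ISO C++ doesn't promise):

#include <immintrin.h>
#include <cstdint>

struct Snapshot { std::uint64_t one, two, three; };

inline Snapshot snapshot(const Something& s) {
    // The volatile dereference forces a real 32-byte load on every call,
    // roughly like a relaxed atomic load; GCC/clang compile this to one vmovdqa.
    __m256i v = *reinterpret_cast<const volatile __m256i*>(&s);

    alignas(32) std::uint64_t tmp[4];
    _mm256_store_si256(reinterpret_cast<__m256i*>(tmp), v);  // spill to extract the qwords
    return {tmp[0], tmp[1], tmp[2]};                          // 4th qword is padding
}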

You can use atomic_thread_fence() if you want stronger ordering; I think in practice C++11 compilers that support Intel intrinsics will order volatile dereferences wrt. fences the same way as std::atomic loads/stores. In ISO C++, volatile objects are still subject to data-race UB, but in real implementations that can for example compile a Linux kernel, volatile can be used for multi-threading. (Linux rolls its own atomics with volatile and inline asm, and this is I think considered supported behaviour by gcc/clang.) Given what volatile actually does (object in memory matches the C++ abstract machine), it basically just automatically works, despite any rules-lawyer concerns that it's technically UB. It's UB that compilers can't know or care about because that's the whole point of volatile.
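
For example (a sketch; as noted, whether fences order plain volatile accesses is an implementation detail of real compilers rather than an ISO C++ guarantee), an acquire fence would go right after the volatile 32-byte load:

    std::atomic_thread_fence(std::memory_order_acquire);  // immediately after the volatile dereference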

In practice there's good reason to believe that entire aligned 32-byte loads/stores on Haswell and later are atomic. Certainly for reading from L1d into the out-of-order backend, but also even for transferring cache lines between cores. (e.g. multi-socket K10 can tear on 8-byte boundaries with HyperTransport, so this really is a separate issue). The only problem with taking advantage of it is the lack of any written guarantee or CPU-vendor-approved way to detect this "feature".


Other than that, for portable code it could help to hoist auto three = something.three; out of the branch; if the branch mispredicts, the 3rd load runs much later, giving another core plenty of time to invalidate the line first.

Compilers will probably not respect that source change, though, and will only load it on the path that needs it. Branchless code would always load it, so maybe we should encourage that with

    bar(one, two, one == 0 ? something.three : 0);
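
A sketch of that branchless shape (f1_branchless is an illustrative name; note the unconditional read of three is the source-level transform discussed here, which can introduce a data race at the C++ level if another thread writes three whenever one != 0):

void f1_branchless(Something& something) {
    auto one   = something.one.load(std::memory_order_relaxed);
    auto two   = something.two;
    auto three = something.three;  // always loaded: keeps all 3 load uops adjacent in the asm
    bar(one, two, one == 0 ? three : 0);
}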

Broadwell can run 2 loads per clock cycle (like all mainstream x86 since Sandybridge and K8); uops typically execute in oldest-ready-first order, so it's likely (if these loads did have to wait for data from another core) that our 2 load uops will execute in the first possible cycle after the data arrives.

The 3rd load uop will hopefully run in the cycle after that, leaving a very small window for an invalidate to cause a problem.

Or on CPUs with only 1-per clock loads, still having all 3 loads adjacent in the asm reduces the window for invalidations.

But if one == 0 is rare, then three often isn't needed at all, so loading it unconditionally does work (and, at the C++ level, reads a member another thread might be writing) that's usually wasted. You have to weigh that tradeoff when tuning, if you can't cover all the data with one SIMD load.


As discussed in comments, software prefetch could potentially help to hide some of the inter-core latency.

But you have to prefetch much later than you would for a normal array, so finding places in your code that are often running ~50 to ~100 cycles before f1() is called is a hard problem and can "infect" a lot of other code with details unrelated to its normal operation. And you need a pointer to the right cache line.

You need the PF to be late enough that the demand load happens a few (tens of) cycles before the prefetched data actually arrives. This is the opposite of the normal use-case, where L1d is a buffer to prefetch into and hold data from completed prefetches before demand-loads get to them. But you want load_hit_pre.sw_pf perf events (load hit prefetch), because that means the demand load happened while the data was still in flight, before there's any chance of it being invalidated.

That means tuning is even more brittle and difficult than usual, because instead of a nearly-flat sweet spot for prefetch distance where earlier or later doesn't hurt, earlier hides more latency right up until the point where it allows invalidations, so it's a slope all the way up to a cliff. (And any too-early prefetches just make overall contention even worse.)
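
For reference, the prefetch itself is one intrinsic; the placement ~50 to ~100 cycles ahead of f1() is the part that needs tuning. A hedged sketch (prefetch_line is just an illustrative helper name, assuming the Something type from the question):

#include <xmmintrin.h>

// Call this from code that usually runs a few dozen cycles before f1(something).
inline void prefetch_line(const Something* p) {
    _mm_prefetch(reinterpret_cast<const char*>(p), _MM_HINT_T0);
}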

Washy answered 31/5, 2019 at 1:34 Comment(15)
Thanks for the detailed answer! It's always interesting reading your answers and comments :) I have a few followups. The first is regarding this part - Certainly for reading from L1d into the out-of-order backend What is an out of order backend?Heartstricken
Broadwell can run 2 loads per clock cycle Why only 2? Could you point to where this is specified? There might be more interesting information there... I assume then that there is no way to load the entire contents of a heavily contended cacheline in a single cycle?Heartstricken
Other than that, for portable code it could help to hoist auto three = something.three; out of the branch; Does the compiler usually not implement such optimizations itself?Heartstricken
@Curious: Compilers will do if-conversion into branchless if they decide it's probably worth it. It's legal here because a pointer to the class isn't allowed to point to a partial object at the end of a page (followed by an unmapped page). And because x86 asm doesn't care about simultaneous reads if another thread is writing. But note that the C++ abstract machine doesn't read three at all if one is non-zero. This transform at the source level could introduce data-race UB if !one means another thread might be writing it.Washy
Anyway, normally compilers put conditional work inside conditional branches. If the code-gen is branchy, we may want the opposite because of this special circumstance, so no, this isn't something you'd ever expect a compiler to want to do. False sharing and high contention are very very bad, so compilers optimize for the normal case where there's no need to group loads super tightly (and there is something to be gained there: not doing the load at all in the other branch)Washy
But note that the C++ abstract machine doesn't read three at all if one is non-zero. This transform at the source level could introduce data-race UB if !one means another thread might be writing it. Ah, this makes sense, is it unreasonable to expect compilers to inspect the other code and see that this actually does not happen?Heartstricken
@Curious: out-of-order backend = the machinery that does out-of-order execution in a CPU core. The front-end fetches/decodes instructions to feed to the back-end. I was talking there about getting load data into physical registers (and onto the bypass forwarding network for instructions that use the result).Washy
Does the backend not care about branches? I don't think I follow fully :(Heartstricken
@Curious: what "other code"? You're writing a function that takes a pointer. It could be pointing to anything, and any number of other unknown threads could also have a pointer to it, and could be running unknown code. Anyway like I said, the as-if rule still allows the transformation (because x86 asm doesn't have data-race UB), but compilers may be resistant to it. And no it won't do whole-program analysis to find out more about the pointers passed to your thread functions and what else might have references to them.Washy
Does the backend not care about branches? I think you're completely missing the point I was making. Let me simplify: "For reading from L1d into registers". But anyway, the front-end has to use branch prediction to follow branches and feed the (probably) correct sequence of instructions (decoded to uops) into the back-end. When the back-end executes a conditional or indirect branch, that just means checking the prediction, and possibly having to roll back and tell the front-end to feed it the actual correct path. That's what could make the three load many cycles later.Washy
From that description, I assumed you were explaining how the load to three could occur concurrently with the other two. Could you give an example of how it would cause the load for three to occur much later? If the front-end predicts the branch leading to the load as not actually getting executed?Heartstricken
@Curious: Broadwell can load an entire cache line in 1 cycle, using two 32-byte loads. You have no guarantee that both loads actually execute in the same cycle, but that's likely if they were both stalled waiting for the same cache line. You can read basically anything about the Sandybridge microarchitecture family to learn that it has load execution units on 2 ports, up from 1 in Nehalem. e.g. realworldtech.com/haswell-cpu is a great deep-dive into Haswell (basically the same as Broadwell), and the last page has a full block diagram.Washy
But also Agner Fog's instruction tables, uops.info, instlatx64.atw.hu, and Intel's own documents (optimization manual, and even the semi-useful latency/throughput tables in their intrinsics finder) all tell you that loads have 2-per-clock throughput on Sandybridge-family. So basically any performance-details info would include this. Also en.wikichip.org/wiki/intel/microarchitectures/….Washy
@Curious: I already gave an example in my answer of how the load of three could be delayed: if it's after a conditional branch which mispredicts, it runs later by at least a dozen cycles, the branch mispredict penalty. The mispredict can't be detected until after the load of one has completed.Washy
Thanks for the discussion! I think I have my answer :)Heartstricken

As long as the size of std::atomic<uint64_t> is at most 16 bytes (which is the case in all major compilers), the total size of one, two, and three does not exceed 32 bytes. Therefore, you can define a union of __m256i and Something where the Something field is aligned to 32 bytes to ensure that it is fully contained within a single 64-byte cache line. To load all three values at the same time, you can use a single 32-byte AVX load. The corresponding compiler intrinsic is _mm256_load_si256, which causes the compiler to emit the VMOVDQA ymm1, m256 instruction. On Intel Haswell and later, this instruction decodes to a single load uop.

The 32-byte alignment is really only needed to ensure that all of the fields are contained within a single 64-byte cache line. However, _mm256_load_si256 requires the specified memory address to be 32-byte aligned; _mm256_loadu_si256 can be used instead if the address is not.
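
A sketch of that union layout (Slot and load_all are illustrative names; reading raw while data is the active member is the usual union type-punning that mainstream compilers accept even though ISO C++ doesn't bless it, and as discussed below it does not preserve the std::atomic semantics of one):

#include <immintrin.h>

// Assumes the Something definition from the question, with
// sizeof(std::atomic<std::uint64_t>) == 8 so one/two/three occupy the first 24 bytes.
union Slot {
    __m256i   raw;   // guarantees at least 32-byte alignment for the union
    Something data;  // alignas(64) in the question, so all three members share one cache line
};

inline __m256i load_all(const Slot& s) {
    // One aligned 32-byte load: VMOVDQA ymm, m256, a single load uop on Haswell and later.
    return _mm256_load_si256(&s.raw);
}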

Walther answered 30/5, 2019 at 23:14 Comment(7)
Ah interesting. One important thing to note here is that the first variable has to be atomic. Are the compiler intrinsic and the corresponding instruction atomic, and do they impose at least an acquire memory order?Heartstricken
@Heartstricken The x86 ISA only guarantees atomicity for 8-byte accesses that don't cross a cache line (see: Atomicity on x86). Although the manual doesn't say that an aligned and cacheable VMOVDQA ymm1, m256 is atomic, it is atomic on Haswell and later in practice. However, the complication here is that the semantics of std::atomic<uint64_t> are not preserved when accessing it through the __m256i field.Walther
@Heartstricken and Hadi: see also Per-element atomicity of vector load/store and gather/scatter? - we know in practice that this is fine, but there's no written guarantee in the x86 manuals that wider vector loads will never have tearing inside an 8-byte-aligned chunk.Washy
@Hadi: you definitely want alignment; AMD CPUs can introduce tearing across boundaries narrower than a full cache line. (At least K8/K10 could, probably still Bulldozer-family.) Of course Bulldozer/Ryzen will decode a 32-byte load as two 16-byte halves anyway, but hopefully they can both execute in the same clock cycle when the cache line arrives, and avoid having to request it again if an invalidate arrives the next cycle.Washy
Could you explain the complication a bit more? I don't think I follow.. Also thanks @PeterCordesHeartstricken
@Curious: std::atomic has special rules that make it well-defined to read while another thread is writing it. e.g. in practice the compiler can't assume that loading the same pointer twice will get the same value. _mm256_load_si256 is like a regular dereference, and does not do this. In practice a dereference of a volatile __m256i* pointer should be equivalent to a memory_order_relaxed load on x86. For stronger ordering you'd also need atomic_thread_fence. (It may be an implementation detail how fences affect volatile, not just std::atomic, objects.)Washy
Thanks for the answer! If I could accept two answers I would.. But some of the details in the second answer helped converge to a solution (for now at least...), so going with that one.Heartstricken