Terminology: a load won't generate an RFO; it doesn't need ownership. It only sends a request to share the data. Multiple cores can be reading from the same physical address in parallel, each with a copy of it hot in their L1d cache.

Other cores writing the line will send RFOs which invalidate the shared copy in our cache, though, and yes, such an invalidation can arrive after we've read one or two elements of the cache line but before we've read the rest. (I updated your question with a description of the problem in those terms.)
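For concreteness, here's a hedged sketch of the pattern being discussed; `Something`, `f1`, and `bar` are assumed names based on the identifiers used later in this answer, not the question's exact code:

```cpp
#include <atomic>

void bar(int, int, int);                  // consumer, hypothetical signature

struct alignas(64) Something {            // assume all three members share one cache line
    std::atomic<int> one, two, three;
};

void f1(const Something &something) {
    int one = something.one.load(std::memory_order_relaxed);
    int two = something.two.load(std::memory_order_relaxed);
    // An RFO from a writing core can invalidate our copy of the line here,
    // before the conditional 3rd load has happened.
    if (one == 0)
        bar(one, two, something.three.load(std::memory_order_relaxed));
    else
        bar(one, two, 0);
}
```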
Hadi's SIMD load is a good idea to grab all the data with one instruction. As far as we know, `_mm_load_si128()` is in practice atomic for its 8-byte chunks, so it can safely replace the `.load(mo_relaxed)` of the atomic. But see *Per-element atomicity of vector load/store and gather/scatter?* - there's no clear written guarantee of this.
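A minimal sketch of that idea, reusing the hypothetical `Something`/`bar` names from above. It assumes the three `std::atomic<int>` members are lock-free with plain-`int` layout and sit within the first 16 (aligned) bytes of the object; that's true on mainstream x86 implementations but not guaranteed by the standard, and the extracts need SSE4.1:

```cpp
#include <immintrin.h>

void f1_sse(const Something &something) {
    // One aligned 16-byte load covers .one, .two and .three.
    // __m128i is a may-alias type, so loading through it is OK in GCC/clang.
    __m128i v = _mm_load_si128(reinterpret_cast<const __m128i*>(&something));
    int one   = _mm_cvtsi128_si32(v);        // element 0
    int two   = _mm_extract_epi32(v, 1);     // element 1
    int three = _mm_extract_epi32(v, 2);     // element 2
    bar(one, two, one == 0 ? three : 0);
}
```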
If you used `_mm256_loadu_si256()`, beware of GCC's default tuning `-mavx256-split-unaligned-load`: *Why doesn't gcc resolve _mm256_loadu_pd as single vmovupd?* So that's another good reason to use an aligned load, besides needing to avoid a cache-line split.
But we're writing in C, not asm, so we need to worry about some of the other things that `std::atomic` with `mo_relaxed` does: specifically, that repeated loads from the same address might not give the same value. You probably need to dereference a `volatile __m256i*` to kind of simulate what `load(mo_relaxed)` does.
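A minimal sketch of that `volatile` dereference, assuming the object is 32-byte aligned so this compiles (with GCC/clang) to a single aligned `vmovdqa` with no cache-line split. Neither Intel nor ISO C++ guarantees the 32-byte load is atomic; this relies on observed behaviour:

```cpp
#include <immintrin.h>

static inline __m256i load_volatile_256(const void *p) {
    // A volatile access can't be duplicated, hoisted, or optimized away,
    // which is the part of load(mo_relaxed) we need to simulate here.
    return *static_cast<const volatile __m256i *>(p);
}

// usage: __m256i v = load_volatile_256(&something);  // whole struct in one load
```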
You can use `atomic_thread_fence()` if you want stronger ordering; I think in practice C++11 compilers that support Intel intrinsics will order `volatile` dereferences wrt. fences the same way as `std::atomic` loads/stores.

In ISO C++, `volatile` objects are still subject to data-race UB, but in real implementations that can (for example) compile a Linux kernel, `volatile` can be used for multi-threading. (Linux rolls its own atomics with `volatile` and inline asm, and this is, I think, considered supported behaviour by gcc/clang.) Given what `volatile` actually does (the object in memory matches the C++ abstract machine), it basically just works automatically, despite any rules-lawyer concerns that it's technically UB. It's UB that compilers can't know or care about, because that's the whole point of `volatile`.
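For example, if you wanted acquire rather than relaxed ordering, a sketch like this leans on that in-practice assumption that the fence orders the `volatile` access the same way it would order an atomic load:

```cpp
#include <atomic>
#include <immintrin.h>

__m256i load_acquire_256(const void *p) {
    __m256i v = *static_cast<const volatile __m256i *>(p);
    // ISO C++ only promises ordering relative to atomics, but real compilers
    // that support intrinsics also order volatile accesses across the fence.
    std::atomic_thread_fence(std::memory_order_acquire);
    return v;
}
```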
In practice there's good reason to believe that entire aligned 32-byte loads/stores on Haswell and later are atomic. Certainly for reading from L1d into the out-of-order backend, but also for transferring cache lines between cores. (E.g. multi-socket K10 can tear on 8-byte boundaries with HyperTransport, so this really is a separate issue.) The only problem with taking advantage of it is the lack of any written guarantee or CPU-vendor-approved way to detect this "feature".
Other than that, for portable code it could help to hoist `auto three = something.three;` out of the branch; a branch mispredict gives the core much more time to invalidate the line before the 3rd load. But compilers will probably not respect that source change, and will only load it in the case that needs it. Branchless code would always load it, though, so maybe we should encourage that with `bar(one, two, one == 0 ? something.three : 0);`
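A sketch of that, with the same hypothetical names as above: hoist the load of `.three` so all three loads are adjacent, then select with a ternary so no branch sits between the loads in the source:

```cpp
void f1_branchless(const Something &something) {
    int one   = something.one.load(std::memory_order_relaxed);
    int two   = something.two.load(std::memory_order_relaxed);
    int three = something.three.load(std::memory_order_relaxed);  // unconditional
    bar(one, two, one == 0 ? three : 0);  // the select can compile to cmov, not a branch
}
```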
Broadwell can run 2 loads per clock cycle (like all mainstream x86 since Sandybridge and K8); uops typically execute in oldest-ready-first order, so it's likely (if this load did have to wait for data from another core) that our first 2 load uops will execute in the first cycle possible after the data arrives. The 3rd load uop will hopefully run in the cycle after that, leaving a very small window for an invalidate to cause a problem. Or on CPUs with only 1-per-clock loads, still having all 3 loads adjacent in the asm reduces the window for invalidations.
But if `one == 0` is rare, then `three` often isn't needed at all, so unconditional loading brings a risk of unnecessary requests for it. So you have to consider that tradeoff when tuning, if you can't cover all the data with one SIMD load.
As discussed in comments, software prefetch could potentially help to hide some of the inter-core latency.
But you have to prefetch much later than you would for a normal array, so finding places in your code that are often running ~50 to ~100 cycles before `f1()` is called is a hard problem, and it can "infect" a lot of other code with details unrelated to its normal operation. And you need a pointer to the right cache line.

You need the PF to be late enough that the demand load happens a few (tens of) cycles before the prefetched data actually arrives. This is the opposite of the normal use-case, where L1d is a buffer to prefetch into and hold data from completed prefetches before demand loads get to them. Here you want `load_hit_pre.sw_pf` perf events (load hit prefetch), because that means the demand load happened while the data was still in flight, before there's any chance of it being invalidated.
That means tuning is even more brittle and difficult than usual, because instead of a nearly-flat sweet spot for prefetch distance where earlier or later doesn't hurt, earlier hides more latency right up until the point where it allows invalidations, so it's a slope all the way up to a cliff. (And any too-early prefetches just make overall contention even worse.)
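A sketch of what issuing the prefetch might look like; the calling function and the amount of intervening work are assumptions, and the whole point is that the prefetch lives in code that runs well before `f1()`, not in `f1()` itself:

```cpp
#include <xmmintrin.h>   // _mm_prefetch, _MM_HINT_T0

void do_other_work_then_f1(const Something &something) {
    _mm_prefetch(reinterpret_cast<const char*>(&something), _MM_HINT_T0);
    // ... roughly 50 to 100 cycles of unrelated work, so the line is still
    //     in flight (or has only just arrived) when f1() demand-loads it ...
    f1(something);
}
```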