Intel's optimization manual does describe the L2 spatial prefetcher in SnB-family CPUs. Yes, it tries to complete 128B-aligned pairs of 64B lines, when there's spare memory bandwidth (off-core request tracking slots) when the first line is getting pulled in.
Your microbenchmark doesn't show any significant time difference between 64 vs. 128 byte separation. Without any actual false sharing (within the same 64 byte line), after some initial chaos, it quickly reaches a state where each core has exclusive ownership of the cache line it's modifying. That means no further L1d misses, thus no requests to L2 which would trigger the L2 spatial prefetcher.
Unlike if for example two pairs of threads contending over separate atomic<int>
variables in adjacent (or not) cache lines. Or false-sharing with them. Then L2 spatial prefetching could couple the contention together, so all 4 threads are contending with each other instead of 2 independent pairs. Basically any case where cache lines actually are bouncing back and forth between cores, L2 spatial prefetching can make it worse if you aren't careful.
(The L2 prefetcher doesn't keep trying indefinitely to complete pairs of lines of every valid line it's caching; that would hurt cases like this where different cores are repeatedly touching adjacent lines, more than it helps anything.)
Understanding std::hardware_destructive_interference_size and std::hardware_constructive_interference_size includes an answer with a longer benchmark; I haven't looked at it recently, but I think it's supposed to demo destructive interference at 64 bytes but not 128. The answer there unfortunately doesn't mention L2 spatial prefetching as one of the effects that can cause some destructive interference (although not as much as a 128-byte line size in an outer level of cache, especially not if it's an inclusive cache).
Perf counters reveal a difference even with your benchmark
There is more initial chaos that we can measure with performance counters for your benchmark. On my i7-6700k (quad core Skylake with Hyperthreading; 4c8t, running Linux 5.16), I improved the source so I could compile with optimization without defeating the memory access, and with a CPP macro so I could set the stride (in bytes) from the compiler command line. Note the ~500 memory-order mis-speculation pipeline nukes (machine_clears.memory_ordering
) when we use adjacent lines. Actual number is quite variable, from 200 to 850, but there's still negligible effect on the overall time.
Adjacent lines, 500 +- 300 machine clears
$ g++ -DSIZE=64 -pthread -O2 false-share.cpp && perf stat --all-user -etask-clock,context-switches,cpu-migrations,page-faults,cycles,instructions,uops_issued.any,uops_executed.thread,machine_clears.memory_ordering -r25 ./a.out
Performance counter stats for './a.out' (25 runs):
560.22 msec task-clock # 3.958 CPUs utilized ( +- 0.12% )
0 context-switches # 0.000 /sec
0 cpu-migrations # 0.000 /sec
126 page-faults # 224.752 /sec ( +- 0.35% )
2,180,391,747 cycles # 3.889 GHz ( +- 0.12% )
2,003,039,378 instructions # 0.92 insn per cycle ( +- 0.00% )
1,604,118,661 uops_issued.any # 2.861 G/sec ( +- 0.00% )
2,003,739,959 uops_executed.thread # 3.574 G/sec ( +- 0.00% )
494 machine_clears.memory_ordering # 881.172 /sec ( +- 9.00% )
0.141534 +- 0.000342 seconds time elapsed ( +- 0.24% )
vs with 128-byte separation, only a very few machine clears
$ g++ -DSIZE=128 -pthread -O2 false-share.cpp && perf stat --all-user -etask-clock,context-switches,cpu-migrations,page-faults,cycles,instructions,uops_issued.any,uops_executed.thread,machine_clears.memory_ordering -r25 ./a.out
Performance counter stats for './a.out' (25 runs):
560.01 msec task-clock # 3.957 CPUs utilized ( +- 0.13% )
0 context-switches # 0.000 /sec
0 cpu-migrations # 0.000 /sec
124 page-faults # 221.203 /sec ( +- 0.16% )
2,180,048,243 cycles # 3.889 GHz ( +- 0.13% )
2,003,038,553 instructions # 0.92 insn per cycle ( +- 0.00% )
1,604,084,990 uops_issued.any # 2.862 G/sec ( +- 0.00% )
2,003,707,895 uops_executed.thread # 3.574 G/sec ( +- 0.00% )
22 machine_clears.memory_ordering # 39.246 /sec ( +- 9.68% )
0.141506 +- 0.000342 seconds time elapsed ( +- 0.24% )
Presumably there's some dependence on how Linux schedules threads to logical cores on this 4c8t machine. Related:
vs. actual false-sharing within one line: 10M machine clears
The store buffer (and store forwarding) get a bunch of increments done for every false-sharing machine clear so it's not as bad as one might expect. (And would be far worse with atomic RMWs, like std::atomic<int>
fetch_add
, since there every single increment needs direct access to L1d cache as it executes.) Why does false sharing still affect non atomics, but much less than atomics?
$ g++ -DSIZE=4 -pthread -O2 false-share.cpp && perf stat --all-user -etask-clock,context-switches,cpu-migrations,page-faults,cycles,instructions,uops_issued.any,uops_executed.thread,machine_clears.memory_ordering -r25 ./a.out
Performance counter stats for './a.out' (25 runs):
809.98 msec task-clock # 3.835 CPUs utilized ( +- 0.42% )
0 context-switches # 0.000 /sec
0 cpu-migrations # 0.000 /sec
122 page-faults # 152.953 /sec ( +- 0.22% )
3,152,973,230 cycles # 3.953 GHz ( +- 0.42% )
2,003,038,681 instructions # 0.65 insn per cycle ( +- 0.00% )
2,868,628,070 uops_issued.any # 3.596 G/sec ( +- 0.41% )
2,934,059,729 uops_executed.thread # 3.678 G/sec ( +- 0.30% )
10,810,169 machine_clears.memory_ordering # 13.553 M/sec ( +- 0.90% )
0.21123 +- 0.00124 seconds time elapsed ( +- 0.59% )
Improved benchmark- align the array, and volatile to allow optimization
I used volatile
so I could enable optimization. I assume you compiled with optimization disabled, so int j
was also getting stored/reloaded inside the loop.
And I used alignas(128) counter[]
so we'd be sure the start of the array was in two pairs of 128-byte lines, not spread across three.
#include <thread>
alignas(128) volatile int counter[1024]{};
void update(int idx) {
for (int j = 0; j < 100000000; j++) ++counter[idx];
}
static const int stride = SIZE/sizeof(counter[0]);
int main() {
std::thread t1(update, 0*stride);
std::thread t2(update, 1*stride);
std::thread t3(update, 2*stride);
std::thread t4(update, 3*stride);
t1.join();
t2.join();
t3.join();
t4.join();
}