Should the cache padding size of x86-64 be 128 bytes?

I found a comment from crossbeam.

Starting from Intel's Sandy Bridge, spatial prefetcher is now pulling pairs of 64-byte cache lines at a time, so we have to align to 128 bytes rather than 64.

Sources:

https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf

https://github.com/facebook/folly/blob/1b5288e6eea6df074758f877c849b6e73bbb9fbb/folly/lang/Align.h#L107

I didn't find the line in Intel's manual saying that. But until the latest commit, folly still uses 128 bytes padding, which makes it convincing to me. So I started to write code to see if I can observe this behavior. Here is my code.

#include <thread>

int counter[1024]{};

void update(int idx) {
    for (int j = 0; j < 100000000; j++) ++counter[idx];
}

int main() {
    std::thread t1(update, 0);
    std::thread t2(update, 1);
    std::thread t3(update, 2);
    std::thread t4(update, 3);
    t1.join();
    t2.join();
    t3.join();
    t4.join();
}

Compiler Explorer

My CPU is Ryzen 3700X. When the indexes are 0, 1, 2, 3, it takes ~1.2s to finish. When the indexes are 0, 16, 32, 48, it takes ~200ms to finish. When the indexes are 0, 32, 64, 96, it takes ~200ms to finish, which is exactly the same as before. I also tested them on an Intel machine and it gave me a similar result.

From this micro bench I can't see the reason to use 128 bytes padding over 64 bytes padding. Did I get anything wrong?

Intel's optimization manual does describe the L2 spatial prefetcher in SnB-family CPUs. Yes, it tries to complete 128B-aligned pairs of 64B lines, when there's spare memory bandwidth (off-core request tracking slots) when the first line is getting pulled in.

Your microbenchmark doesn't show any significant time difference between 64 vs. 128 byte separation. Without any actual false sharing (within the same 64 byte line), after some initial chaos, it quickly reaches a state where each core has exclusive ownership of the cache line it's modifying. That means no further L1d misses, thus no requests to L2 which would trigger the L2 spatial prefetcher.

Unlike if for example two pairs of threads contending over separate atomic<int> variables in adjacent (or not) cache lines. Or false-sharing with them. Then L2 spatial prefetching could couple the contention together, so all 4 threads are contending with each other instead of 2 independent pairs. Basically any case where cache lines actually are bouncing back and forth between cores, L2 spatial prefetching can make it worse if you aren't careful.

(The L2 prefetcher doesn't keep trying indefinitely to complete pairs of lines of every valid line it's caching; that would hurt cases like this where different cores are repeatedly touching adjacent lines, more than it helps anything.)

Understanding std::hardware_destructive_interference_size and std::hardware_constructive_interference_size includes an answer with a longer benchmark; I haven't looked at it recently, but I think it's supposed to demo destructive interference at 64 bytes but not 128. The answer there unfortunately doesn't mention L2 spatial prefetching as one of the effects that can cause some destructive interference (although not as much as a 128-byte line size in an outer level of cache, especially not if it's an inclusive cache).

Perf counters reveal a difference even with your benchmark

There is more initial chaos that we can measure with performance counters for your benchmark. On my i7-6700k (quad core Skylake with Hyperthreading; 4c8t, running Linux 5.16), I improved the source so I could compile with optimization without defeating the memory access, and with a CPP macro so I could set the stride (in bytes) from the compiler command line. Note the ~500 memory-order mis-speculation pipeline nukes (machine_clears.memory_ordering) when we use adjacent lines. Actual number is quite variable, from 200 to 850, but there's still negligible effect on the overall time.

Adjacent lines, 500 +- 300 machine clears

$ g++ -DSIZE=64 -pthread -O2 false-share.cpp && perf stat --all-user -etask-clock,context-switches,cpu-migrations,page-faults,cycles,instructions,uops_issued.any,uops_executed.thread,machine_clears.memory_ordering -r25 ./a.out 

 Performance counter stats for './a.out' (25 runs):

            560.22 msec task-clock                #    3.958 CPUs utilized            ( +-  0.12% )
                 0      context-switches          #    0.000 /sec                   
                 0      cpu-migrations            #    0.000 /sec                   
               126      page-faults               #  224.752 /sec                     ( +-  0.35% )
     2,180,391,747      cycles                    #    3.889 GHz                      ( +-  0.12% )
     2,003,039,378      instructions              #    0.92  insn per cycle           ( +-  0.00% )
     1,604,118,661      uops_issued.any           #    2.861 G/sec                    ( +-  0.00% )
     2,003,739,959      uops_executed.thread      #    3.574 G/sec                    ( +-  0.00% )
               494      machine_clears.memory_ordering #  881.172 /sec                     ( +-  9.00% )

          0.141534 +- 0.000342 seconds time elapsed  ( +-  0.24% )

vs with 128-byte separation, only a very few machine clears

$ g++ -DSIZE=128 -pthread -O2 false-share.cpp && perf stat --all-user -etask-clock,context-switches,cpu-migrations,page-faults,cycles,instructions,uops_issued.any,uops_executed.thread,machine_clears.memory_ordering -r25 ./a.out 

 Performance counter stats for './a.out' (25 runs):

            560.01 msec task-clock                #    3.957 CPUs utilized            ( +-  0.13% )
                 0      context-switches          #    0.000 /sec                   
                 0      cpu-migrations            #    0.000 /sec                   
               124      page-faults               #  221.203 /sec                     ( +-  0.16% )
     2,180,048,243      cycles                    #    3.889 GHz                      ( +-  0.13% )
     2,003,038,553      instructions              #    0.92  insn per cycle           ( +-  0.00% )
     1,604,084,990      uops_issued.any           #    2.862 G/sec                    ( +-  0.00% )
     2,003,707,895      uops_executed.thread      #    3.574 G/sec                    ( +-  0.00% )
                22      machine_clears.memory_ordering #   39.246 /sec                     ( +-  9.68% )

          0.141506 +- 0.000342 seconds time elapsed  ( +-  0.24% )

Presumably there's some dependence on how Linux schedules threads to logical cores on this 4c8t machine. Related:

What are the latency and throughput costs of producer-consumer sharing of a memory location between hyper-siblings versus non-hyper siblings? - actual false sharing within one line is much worse for logical cores sharing a physical core, but for adjacent lines there's probably no effect: the L2 is the same per physical core, and both lines will stay hot in L1d.
Why flush the pipeline for Memory Order Violation caused by other logical processors?

vs. actual false-sharing within one line: 10M machine clears

The store buffer (and store forwarding) get a bunch of increments done for every false-sharing machine clear so it's not as bad as one might expect. (And would be far worse with atomic RMWs, like std::atomic<int> fetch_add, since there every single increment needs direct access to L1d cache as it executes.) Why does false sharing still affect non atomics, but much less than atomics?

$ g++ -DSIZE=4 -pthread -O2 false-share.cpp && perf stat --all-user -etask-clock,context-switches,cpu-migrations,page-faults,cycles,instructions,uops_issued.any,uops_executed.thread,machine_clears.memory_ordering -r25 ./a.out 

 Performance counter stats for './a.out' (25 runs):

            809.98 msec task-clock                #    3.835 CPUs utilized            ( +-  0.42% )
                 0      context-switches          #    0.000 /sec                   
                 0      cpu-migrations            #    0.000 /sec                   
               122      page-faults               #  152.953 /sec                     ( +-  0.22% )
     3,152,973,230      cycles                    #    3.953 GHz                      ( +-  0.42% )
     2,003,038,681      instructions              #    0.65  insn per cycle           ( +-  0.00% )
     2,868,628,070      uops_issued.any           #    3.596 G/sec                    ( +-  0.41% )
     2,934,059,729      uops_executed.thread      #    3.678 G/sec                    ( +-  0.30% )
        10,810,169      machine_clears.memory_ordering #   13.553 M/sec                    ( +-  0.90% )

           0.21123 +- 0.00124 seconds time elapsed  ( +-  0.59% )

Improved benchmark- align the array, and volatile to allow optimization

I used volatile so I could enable optimization. I assume you compiled with optimization disabled, so int j was also getting stored/reloaded inside the loop.

And I used alignas(128) counter[] so we'd be sure the start of the array was in two pairs of 128-byte lines, not spread across three.

#include <thread>

alignas(128) volatile int counter[1024]{};

void update(int idx) {
    for (int j = 0; j < 100000000; j++) ++counter[idx];
}

static const int stride = SIZE/sizeof(counter[0]);
int main() {
    std::thread t1(update, 0*stride);
    std::thread t2(update, 1*stride);
    std::thread t3(update, 2*stride);
    std::thread t4(update, 3*stride);
    t1.join();
    t2.join();
    t3.join();
    t4.join();
}

Perf counters reveal a difference even with your benchmark

Adjacent lines, 500 +- 300 machine clears

vs with 128-byte separation, only a very few machine clears

vs. actual false-sharing within one line: 10M machine clears

Improved benchmark- align the array, and volatile to allow optimization

Recommended topics

Hot tags