Should the cache padding size of x86-64 be 128 bytes?
I found a comment from crossbeam.

Starting from Intel's Sandy Bridge, spatial prefetcher is now pulling pairs of 64-byte cache lines at a time, so we have to align to 128 bytes rather than 64.

I couldn't find that claim in Intel's manual. But as of its latest commit, folly still uses 128-byte padding, which makes the claim convincing to me. So I wrote some code to see if I could observe this behavior. Here is my code.

#include <thread>

int counter[1024]{};

void update(int idx) {
    for (int j = 0; j < 100000000; j++) ++counter[idx];
}

int main() {
    std::thread t1(update, 0);
    std::thread t2(update, 1);
    std::thread t3(update, 2);
    std::thread t4(update, 3);
    t1.join();
    t2.join();
    t3.join();
    t4.join();
}

(Code on Compiler Explorer.)

My CPU is Ryzen 3700X. When the indexes are 0, 1, 2, 3, it takes ~1.2s to finish. When the indexes are 0, 16, 32, 48, it takes ~200ms to finish. When the indexes are 0, 32, 64, 96, it takes ~200ms to finish, which is exactly the same as before. I also tested them on an Intel machine and it gave me a similar result.

From this micro bench I can't see the reason to use 128 bytes padding over 64 bytes padding. Did I get anything wrong?

Circumcise answered 5/5, 2022 at 11:44 Comment(9)
Intel's optimization manual describe the L2 spatial prefetcher in SnB-family CPUs. Yes, it tries to complete 128B-aligned pairs of 64B lines, when there's spare memory bandwidth (off-core request tracking slots) when the first line is getting pulled in.Iey
@PeterCordes Thanks! In what situation, 128 bytes padding will outperform 64 bytes padding?Circumcise
@Circumcise Some AMD GPU OpenCL drivers were working better with 4096-byte alignment for the host buffers, but that may have changed. I guess it's about the DMA unit working between VRAM and RAM. I don't know if DMA is used between two CPUs on the same motherboard (instead of their internal link), but when I had 2 GPUs the higher alignment worked better, letting a single thread use a larger percentage of RAM bandwidth.Punjab
It says spatial prefetcher. Your experiment is designed to measure false sharing, unrelated to prefetching!Facial
"My CPU is Ryzen 3700X." Then it is not Intel at all but AMD, and advice for x86-64 processors is not relevant.Weisshorn
@KarlKnechtel: Ryzen 3700X is an x86-64; that includes Intel, AMD, Via, and Zhaoxin (possibly other vendors). It's true that advice for other x86-64 CPUs may not be applicable. (And software intended to run well across x86-64 CPUs in general has to care about things that are slow on some but not all x86-64 CPUs.) But note that the question says they also tested on an Intel; as my answer shows, this benchmark is not designed in a way that will really suffer due to the L2 spatial prefetcher.Iey
o_O I'm not accustomed to the terminology being used that way. I must have missed something along the way.Weisshorn
@user253751: Spatial prefetching (of adjacent lines) is very much related to false sharing. It can cause false sharing between objects in adjacent lines, not just the same line. (But in more limited circumstances, and not in an ongoing manner if the cache lines aren't continuing to bounce around between cores; see my answer).Iey
@KarlKnechtel: See The most correct way to refer to 32-bit and 64-bit versions of programs for x86-related CPUs?. If you want a term for specifically Intel's implementation of x86-64, there's IA-32e or Intel64. x86-64 is pretty much always used to mean the lowest common denominator between Intel and AMD. (e.g. I could say x86-64 guarantees that loads/stores to cacheable memory are atomic if they have a power-of-2 size and are contained within one aligned 8-byte qword. Only Intel guarantees atomicity for any misalignment within 1 cache line.)Iey
I
18

Intel's optimization manual does describe the L2 spatial prefetcher in SnB-family CPUs. Yes, it tries to complete 128B-aligned pairs of 64B lines, when there's spare memory bandwidth (off-core request tracking slots) when the first line is getting pulled in.

Your microbenchmark doesn't show any significant time difference between 64 vs. 128 byte separation. Without any actual false sharing (within the same 64 byte line), after some initial chaos, it quickly reaches a state where each core has exclusive ownership of the cache line it's modifying. That means no further L1d misses, thus no requests to L2 which would trigger the L2 spatial prefetcher.

Contrast that with, for example, two pairs of threads contending over separate atomic<int> variables in adjacent (or non-adjacent) cache lines, or false-sharing with them. Then L2 spatial prefetching could couple the contention together, so all 4 threads contend with each other instead of forming 2 independent pairs. Basically, in any case where cache lines actually are bouncing back and forth between cores, L2 spatial prefetching can make it worse if you aren't careful.

(The L2 prefetcher doesn't keep trying indefinitely to complete pairs of lines of every valid line it's caching; that would hurt cases like this where different cores are repeatedly touching adjacent lines, more than it helps anything.)

Understanding std::hardware_destructive_interference_size and std::hardware_constructive_interference_size includes an answer with a longer benchmark; I haven't looked at it recently, but I think it's supposed to demo destructive interference at 64 bytes but not 128. The answer there unfortunately doesn't mention L2 spatial prefetching as one of the effects that can cause some destructive interference (although not as much as a 128-byte line size in an outer level of cache, especially not if it's an inclusive cache).


Perf counters reveal a difference even with your benchmark

There is more initial chaos, which we can measure with performance counters for your benchmark. On my i7-6700k (quad-core Skylake with Hyperthreading; 4c8t, running Linux 5.16), I improved the source so I could compile with optimization without defeating the memory access, and added a CPP macro so I could set the stride (in bytes) from the compiler command line. Note the ~500 memory-order mis-speculation pipeline nukes (machine_clears.memory_ordering) when we use adjacent lines. The actual number is quite variable, from 200 to 850, but there's still negligible effect on the overall time.

Adjacent lines, 500 +- 300 machine clears

$ g++ -DSIZE=64 -pthread -O2 false-share.cpp && perf stat --all-user -etask-clock,context-switches,cpu-migrations,page-faults,cycles,instructions,uops_issued.any,uops_executed.thread,machine_clears.memory_ordering -r25 ./a.out 

 Performance counter stats for './a.out' (25 runs):

            560.22 msec task-clock                #    3.958 CPUs utilized            ( +-  0.12% )
                 0      context-switches          #    0.000 /sec                   
                 0      cpu-migrations            #    0.000 /sec                   
               126      page-faults               #  224.752 /sec                     ( +-  0.35% )
     2,180,391,747      cycles                    #    3.889 GHz                      ( +-  0.12% )
     2,003,039,378      instructions              #    0.92  insn per cycle           ( +-  0.00% )
     1,604,118,661      uops_issued.any           #    2.861 G/sec                    ( +-  0.00% )
     2,003,739,959      uops_executed.thread      #    3.574 G/sec                    ( +-  0.00% )
               494      machine_clears.memory_ordering #  881.172 /sec                     ( +-  9.00% )

          0.141534 +- 0.000342 seconds time elapsed  ( +-  0.24% )

vs with 128-byte separation, only a very few machine clears

$ g++ -DSIZE=128 -pthread -O2 false-share.cpp && perf stat --all-user -etask-clock,context-switches,cpu-migrations,page-faults,cycles,instructions,uops_issued.any,uops_executed.thread,machine_clears.memory_ordering -r25 ./a.out 

 Performance counter stats for './a.out' (25 runs):

            560.01 msec task-clock                #    3.957 CPUs utilized            ( +-  0.13% )
                 0      context-switches          #    0.000 /sec                   
                 0      cpu-migrations            #    0.000 /sec                   
               124      page-faults               #  221.203 /sec                     ( +-  0.16% )
     2,180,048,243      cycles                    #    3.889 GHz                      ( +-  0.13% )
     2,003,038,553      instructions              #    0.92  insn per cycle           ( +-  0.00% )
     1,604,084,990      uops_issued.any           #    2.862 G/sec                    ( +-  0.00% )
     2,003,707,895      uops_executed.thread      #    3.574 G/sec                    ( +-  0.00% )
                22      machine_clears.memory_ordering #   39.246 /sec                     ( +-  9.68% )

          0.141506 +- 0.000342 seconds time elapsed  ( +-  0.24% )

Presumably there's some dependence on how Linux schedules threads to logical cores on this 4c8t machine.

vs. actual false-sharing within one line: 10M machine clears

The store buffer (and store forwarding) get a bunch of increments done for every false-sharing machine clear, so it's not as bad as one might expect. (And it would be far worse with atomic RMWs, like std::atomic<int> fetch_add, since there every single increment needs direct access to L1d cache as it executes.) Why does false sharing still affect non atomics, but much less than atomics?

$ g++ -DSIZE=4 -pthread -O2 false-share.cpp && perf stat --all-user -etask-clock,context-switches,cpu-migrations,page-faults,cycles,instructions,uops_issued.any,uops_executed.thread,machine_clears.memory_ordering -r25 ./a.out 

 Performance counter stats for './a.out' (25 runs):

            809.98 msec task-clock                #    3.835 CPUs utilized            ( +-  0.42% )
                 0      context-switches          #    0.000 /sec                   
                 0      cpu-migrations            #    0.000 /sec                   
               122      page-faults               #  152.953 /sec                     ( +-  0.22% )
     3,152,973,230      cycles                    #    3.953 GHz                      ( +-  0.42% )
     2,003,038,681      instructions              #    0.65  insn per cycle           ( +-  0.00% )
     2,868,628,070      uops_issued.any           #    3.596 G/sec                    ( +-  0.41% )
     2,934,059,729      uops_executed.thread      #    3.678 G/sec                    ( +-  0.30% )
        10,810,169      machine_clears.memory_ordering #   13.553 M/sec                    ( +-  0.90% )

           0.21123 +- 0.00124 seconds time elapsed  ( +-  0.59% )

Improved benchmark: align the array, and use volatile to allow optimization

I used volatile so I could enable optimization. I assume you compiled with optimization disabled, so int j was also getting stored/reloaded inside the loop.

And I used alignas(128) counter[] so we'd be sure the start of the array was in two pairs of 128-byte lines, not spread across three.

#include <thread>

alignas(128) volatile int counter[1024]{};

void update(int idx) {
    for (int j = 0; j < 100000000; j++) ++counter[idx];
}

static const int stride = SIZE/sizeof(counter[0]);  // SIZE is the byte separation, set with -DSIZE= on the command line
int main() {
    std::thread t1(update, 0*stride);
    std::thread t2(update, 1*stride);
    std::thread t3(update, 2*stride);
    std::thread t4(update, 3*stride);
    t1.join();
    t2.join();
    t3.join();
    t4.join();
}
Iey answered 5/5, 2022 at 12:30 Comment(3)
Following your instructions, I created another benchmark. And it does show a significant difference between 64 bytes and 128 bytes on Intel but not on AMD. I noticed that on both platforms the result sometimes varies a lot. I think that is related to the scheduler.Circumcise
@QuarticCat: Interesting, thanks for confirming my guess that true sharing between pairs of threads would be a problem for Intel CPUs if using adjacent lines. And interesting that it's not for AMD. BTW, it's possible to enable/disable HW prefetchers via MSRs, so one could even verify that disabling the L2 spatial prefetcher (or all L2 prefetch) removed the performance penalty on Intel. (e.g. Correctly disable Hardware Prefetching with MSR in Skylake / Is L2 HW prefetcher really helpful?)Iey
@PeterCordes mail-archive.com/[email protected]/msg265387.html. I think it's not fully clear when the spatial prefetcher actually can cause ping-ponging. IIRC, past benchmarks I've run have also been super ambiguous and varied a lot even within Intel HW, based on the microarch.Tinner

© 2022 - 2024 — McMap. All rights reserved.