False sharing and 128-byte alignment/padding

While doing some research on lock-free/wait-free algorithms, I stumbled upon the false sharing problem. Digging a bit more led me to Folly's source code (Facebook's C++ library), and more specifically to this header file and the definition of the FOLLY_ALIGN_TO_AVOID_FALSE_SHARING macro (currently at line 130). What surprised me most at first glance was the value: 128 (i.e. instead of 64)...

/// An attribute that will cause a variable or field to be aligned so that
/// it doesn't have false sharing with anything at a smaller memory address.
#define FOLLY_ALIGN_TO_AVOID_FALSE_SHARING __attribute__((__aligned__(128)))

AFAIK, cache lines on modern CPUs are 64 bytes long, and every resource I have found so far on the matter, including this article from Intel, talks about 64-byte alignment and padding to help work around false sharing.

Still, the folks at Facebook align and pad their class members to 128 bytes when needed. Then I found the beginning of an explanation just above FOLLY_ALIGN_TO_AVOID_FALSE_SHARING's definition:

enum {
    /// Memory locations on the same cache line are subject to false
    /// sharing, which is very bad for performance.  Microbenchmarks
    /// indicate that pairs of cache lines also see interference under
    /// heavy use of atomic operations (observed for atomic increment on
    /// Sandy Bridge).  See FOLLY_ALIGN_TO_AVOID_FALSE_SHARING
    kFalseSharingRange = 128
};

While it gives a bit more detail, I still feel I need more insight. I'm curious how the synchronization of contiguous cache lines, or any RMW operation on them, could interfere with one another under heavy use of atomic operations. Can someone please enlighten me on how this can even possibly happen?
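To make the scenario concrete, here is a minimal sketch (names are mine, not Folly's) of the kind of layout the macro is meant to produce: each member gets its own 128-byte block, i.e. its own aligned *pair* of 64-byte cache lines.

```cpp
#include <atomic>
#include <cstddef>

// Without padding, both counters likely share one 64-byte cache line,
// so two threads updating them independently still contend (false sharing).
struct Unpadded {
    std::atomic<long> a{0};
    std::atomic<long> b{0};   // likely on the same line as `a`
};

// Aligning each member to 128 bytes keeps them not merely on separate
// cache lines, but on separate aligned pairs of lines, which is the
// interference range the Folly comment describes.
struct Padded {
    alignas(128) std::atomic<long> a{0};
    alignas(128) std::atomic<long> b{0};
};

static_assert(alignof(Padded) == 128, "struct inherits the member alignment");
static_assert(sizeof(Padded) >= 256, "each member occupies its own 128-byte block");
```

Note that `alignas(128)` here plays the same role as `__attribute__((__aligned__(128)))` in the macro, just in standard C++11 spelling.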

Socialism answered 22/3, 2015 at 20:59 Comment(5)
I enjoyed reading Herb Sutter's explanation at Dr Dobbs some time ago: drdobbs.com/parallel/eliminate-false-sharing/217500206 Brightwork
An important concept here is "cache associativity"Rusell
Intel optimization manual, chapter 2.1.5.4. The spatial prefetcher strives to keep pairs of cache lines in the L2 cache.Nought
@Hans: I would accept that as an answer (found it in chapter 2.2.5.4 though).Socialism
Yours is probably more up-to-date. You can write your own answer.Nought

As Hans pointed out in a comment, some info about this can be found in "Intel® 64 and IA-32 architectures optimization reference manual", in section 3.7.3 "Hardware Prefetching for Second-Level Cache", about the Intel Core microarchitecture:

"Streamer — Loads data or instructions from memory to the second-level cache. To use the streamer, organize the data or instructions in blocks of 128 bytes, aligned on 128 bytes. The first access to one of the two cache lines in this block while it is in memory triggers the streamer to prefetch the pair line."

Thurman answered 14/4, 2019 at 8:52 Comment(2)
In modern microarchitectures like Sandybridge-family, it's actually the L2 spatial prefetcher that likes to complete aligned pairs of cache lines; the streamer is separate. (Intel Core is ancient, like Core2Duo Conroe and Penryn from ~2007.)Attribute
bytes aligned and false sharing cause performance diff on x86-64 has benchmarks on some unknown recent x86 microarchitecture. See also Understanding std::hardware_destructive_interference_size and std::hardware_constructive_interference_size for more details about why 128 makes sense on modern x86.Attribute

It seems that, while Intel uses 64-byte cache lines, various other architectures use 128-byte cache lines... for example:

http://goo.gl/8L6cUl

Power Systems use 128-byte length cache lines. Compared to Intel processors (64-byte cache lines), these larger cache lines have...

I've found notes scattered around the internet indicating that other architectures, even old ones, do the same:

http://goo.gl/iNAZlX

SGI MIPS R10000 Processor in the Origin Computer

The processor has a cache line size of 128 bytes.

So probably the Facebook programmers wanted to play it safe and didn't want a big collection of #define/#if blocks keyed on processor architecture, with the risk that some newer Intel processor would have a 128-byte cache line and no one would remember to correct the code.

Contrasty answered 23/3, 2015 at 6:17 Comment(1)
Still, pair of cache line also see interference ... observed for atomic increment on Sandy Bridge ... It really seems they updated the FOLLY_ALIGN_TO_AVOID_FALSE_SHARING value because of a series of tests on that particular architecture. i.e.: 64 bytes alignment and padding wasn't enough according to these results.Socialism

Whether you use atomic operations or not, the cache has a "cache line", which is the smallest unit the cache operates on. This ranges from 32 to 128 bytes, depending on processor model. False sharing is when elements within the same cache line are "shared" between different threads (running on different processors[1]). When this happens, one processor updating "its value" forces all other processors to discard their copy of that line.

It gets worse with atomic operations, because to perform an atomic operation the processor must first ensure all other processors have discarded their copies before it can update the value (so that no other processor uses a stale value). This requires many cache-coherence messages to propagate through the system, and the other processors must then reload the values they previously had in their caches.

So, from a performance perspective, if you have variables that are each used by a single thread, separate them onto their own cache line (in the example in the original post, assumed to be 128 bytes) by aligning the data to that value. That way each lump of data starts on a cache-line boundary and no other processor will "share" it (unless you are genuinely sharing the data between threads, at which point you HAVE to accept the cache-coherence traffic needed to keep the data correctly updated between processors).
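As a side note on the "separate them out" advice: C++17 exposes the recommended separation distance as a constant, `std::hardware_destructive_interference_size`. The sketch below (my own, not from this answer) uses it with a conservative fallback; note that on many x86 toolchains it reports 64, which is precisely why Folly chose to hard-code 128 rather than trust the line size alone.

```cpp
#include <atomic>
#include <cstddef>
#include <new>  // std::hardware_destructive_interference_size (C++17)

// Pick a padding distance: the library's value if available, else a
// conservative 128 bytes covering prefetcher-paired cache lines.
#ifdef __cpp_lib_hardware_interference_size
inline constexpr std::size_t kPad = std::hardware_destructive_interference_size;
#else
inline constexpr std::size_t kPad = 128;  // conservative fallback
#endif

// Per-thread counter guaranteed to start on its own padded boundary,
// so counters owned by different threads never false-share.
struct PerThread {
    alignas(kPad) std::atomic<long> counter{0};
};
```

An array of `PerThread` then gives each thread an element that occupies its own padded block.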

[1] Or processor cores in modern CPU with multiple cores. For simplicity, I've used the term "processor" or "processors" to correspond to either real processor sockets or processor cores within one socket. For this discussion, the distinction is pretty much irrelevant.

Emancipated answered 22/3, 2015 at 21:23 Comment(4)
I understand the concept of false-sharing. My question was about the value they use to do the alignment and padding (128 bytes) vs. the architecture that forced them to choose this value (i.e.: Sandy Bridge, which has 64 bytes cache lines AFAIK).Socialism
I thought some of the older Intel architectures also used 128-byte cache-lines. Maybe they haven't ONLY got processors produced in the last few years?Emancipated
You've made a thorough answer and I thank you for that. But please read my admittedly-too-long-and-probably-unclear question: Microbenchmarks indicate that pairs of cache lines also see interference ... (observed ... on Sandy Bridge). There's not much room for doubt concerning the architecture here. They changed the alignment to 128 bytes because of bad results with Sandy Bridge. What I would like to understand is why and how a larger padding reduces false sharing in that particular case. @Hans' comment on the question might be of interest for you, as it has been for me.Socialism
Unfortunately, I'm not particularly familiar with the latest Intel processors. When I was doing benchmarking at AMD some 12-15 years back, I was quite in tune with the differences between models. Not so much these days, as my home machines all have AMD processors (still loyal to the brand where I worked about 10 years ago) and I've only been using ARM processors at work for the last 8 or so years. [Well, my work machine itself has had an Intel processor of some variant or other, but I've not really paid much attention to what it is or how it behaves]Emancipated

© 2022 - 2024 — McMap. All rights reserved.