Aligning to cache line and knowing the cache line size

7

72

To prevent false sharing, I want to align each element of an array to a cache line. So first I need to know the size of a cache line, so that I can give each element that many bytes. Secondly, I want the start of the array to be aligned to a cache line.

I am using Linux on an 8-core x86 platform. First, how do I find the cache line size? Secondly, how do I align to a cache line in C? I am using the gcc compiler.

For example, assuming a cache line size of 64, the layout would be as follows.

element[0] occupies bytes 0-63
element[1] occupies bytes 64-127
element[2] occupies bytes 128-191

and so on, assuming of course that 0-63 is aligned to a cache line.

Rival answered 2/9, 2011 at 9:43 Comment(4)

Perhaps this can help: #795132 – Evocator 2/9, 2011 at 9:46

But it doesn't show how to align to a cache using gcc. – Rival 2/9, 2011 at 9:53

Possible duplicate of Programmatically get the cache line size? – Cecelia 16/3, 2017 at 7:56

It's not a bad idea to use a compile-time constant of 64 bytes as the cache-line size, so the compiler can bake that into functions that care about it. Making the compiler generate code for a runtime-variable cache line size could eat up some of the benefit of aligning things, especially in cases of auto-vectorization where it helps the compiler make better code if it knows a pointer is aligned to a cache line width (which is wider than the SIMD vector width). – Rebba 12/3, 2018 at 4:32

42

To find the size, you need to look it up in the documentation for the processor; AFAIK there is no programmatic way to do it. On the plus side, most cache lines are of a standard size, following Intel's conventions. On x86, cache lines are 64 bytes. However, to prevent false sharing, you need to follow the guidelines of the processor you are targeting (Intel has some special notes on its NetBurst-based processors). Generally you need to align to 64 bytes for this (Intel states that you should also avoid crossing 16-byte boundaries).

To do this in C or C++, use the standard aligned_alloc function or one of the compiler-specific specifiers such as __attribute__((aligned(64))) or __declspec(align(64)). To pad between members in a struct so that they land on different cache lines, you need to insert a member big enough to push the next member to the next 64-byte boundary.
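As a rough sketch of the two approaches above with gcc: the struct name, the helper names and the hard-coded 64 are assumptions made for illustration, not something the processor or the compiler guarantees. The aligned attribute also rounds sizeof up to a multiple of the alignment, so consecutive array elements should each get their own line.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Assumption: 64-byte lines, typical for x86-64; check your target CPU. */
#define CACHE_LINE_SIZE 64

/* Hypothetical per-thread counter. The GCC/Clang aligned attribute sets the
   alignment to 64 and pads sizeof up to 64, so elements never share a line. */
struct padded_counter {
    long value;
} __attribute__((aligned(CACHE_LINE_SIZE)));

/* Compile-time checks (C11 static_assert from <assert.h>). */
static_assert(sizeof(struct padded_counter) == CACHE_LINE_SIZE,
              "one element per cache line");
static_assert(_Alignof(struct padded_counter) == CACHE_LINE_SIZE,
              "elements start on a line boundary");

/* Allocate n counters on the heap, starting on a cache line boundary.
   aligned_alloc (C11) wants the size to be a multiple of the alignment,
   which holds here because sizeof is already a multiple of 64.
   The caller releases the memory with free(). */
static struct padded_counter *alloc_counters(size_t n)
{
    return aligned_alloc(CACHE_LINE_SIZE, n * sizeof(struct padded_counter));
}

/* Returns nonzero if p starts on a cache line boundary. */
static int is_line_aligned(const void *p)
{
    return ((uintptr_t)p % CACHE_LINE_SIZE) == 0;
}
```

For a static or stack array, declaring `struct padded_counter counters[8];` should be enough, since the array inherits the alignment of its element type. For the heap, `alloc_counters(8)` returns a block whose element i starts at byte offset 64 * i.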

Historiographer answered 2/9, 2011 at 9:50 Comment(12)

But how do I align to a cache line in c? – Rival 2/9, 2011 at 9:52

@MetallicPriest: updated my post a bit (note: there was an error in cache line size, align to 64 bytes, not 16, 16 bytes is to prevent splitting) – Historiographer 2/9, 2011 at 10:5

@MetallicPriest: gcc and g++ both support __attribute__ – Sidhu 2/9, 2011 at 10:6

Is memory mapped by mmap, aligned too? – Rival 2/9, 2011 at 10:33

@MetallicPriest: mmap & VirtualAlloc allocate page-aligned memory. Generally the allocation granularity is 64 KB (under Windows), and since 64 KB is a multiple of 64, it will be aligned properly. – Historiographer 2/9, 2011 at 10:45

You can get the cache line size programmatically. Check here. Also, you cannot generalize to having 64-byte cache lines on x86. It is only true for recent ones. – Aneroidograph 20/6, 2012 at 22:11

@tothphu: a more portable way to get it is via CPUID, and as of many revisions of the Intel guides, cache lines have been 64 bytes, IIRC even the P4 (which is now ancient) had 64 byte cachelines (in fact, it did, see: osronline.com/article.cfm?article=273). also there is no need to spam the link, rather just edit your comment. – Historiographer 21/6, 2012 at 7:16

@Historiographer I seem to remember reading 32 bytes somewhere in the Core Duo timeframe, but my memory is probably deceiving me. As for the comment, I couldn't edit it because I had crossed the 5-minute limit. – Aneroidograph 22/6, 2012 at 7:52

C++11 adds alignas, which is a portable way of specifying alignment – Pallaton 19/10, 2018 at 2:43

@Pallaton alignas officially only supports alignment up till the size of the type std::max_align_t, which is typically the alignment requirement of a long double, aka 8 or 16 bytes - not 64 unfortunately. See for example #49373787 – Carbone 20/7, 2019 at 15:40

@CarloWood: Compilers are allowed to support over-aligned types, and in practice they do. (all of gcc, clang, MSVC, ICC support alignas(64)). True that ISO C++ only requires alignas up to alignof(max_align_t), but it also doesn't specify __declspec or __attribute__. I'd call alignas portable because in real life compilers can and do support it because it's useful. Not in the same sense that behaviour required by ISO C++ is portable, sure. – Rebba 29/11, 2019 at 13:13

@Necrolis: re: earlier comments: x86 (and x86-64) page size is 4kiB. x86-64 hugepages are 2MiB or 1GiB. Yes, everything uses 64-byte cache lines since Core 2 at least, so all x86-64. Pentium II/III did use 32-byte lines, maybe even Pentium M / Core solo/duo. Over-aligning might waste a bit of space on those ancient CPUs, but it's not a big deal. On modern CPUs, L2 spatial prefetch tries to complete an aligned pair of cache lines (128 bytes) so it can sometimes make sense to align by 128. – Rebba 29/11, 2019 at 13:17

94

I am using Linux and 8-core x86 platform. First how do I find the cache line size.

$ getconf LEVEL1_DCACHE_LINESIZE
64

Pass the value as a macro definition to the compiler.

$ gcc -DLEVEL1_DCACHE_LINESIZE=`getconf LEVEL1_DCACHE_LINESIZE` ...

At run time, sysconf(_SC_LEVEL1_DCACHE_LINESIZE) can be used to get the L1 data cache line size.
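A minimal sketch of how the two could fit together, assuming glibc on Linux: the macro name matches the -D flag above, while the fallback value of 64 and the helper name cache_line_size are assumptions for illustration. _SC_LEVEL1_DCACHE_LINESIZE is a glibc extension, and sysconf may report 0 or -1 on systems that do not expose the value, so the helper falls back to the compile-time constant in that case.

```c
#include <assert.h>
#include <unistd.h>

/* Normally set on the command line with
   -DLEVEL1_DCACHE_LINESIZE=`getconf LEVEL1_DCACHE_LINESIZE`.
   The fallback of 64 is an assumption, typical for x86-64. */
#ifndef LEVEL1_DCACHE_LINESIZE
#define LEVEL1_DCACHE_LINESIZE 64
#endif

/* A compile-time constant lets the compiler verify this once. */
static_assert(LEVEL1_DCACHE_LINESIZE > 0 &&
              (LEVEL1_DCACHE_LINESIZE & (LEVEL1_DCACHE_LINESIZE - 1)) == 0,
              "line size must be a positive power of two");

/* Hypothetical helper: ask the OS at run time, and fall back to the
   compile-time value when sysconf is unavailable or reports 0 or -1. */
static long cache_line_size(void)
{
    long n = -1;
#ifdef _SC_LEVEL1_DCACHE_LINESIZE
    n = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
#endif
    return n > 0 ? n : LEVEL1_DCACHE_LINESIZE;
}
```

The compile-time macro is the one to use for struct alignment and array sizing, since those need a constant expression. The run-time value is better suited to sanity checks, for example warning at startup if cache_line_size() differs from LEVEL1_DCACHE_LINESIZE.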

Gripsack answered 2/9, 2011 at 14:24 Comment(8)

Where are these sysconf()s specified? POSIX / IEEE Std 1003.1-20xx ? – Peisch 16/6, 2017 at 21:20

@BrianCain pubs.opengroup.org/onlinepubs/9699919799/functions/sysconf.html – Gripsack 16/6, 2017 at 21:59

@BrianCain I use Linux, so I just did man sysconf. Linux is not exactly POSIX compliant, so the Linux-specific documentation is often more useful. Sometimes it is out of date, so you just egrep -nH -r /usr/include -e '\b_SC'. – Gripsack 16/6, 2017 at 22:1

In case of Mac, use sysctl hw.cachelinesize. – Arminius 21/1, 2018 at 0:12

Usually it's so much better to have a compile-time-constant line size that I'd rather hard-code 64 than call sysconf. The compiler won't even know it's a power of 2, so you'll have to manually do stuff like offset = ptr & (linesize-1) for remainder or bit-scan + right-shift to implement division. You can't just use / in code that's performance-sensitive. – Rebba 29/11, 2019 at 13:29

But if you used a cross compiler, that wouldn't work, right? It would get the cache line size of your current architecture and not that of your target architecture. – Diamagnetic 12/5, 2020 at 11:14

@Diamagnetic When cross-compiling you would need to obtain that getconf LEVEL1_DCACHE_LINESIZE from your target architecture, sure. Your build system might provide it, or you'd have to hardcode it as a system-specific value into your build system. – Gripsack 11/6, 2020 at 22:34

@Diamagnetic Another method is to have arch-specific implementations in different shared libraries and load the right one at run time. More advanced users could have their own mechanisms for using arch-specific functions, but one would need to be an expert in all the details involved (which isn't rocket science, but requires a bit of thorough reading and appreciation). – Gripsack 11/6, 2020 at 22:43