Aligning to cache line and knowing the cache line size
R

7

72

To prevent false sharing, I want to align each element of an array to a cache line. So first I need to know the size of a cache line, so I assign each element that amount of bytes. Secondly I want the start of the array to be aligned to a cache line.

I am using Linux on an 8-core x86 platform. First, how do I find the cache line size? Second, how do I align to a cache line in C? I am using the gcc compiler.

So the structure would be following for example, assuming a cache line size of 64.

element[0] occupies bytes 0-63
element[1] occupies bytes 64-127
element[2] occupies bytes 128-191

and so on, assuming of course that 0-63 is aligned to a cache line.

Rival answered 2/9, 2011 at 9:43 Comment(4)
Perhaps this can help: #795132Evocator
But it doesn't show how to align to a cache using gcc.Rival
Possible duplicate of Programmatically get the cache line size?Cecelia
It's not a bad idea to use a compile-time constant of 64 bytes as the cache-line size, so the compiler can bake that into functions that care about it. Making the compiler generate code for a runtime-variable cache line size could eat up some of the benefit of aligning things, especially in cases of auto-vectorization where it helps the compiler make better code if it knows a pointer is aligned to a cache line width (which is wider than the SIMD vector width).Rebba
H
42

To know the sizes, you need to look them up in the documentation for the processor; afaik there is no portable programmatic way to do it. On the plus side, most cache lines are a standard size, following Intel's conventions. On x86, cache lines are 64 bytes. However, to prevent false sharing, you need to follow the guidelines for the processor you are targeting (Intel has some special notes on its NetBurst-based processors). Generally you need to align to 64 bytes for this (Intel states that you should also avoid crossing 16-byte boundaries).

To do this in C or C++, use the standard aligned_alloc function or one of the compiler-specific specifiers such as __attribute__((aligned(64))) or __declspec(align(64)). To pad between members of a struct so that they land on different cache lines, you need to insert a member big enough to push the next one to the next 64-byte boundary.
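A minimal sketch of the attribute approach with gcc, assuming a 64-byte line. The names CACHE_LINE_SIZE, padded_counter, counters and counter_offset are made up for illustration and are not from the answer above:

```c
#include <assert.h>   /* static_assert (C11) */
#include <stdalign.h> /* alignof */
#include <stddef.h>
#include <stdint.h>

/* Assumption: 64-byte cache lines, as on current x86. */
#define CACHE_LINE_SIZE 64

/* Hypothetical per-thread counter. The aligned attribute raises the
 * alignment of the type to 64, and sizeof is always a multiple of the
 * alignment, so the compiler pads each element out to a full line. */
struct padded_counter {
    uint64_t value;
} __attribute__((aligned(CACHE_LINE_SIZE)));

static_assert(sizeof(struct padded_counter) == CACHE_LINE_SIZE,
              "each element should occupy exactly one cache line");
static_assert(alignof(struct padded_counter) == CACHE_LINE_SIZE,
              "each element should start on a cache-line boundary");

/* One element per core on an 8-core machine. Because the element type
 * is 64-byte aligned, the array itself starts on a line boundary. */
static struct padded_counter counters[8];

/* Byte offset of element i from the start of the array. */
static size_t counter_offset(size_t i) {
    return (size_t)((uintptr_t)&counters[i] - (uintptr_t)&counters[0]);
}
```

With this layout element 0 covers bytes 0-63, element 1 covers 64-127, and so on, which matches the layout in the question. The same idea works with C11 alignas(64) on the member if you prefer the standard spelling.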

Historiographer answered 2/9, 2011 at 9:50 Comment(12)
But how do I align to a cache line in c?Rival
@MetallicPriest: updated my post a bit (note: there was an error in cache line size, align to 64 bytes, not 16, 16 bytes is to prevent splitting)Historiographer
@MetallicPriest: gcc and g++ both support __attribute__Sidhu
Is memory mapped by mmap, aligned too?Rival
@MetallicPriest: mmap & VirtualAlloc allocate page-aligned memory; generally page granularity is 64 KB (under Windows), and since 64 KB is a multiple of 64, it will be aligned properly.Historiographer
You can get the cache line size programmatically. Check here. Also, you cannot generalize to 64-byte cache lines on all x86. That is only true for recent ones.Aneroidograph
@tothphu: a more portable way to get it is via CPUID, and as of many revisions of the Intel guides, cache lines have been 64 bytes, IIRC even the P4 (which is now ancient) had 64 byte cachelines (in fact, it did, see: osronline.com/article.cfm?article=273). also there is no need to spam the link, rather just edit your comment.Historiographer
@Historiographer I seem to remember reading 32 bytes somewhere in the Core Duo timeframe, but my memory is probably deceiving me. Also, I couldn't edit the comment because I had crossed the 5-minute edit window.Aneroidograph
C++11 adds alignas, which is a portable way of specifying alignment.Pallaton
@Pallaton alignas officially only supports alignments up to that of the type std::max_align_t, which is typically the alignment requirement of a long double, aka 8 or 16 bytes - not 64 unfortunately. See for example #49373787Carbone
@CarloWood: Compilers are allowed to support over-aligned types, and in practice they do. (all of gcc, clang, MSVC, ICC support alignas(64)). True that ISO C++ only requires alignas up to alignof(max_align_t), but it also doesn't specify __declspec or __attribute__. I'd call alignas portable because in real life compilers can and do support it because it's useful. Not in the same sense that behaviour required by ISO C++ is portable, sure.Rebba
@Necrolis: re: earlier comments: x86 (and x86-64) page size is 4kiB. x86-64 hugepages are 2MiB or 1GiB. Yes, everything uses 64-byte cache lines since Core 2 at least, so all x86-64. Pentium II/III did use 32-byte lines, maybe even Pentium M / Core solo/duo. Over-aligning might waste a bit of space on those ancient CPUs, but it's not a big deal. On modern CPUs, L2 spatial prefetch tries to complete an aligned pair of cache lines (128 bytes) so it can sometimes make sense to align by 128.Rebba
G
94

I am using Linux and 8-core x86 platform. First how do I find the cache line size.

$ getconf LEVEL1_DCACHE_LINESIZE
64

Pass the value as a macro definition to the compiler.

$ gcc -DLEVEL1_DCACHE_LINESIZE=`getconf LEVEL1_DCACHE_LINESIZE` ...
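On the C side, the macro could be consumed roughly like this. It's only a sketch: the fallback of 64 and the helper names round_up_to_line and offset_in_line are my own assumptions, not part of the answer. The static_assert also catches environments where getconf prints 0.

```c
#include <assert.h> /* static_assert (C11) */
#include <stddef.h>
#include <stdint.h>

/* Normally supplied by -DLEVEL1_DCACHE_LINESIZE=...; the fallback of
 * 64 is an assumption for builds that do not pass the flag. */
#ifndef LEVEL1_DCACHE_LINESIZE
#define LEVEL1_DCACHE_LINESIZE 64
#endif

static_assert(LEVEL1_DCACHE_LINESIZE > 0 &&
              (LEVEL1_DCACHE_LINESIZE & (LEVEL1_DCACHE_LINESIZE - 1)) == 0,
              "cache line size must be a non-zero power of two");

/* Rounds n up to the next multiple of the line size. Since the size is
 * a compile-time power of two, this is a plain add and mask. */
static size_t round_up_to_line(size_t n) {
    const size_t mask = (size_t)LEVEL1_DCACHE_LINESIZE - 1;
    return (n + mask) & ~mask;
}

/* Byte offset of a pointer within its cache line (0 means aligned). */
static size_t offset_in_line(const void *p) {
    return (size_t)((uintptr_t)p & (uintptr_t)(LEVEL1_DCACHE_LINESIZE - 1));
}
```

Because the value is a constant, the compiler can fold the masks at compile time, which is the benefit over asking at run time.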

At run time, sysconf(_SC_LEVEL1_DCACHE_LINESIZE) can be used to get the L1 data cache line size.
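A small run-time sketch, assuming glibc on Linux. The function name cache_line_size and the fallback of 64 are assumptions of mine; sysconf can return 0 or -1 for this query on some systems (virtual machines, for instance), so the result is checked before use:

```c
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

/* Returns the L1 data cache line size in bytes. Falls back to 64 (an
 * assumed default) when the C library does not define the constant or
 * the kernel does not report a usable value. */
static size_t cache_line_size(void) {
#ifdef _SC_LEVEL1_DCACHE_LINESIZE
    long n = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    if (n > 0) {
        return (size_t)n;
    }
#endif
    return 64;
}
```

Call it once at start-up and cache the result; the value does not change while the program runs.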

Gripsack answered 2/9, 2011 at 14:24 Comment(8)
Where are these sysconf()s specified? POSIX / IEEE Std 1003.1-20xx ?Peisch
@BrianCain pubs.opengroup.org/onlinepubs/9699919799/functions/sysconf.htmlGripsack
@BrianCain I use Linux, so I just did man sysconf. Linux is not exactly POSIX compliant, so the Linux-specific documentation is often more useful. Sometimes it is out of date, so you just egrep -nH -r /usr/include -e '\b_SC'.Gripsack
In case of Mac, use sysctl hw.cachelinesize.Arminius
Usually it's so much better to have a compile-time-constant line size that I'd rather hard-code 64 than call sysconf. The compiler won't even know it's a power of 2, so you'll have to manually do stuff like offset = ptr & (linesize-1) for remainder or bit-scan + right-shift to implement division. You can't just use / in code that's performance-sensitive.Rebba
But if you used a cross compiler, that wouldn't work, right? Because it would get the cache line size of your current architecture and not that of your target architecture.Diamagnetic
@Diamagnetic When cross-compiling you would need to obtain that getconf LEVEL1_DCACHE_LINESIZE from your target architecture, sure. Your build system might provide it, or you'd have to hardcode it as a system-specific value into your build system.Gripsack
@Diamagnetic Another method is to have arch-specific implementations in different shared libraries and load the right one at run-time. Or, more advanced users, could have their own mechanisms of using arch-specific functions, but one would need to be an expert with all the details involved (which isn't rocket science, but requires a bit of thorough reading and appreciation).Gripsack
W
14

Another simple way is to just cat the /proc/cpuinfo:

grep cache_alignment /proc/cpuinfo
Whoremaster answered 2/6, 2012 at 7:17 Comment(1)
Perhaps you want to remove a useless use of cat.Notion
C
9

There's no completely portable way to get the cacheline size. But if you're on x86/64, you can call the cpuid instruction to get everything you need to know about the cache - including size, cacheline size, how many levels, etc...

http://softpixel.com/~cwright/programming/simd/cpuid.php

(scroll down a little bit, the page is about SIMD, but it has a section getting the cacheline.)
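For what it's worth, here is a rough sketch of the cpuid route using the <cpuid.h> helper that ships with gcc and clang. It reads the CLFLUSH line size from leaf 1, which is only one of several places cpuid reports cache details. The function name cpuid_cache_line_size is made up, and the function returns 0 when it is not built for x86 or the field is not available:

```c
#include <assert.h>
#include <stddef.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h> /* __get_cpuid, provided by gcc and clang */
#endif

/* Returns the CLFLUSH line size in bytes as reported by CPUID leaf 1,
 * or 0 if it cannot be determined (non-x86 build, or no CLFLUSH). */
static size_t cpuid_cache_line_size(void) {
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
        /* EDX bit 19 (CLFSH) says whether the field below is valid. */
        if (edx & (1u << 19)) {
            /* EBX bits 15:8 hold the line size in units of 8 bytes. */
            return (size_t)((ebx >> 8) & 0xffu) * 8u;
        }
    }
#endif
    return 0;
}
```

Treat a return value of 0 as "unknown" and fall back to a constant such as 64.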

As for aligning your data structures, there's also no completely portable way to do it. GCC and VS10 have different ways to specify alignment of a struct. One way to "hack" it is to pad your struct with unused variables until it matches the alignment you want.

To align your mallocs(), all the mainstream compilers also have aligned malloc functions for that purpose.

Colorable answered 2/9, 2011 at 14:52 Comment(0)
R
8

posix_memalign or valloc can be used to align allocated memory to a cache line.
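A short sketch of the posix_memalign route, assuming 64-byte lines. CACHE_LINE_SIZE and alloc_cache_lines are names I made up for the example:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Assumption: 64-byte cache lines. posix_memalign needs a power of two
 * that is also a multiple of sizeof(void *), which 64 satisfies. */
#define CACHE_LINE_SIZE 64

/* Allocates `count` cache-line-sized, cache-line-aligned slots.
 * Returns NULL on failure; release the memory with free().
 * Note: with strict -std=c11 you may need -D_POSIX_C_SOURCE=200112L
 * for the posix_memalign declaration to be visible. */
static void *alloc_cache_lines(size_t count) {
    void *p = NULL;
    if (count == 0 || count > SIZE_MAX / CACHE_LINE_SIZE) {
        return NULL; /* nothing to allocate, or size would overflow */
    }
    if (posix_memalign(&p, CACHE_LINE_SIZE, count * CACHE_LINE_SIZE) != 0) {
        return NULL; /* posix_memalign reports errors via its return value */
    }
    return p;
}
```

Slot i then starts at byte offset i * 64 from the returned pointer, so the first slot and every later one begin on a line boundary.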

Rival answered 2/9, 2011 at 9:56 Comment(4)
I know this is your own question, but for future readers you could answer both parts of it :-)Earhart
Steve, do you know if memory mapped by mmap is aligned to a cache line.Rival
I don't think it's guaranteed by Posix, but I also wouldn't be in the least surprised if linux always selects addresses that are page-aligned, never mind just cache-line aligned. Posix says that if the caller specifies the first parameter (address hint), that has to be page-aligned, and the mapping itself is always a whole number of pages. That's strongly suggestive without actually guaranteeing anything.Earhart
Yes, mmap only works in terms of pages, and pages are always larger than cache lines. Even in some theoretical weird architecture, there are extremely good reasons why cache lines won't be larger than pages (caches are normally physically tagged, so one line can't be split across 2 virtual pages without extreme pain for the CPU designers).Rebba
C
3

Here's a table I made that has most Arm/Intel processors on it. You can use it for reference when defining constants, that way you don't have to generalize the cache line size for all architectures.

For C++, hopefully, we will soon see the hardware interference size constants (std::hardware_destructive_interference_size), which should be an accurate way to get this information (assuming you tell the compiler your target architecture).

Clarisclarisa answered 29/11, 2019 at 11:36 Comment(1)
Compilers are reluctant to implement hardware_destructive_interference_size because you really want it to be a compile-time-constant, but it can't always be if you're compiling for a "generic" target that could run on multiple CPUs of the same ISA. A conservative choice would be possible but not guaranteed future-proof. (Like 128 bytes to account for current x86 CPU with 64-byte lines and an L2 spatial prefetch that likes to complete an aligned pair of lines. (mainstream Intel))Rebba
T
2

If anyone is curious about how to do this easily in C++, I've built a library with a CacheAligned<T> class which handles determining the cache line size as well as the alignment for your T object, referenced by calling .Ref() on your CacheAligned<T> object. You can also use Aligned<typename T, size_t Alignment> if you know the cache line size beforehand, or just want to stick with the very common value of 64 (bytes).

https://github.com/NickStrupat/Aligned

Tedtedd answered 23/1, 2015 at 6:22 Comment(7)
@James - alignas is C++11. It's not available for C++03. And it won't work on a number of Apple platforms. On some of their OSes, Apple provides an ancient C++ Standard Library that pretends to be C++11, but lacks unique_ptr, alignas, etc.Spiderwort
@James also, the standard only requires alignas to support up to 16 bytes, so any higher value won't be portable. And since virtually all modern processors have a cache line size of 64 bytes, alignas isn't useful unless you know your compiler supports alignas(64).Tedtedd
alignas is also in C11, not just C++11.Rsfsr
alignas officially only supports alignments up to that of the type std::max_align_t, which is typically the alignment requirement of a long double, aka 8 or 16 bytes - not 64 unfortunately.Carbone
@NickStrupat It seems that support for alignment to cache line sizes has finally been added in C++17. My last comment also seems to be no longer correct for C++17 (the problem was merely that operator new was not guaranteed to return memory aligned more strictly than std::max_align_t). I just found this: en.cppreference.com/w/cpp/thread/…Carbone
@CarloWood You're right about the C++17 addition. The only advantage remaining for my library and its underlying get_cachline_size function is that it can retrieve that information at run-time. The downside is that you lose possible compiler optimizations if the cache line size is known at compile time.Tedtedd
@NickStrupat After posting this comment, I tried it out and discovered that neither gcc nor clang support it... Apparently they went for option 3 in lists.llvm.org/pipermail/cfe-dev/2018-May/058138.html (I read the whole thread; it's long but to summarize -- they have no clue how to implement it and were thinking about filing a Defect Report). Nevertheless, your library will of course have the exact same ABI/ODR issues. I'm starting to feel that simply using 64 bytes everywhere for now is my best option :/.Carbone
