Should I align data to their data type or cpu cache line size?
Data is usually aligned according to its own data type; e.g., a 32-bit int is usually aligned to 4 bytes. This makes loading and storing it more efficient for the processor.

Now, when does cache-line alignment come into play? If the x64 cache line size is 64 bytes, should I align every piece of data to 64 bytes? That seems like a waste of memory.

What is the relation between these two types of alignment? And if CPU interaction with a cache line is always 64 bits at a time, why does data-type alignment even matter?

Swearword answered 31/7, 2023 at 15:43 Comment(0)

Data is usually aligned according to its own data type; e.g., a 32-bit int is usually aligned to 4 bytes. This makes loading and storing it more efficient for the processor.

On some architectures, it's not so much a matter of efficiency but of your code working at all. Attempting to access a misaligned object can produce a trap. Also, it is not necessarily the case that the natural alignment for a given data type is the same as the size of that data type. The natural alignment cannot, in practice, be larger than the type's size, but it can be smaller.

Now, when does cache-line alignment come into play? If the x64 cache line size is 64 bytes, should I align every piece of data to 64 bytes? That seems like a waste of memory.

Indeed so. And counterproductive, too. One of the ways cache helps, and among the reasons that cache line size is generally several times larger than the machine's native word size, is that an access to a word at one address is very frequently followed by accesses to words at nearby addresses. Thus, one often provides for future reads by loading the cache line for the current one. If every object were aligned to the cache line size (supposing that were even feasible), you would thereby throw away a lot of the advantage otherwise obtained by caching.

What is the relation between these two types of alignment?

Cache lines are ordinarily aligned more strictly than any native data type, so an object aligned to a cache-line boundary will also be aligned properly for its data type. That also means that no naturally aligned object of native data type will straddle two cache lines. Other than that, I'm not really sure what you might be asking.

and if CPU interaction with a cache line is always 64 bits at a time, why does data-type alignment even matter?

I guess you meant 64 bytes, not bits. But the question is anyway ill-conceived. Any involvement of cache is a detail of CPU interaction with memory. And because cache line alignment is stricter than any native data type's, an object has the same alignment in cache as it has in main memory. Caching and alignment are pretty much orthogonal matters.

Overall, aligning objects naturally for their data types is a fairly important consideration (though not always essential), whereas, to a first approximation, aligning individual objects to cache lines is of no particular import. Cache and cache lines become important in the context of patterns of access to multiple objects. Ideally, one structures the data and code so that cache lines do not get invalidated or flushed (or, therefore, reloaded) any more than necessary, but that has little to do with the alignment of individual objects.

Sharasharai answered 31/7, 2023 at 17:26 Comment(19)
My understanding of the term "natural alignment" is that it means being aligned by its size, for objects with power-of-2 sizes. (Which makes it impossible for a naturally aligned object to be split across a wider boundary, as you later mention.) What you're talking about in the early part of your answer, like sizeof(long double) == 12 / alignof(long double) == 4 in the i386 System V ABI for example (so 2 bytes of padding for the actual 10-byte data) is the required alignment (in that case just by software, not hardware), but I wouldn't call it "natural".Miscreance
I actually did mean 64 bits at a time: according to the link I posted, x64 updates cache lines based on its bus width of 64 bits, so it does bursts of 8-byte transfers until it fills up the cache line. I do know of certain architectures that trap on unaligned loads/stores, but why is that? If the CPU actually loads/stores in cache-line blocks, then doesn't data-type alignment become irrelevant?Swearword
Good answer. In the last paragraph, you might mention that aligning an array of int to start at a cache-line boundary can improve locality, so looping over the whole array will touch the minimum number of cache lines. For example if it's small then all 16 elements can come from the same cache line.Miscreance
@PeterCordes, I am using "natural alignment" to mean the alignment that is optimal for accessing an object of a given type. Perhaps that is a minority definition, but I am not alone in using it, and "required alignment" is not right for that concept in contexts where differently-aligned objects can still be accessed.Sharasharai
@Dan: Aligning an object to its size (natural alignment) makes sure that accessing one int or uint64_t only needs to access one cache line. If you have an int at address 0x...1fff, then its first byte will come from one cache line, its last 3 bytes from another cache line (in another page, which is way more expensive for hardware to handle.) One reason for HW to not allow unaligned loads/stores is to avoid having to deal with cache-line and page splits, and to avoid needing a shifter on the critical path of a load unit.Miscreance
@Dan: See also How much of 'What Every Programmer Should Know About Memory' is still valid? / Are there any modern CPUs where a cached byte store is actually slower than a word store? / Can modern x86 hardware not store a single byte to memory? / Are word-aligned loads faster than unaligned loads on x64 processors? / How can I accurately benchmark unaligned access speed on x86_64?Miscreance
@JohnBollinger: good point, even if such alignment is required/guaranteed by the ABI, you're talking about hardware in that case. And thanks for the link to Eric's answer. I hadn't seen that usage of the term before; I guess that's a more general meaning.Miscreance
That makes sense. When a CPU has a read/write miss and needs to read from memory into cache (or write back from cache to memory), does it do so in cache-line-sized blocks, or does it update only by the data size, like a single int? Say it needs to read a single int into cache; does it read an entire 64 bytes and discard everything but the int?Swearword
To add to my question (hopefully the last one): what are some good scenarios in which cache-line alignment is beneficial? I'm guessing the start of instructions, or, in the case of a hot loop, ensuring every fetch of the start of the loop stays within a cache line?Swearword
@Dan, generally speaking, modern CPUs access main memory only through their caches. For their part, caches interact with the memory below in cache-line units. And having read a cache line's worth of data into cache, what would it even mean to "discard" part of it? The data are there, and the CPU's MMU knows they are there. The CPU can access any of that cached data without refreshing the cache from main memory, and that's what the second paragraph of this answer is about.Sharasharai
@Dan, I am not aware of any scenario where alignment to a cache line is useful in and of itself. However, it can be useful for a small-enough collection of related data to all fit into one cache line, or for a larger set to occupy fewer cache lines instead of more. Aligning data to a cache line boundary may help with that, but depending on the data size, that may not be the only way to achieve the same advantage.Sharasharai
Replacement policies describe how the CPU uses prediction to evict a single entry from cache and load something to replace it. What I don't get (and can't find an article on) is how it would update a single spot in a cache line if the CPU loads the entire block regardless. Replacing the entire cache line every time doesn't seem like a good idea; it might have data in there that it needs. @PeterCordes in case you have heard of this before.Swearword
@JohnBollinger loop alignment seems to be a thing actually: https://mcmap.net/q/56752/-why-is-my-loop-much-faster-when-it-is-contained-in-one-cache-lineSwearword
@Dan: Code alignment is totally different from aligning the data accessed by code. Modern x86 has a lot of complicated stuff going on, like caching the uops of decoded instructions because x86 machine code is so hard to decode.Miscreance
@Dan: A cache replacement policy is about choosing which whole line to evict on a cache miss (which triggers filling a new line). A store instruction (eventually when it commits from the store buffer to L1d cache) modifies part of an existing line in cache, marking it as "dirty" (needing write-back when it's eventually evicted). Caches only track clean/dirty (and valid = present at all) at cache-line granularity, not individual int sized chunks. (Every line needs a tag and other metadata; using larger cache lines amortizes that overhead. 64B is also the DDR SDRAM burst size.)Miscreance
@Dan: Other than locality for a whole array, one reason you'd align int data by more than just 4 bytes is so you can more efficiently process 4 ints or 8 ints at once from the start of the array with SIMD instructions like SSE2 or AVX2 paddd. But the reason there is not really related to cache, it's efficiency of a 16-byte movdqa vs. movdqu load instruction and the load/store execution units. In modern x86, cache lines get transferred between cores and memory controllers over wide busses, like 32 bytes wide for Intel's ring bus.Miscreance
I see, so each line is 64 bytes and updates occur one line at a time.Swearword
@Dan: Yes. Seriously, go take a couple hours to read parts of akkadia.org/drepper/cpumemory.pdf - it's extremely good. Then have a look at How much of ‘What Every Programmer Should Know About Memory’ is still valid? . If you have questions left after that (especially which wikipedia pages like en.wikipedia.org/wiki/Cache_replacement_policies don't answer), then ask some followups.Miscreance
Will do that, thank you for sharing the links.Swearword
