Why 16-byte alignment for `long double`?

64-bit architectures like x86-64 have a word size of 64 bits. If a memory access crosses a word boundary, it takes twice as long to access the data, so alignment is required. This is what I know; correct me if I am wrong.

Now, GCC uses 16-byte alignment for `long double`, whose non-padding size is 10 bytes (MSVC at least uses 8-byte alignment). But with 8-byte alignment a read takes 2 read cycles, and the same is true with 16-byte alignment. So why the stricter 16-byte alignment? What is the purpose of alignment other than what I mentioned above?

Also, since the non-padding part of `long double` (the 80-bit x87 extended FP format) is 10 bytes, 4-byte alignment should actually be sufficient. In that case too, the data can be read within 2 read cycles (split as either 4+6 or 8+2). So please also explain where this assumption goes wrong.

(The actual `sizeof(long double)` is 12 in the i386 System V ABI and 16 in x86-64 System V: multiples of their respective `alignof()` values of 4 and 16.)
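The size and alignment figures above can be checked on any given compiler with a few lines of C. This is a minimal sketch, not part of the original question; the helper names (`ld_size`, `ld_align`, `ld_is_x87_extended`, `print_ld_layout`) are made up for illustration, and the `LDBL_MANT_DIG == 64` test is only a heuristic for "this `long double` is the x87 80-bit format".

```c
#include <assert.h>
#include <float.h>
#include <stddef.h>
#include <stdio.h>

/* Storage size of long double, padding included. */
size_t ld_size(void)  { return sizeof(long double); }

/* Alignment the compiler requires for long double (C11 _Alignof). */
size_t ld_align(void) { return _Alignof(long double); }

/* Heuristic (assumption): a 64-bit significand suggests the x87
   80-bit extended format, which carries 10 bytes of value bits. */
int ld_is_x87_extended(void) { return LDBL_MANT_DIG == 64; }

/* Print the layout chosen by the current compiler and ABI. */
void print_ld_layout(void) {
    /* C guarantees the size is a multiple of the alignment, so that
       every element of an array stays aligned. */
    assert(ld_size() % ld_align() == 0);
    printf("sizeof(long double)  = %zu\n", ld_size());
    printf("alignof(long double) = %zu\n", ld_align());
    printf("looks like x87 80-bit: %s\n",
           ld_is_x87_extended() ? "yes" : "no");
}
```

Calling `print_ld_layout()` from `main` on an x86-64 System V target should report the 16 and 16 quoted above, and 12 and 4 when built for i386 System V; other ABIs (MSVC, for example) will give different numbers.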

Ashby answered 10/7, 2021 at 6:2 Comment(7)
x86-64 doesn't have a "word size"; that's not a meaningful concept for x86, which can load/store any power-of-2 width from 1 byte to 32 bytes (or 64 with AVX-512 capable CPUs), with near-equal performance as long as the load doesn't cross a 64-byte cache line boundary. - Mucky
Huh!? Then why is alignment required? - Ashby
Probably because older CPUs could only do a 16-byte load/store efficiently when it was naturally aligned. See "Why does the x86-64 / AMD64 System V ABI mandate a 16 byte stack alignment?". 80-bit x87 is slow anyway, so it's a somewhat questionable decision to use that much extra space in arrays, although `fld m80` does decode into a 2-byte and an 8-byte load, so 8-byte and 16-byte alignment are both sufficient to avoid cache-line splits in either of the halves. But that only holds if the size is 16 bytes, so you might as well make the alignment match the size for SSE copying. - Mucky
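The point about the two halves can be checked with simple address arithmetic. The sketch below is an illustration, not a measurement: it assumes 64-byte cache lines and models `fld m80` as an 8-byte load of the significand at the object's address plus a 2-byte load of the sign and exponent at offset 8 (the x87 extended layout). The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define CACHE_LINE 64u  /* assumption: 64-byte cache lines */

/* Returns 1 if the byte range [addr, addr + len) spans two cache lines. */
int splits_line(uint64_t addr, uint64_t len) {
    assert(len > 0 && len <= CACHE_LINE);
    return addr / CACHE_LINE != (addr + len - 1) / CACHE_LINE;
}

/* Model of fld m80: 8-byte significand at addr, 2-byte sign/exponent
   at addr + 8. Returns 1 if either half is split across lines. */
int fld_half_splits(uint64_t addr) {
    return splits_line(addr, 8) || splits_line(addr + 8, 2);
}

/* Counts the elements of an n-element array, starting at base with the
   given stride (the sizeof of one element), for which a half splits. */
unsigned count_split_elements(uint64_t base, uint64_t stride, unsigned n) {
    unsigned count = 0;
    for (unsigned i = 0; i < n; i++)
        count += (unsigned)fld_half_splits(base + (uint64_t)i * stride);
    return count;
}
```

With a 16-byte stride, a base that is either 16-byte aligned (`count_split_elements(0, 16, n)`) or only 8-byte aligned (`count_split_elements(8, 16, n)`) gives no split halves. With the i386 layout of a 12-byte stride and 4-byte alignment, some elements do split: the element at byte offset 60 has its 8-byte half straddling the boundary at 64.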
The Intel Optimization Manual recommends 16-byte alignment for the 80-bit long double, but it does not explain why or what the impact is. My quick experiments showed no impact of (mis)alignment, only of crossing cache line boundaries, as expected. - Wicklow
Re: the concept of a "word": see "Weird data sizes?" and "Does Word length == number of bits transferred between memory and CPU per access?", and my longish answer at "How does the CPU reads a double value?" re: how CPUs access memory through cache. Also "What's the actual effect of successful unaligned accesses on x86?" and "How can I accurately benchmark unaligned access speed on x86_64". - Mucky
Re: x87 performance on modern CPUs (and AMD K8, which was the relevant CPU when the x86-64 System V ABI was being designed), see "Did any compiler fully use Intel x87 80-bit floating point?" on retrocomputing.SE. - Mucky
Aligning each object to a multiple of its size is the easiest way to ensure that no object crosses a cache line boundary. - Magbie
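A short sketch of why this rule works, using hypothetical helpers that are not from the thread. It relies on one assumption worth stating: the object size is a power of two no larger than the cache line, so the size divides the line size evenly. A 10-byte object aligned to 10 would not satisfy this, which is one reason to pad `long double` to 16.

```c
#include <assert.h>
#include <stdint.h>

/* Rounds addr up to the next multiple of align (a power of two). */
uint64_t align_up(uint64_t addr, uint64_t align) {
    assert(align != 0 && (align & (align - 1)) == 0);
    return (addr + align - 1) & ~(align - 1);
}

/* Returns 1 if [addr, addr + size) spans two cache lines of `line` bytes. */
int crosses_line(uint64_t addr, uint64_t size, uint64_t line) {
    assert(size > 0 && line > 0);
    return addr / line != (addr + size - 1) / line;
}

/* Tries every size-aligned placement in the first `span` bytes and
   returns how many of them cross a cache line boundary. */
unsigned crossings_when_size_aligned(uint64_t size, uint64_t line,
                                     uint64_t span) {
    unsigned count = 0;
    assert(size > 0);
    for (uint64_t addr = 0; addr + size <= span; addr += size)
        count += (unsigned)crosses_line(addr, size, line);
    return count;
}
```

For example, a 10-byte object placed at address 60 crosses the boundary at 64, while `align_up(60, 16)` moves a 16-byte object to address 64, where it fits inside one line. `crossings_when_size_aligned(16, 64, 4096)` finds no crossing placements at all.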
