Why 16-byte alignment for `long double`?

Asked 10/7, 2021 at 6:2 Answered 10/7, 2021 at 6:2

x86-64 cpu-architecture memory-alignment abi long-double

64-bit architectures like x86-64 have a word size of 64 bits. In that case, if a memory access crosses a word boundary, it takes twice as long to access the data, so alignment is required. That is what I know; correct me if I am wrong.

Now, GCC uses 16-byte alignment for long double (MSVC at least uses 8-byte alignment), even though its non-padding size is 10 bytes. But with 8-byte alignment it takes 2 read cycles, and the same is true with 16-byte alignment. So why the stricter 16-byte alignment? What purpose does alignment serve other than the one I mentioned above?

In fact, since the non-padding part of long double (the 80-bit x87 extended FP format) is 10 bytes, 4-byte alignment should be sufficient. In that case too, the data can be read within 2 read cycles (split either 4-6 or 8-2). So please also explain where this assumption goes wrong.

(The actual sizeof(long double) is 12 in the i386 System V ABI, 16 in x86-64 System V. Multiples of their respective alignof() of 4 and 16)
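The size and alignment figures above can be checked on any given compiler and target with a small C sketch like the one below. The function names are hypothetical, chosen only for illustration; the values returned depend on the ABI in use.

```c
#include <assert.h>
#include <stddef.h>

/* Storage size of long double for this compiler and ABI (includes padding). */
size_t long_double_size(void)  { return sizeof(long double); }

/* Required alignment of long double for this compiler and ABI. */
size_t long_double_align(void) { return _Alignof(long double); }

/* Hypothetical helper: checks properties that hold on every ABI.
   The size is a multiple of the alignment, so that every element
   of an array of long double stays aligned. */
void check_long_double_layout(void) {
    assert(long_double_size() % long_double_align() == 0);
    assert(long_double_size() >= sizeof(double));
}
```

Going by the figures quoted above, these functions should return 16 and 16 on x86-64 System V, and 12 and 4 on i386 System V.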

Ashby answered 10/7, 2021 at 6:2 Comment(7)

x86-64 doesn't have a "word size", that's not a meaningful concept for x86, which can load/store any power-of-2 width from 1 byte to 32 bytes (or 64 with AVX-512 capable CPUs), with near-equal performance as long as the load doesn't cross a 64-byte cache line boundary. – Mucky 10/7, 2021 at 18:50

Huh!? Then why is alignment required? – Ashby 10/7, 2021 at 18:52

Probably because older CPUs could only do 16-byte load/store efficiently when it was naturally aligned; see Why does the x86-64 / AMD64 System V ABI mandate a 16 byte stack alignment? 80-bit x87 is slow anyway, so it's a somewhat questionable decision to use that much extra space in arrays, although fld m80 does decode into a 2-byte and an 8-byte load, so 8-byte and 16-byte alignment are both sufficient to avoid cache-line splits in either of the halves. But only if the size is 16 bytes, so you might as well make the alignment match the size for SSE copying. – Mucky 10/7, 2021 at 18:55
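The cache-line argument in the comment above can be sketched as a small address-arithmetic model in C. It assumes a 64-byte cache line and takes the comment's description of fld m80 at face value: an 8-byte load at the object's address followed by a 2-byte load at offset 8. All names here are hypothetical, and this models addresses only, not measured hardware behaviour.

```c
#include <stdbool.h>
#include <stdint.h>

enum { CACHE_LINE = 64 };  /* assumed cache line size in bytes */

/* True if the byte range [addr, addr + len) spans two cache lines. */
bool crosses_line(uint64_t addr, uint64_t len) {
    return addr / CACHE_LINE != (addr + len - 1) / CACHE_LINE;
}

/* Assumed model of fld m80: an 8-byte load at addr plus a 2-byte
   load at addr + 8. True if either individual load is split. */
bool half_splits(uint64_t addr) {
    return crosses_line(addr, 8) || crosses_line(addr + 8, 2);
}

/* Count the addresses in one cache line that are multiples of `align`
   (align must be nonzero) at which one of the two loads is split. */
unsigned count_half_splits(unsigned align) {
    unsigned n = 0;
    for (uint64_t addr = 0; addr < CACHE_LINE; addr += align)
        if (half_splits(addr)) n++;
    return n;
}

/* Same count, but for the 10-byte object as a whole. */
unsigned count_object_splits(unsigned align) {
    unsigned n = 0;
    for (uint64_t addr = 0; addr < CACHE_LINE; addr += align)
        if (crosses_line(addr, 10)) n++;
    return n;
}
```

Under this model, 4-byte alignment leaves one offset per line (60) where the 8-byte load is split, while 8-byte and 16-byte alignment leave none. With 8-byte alignment the object as a whole can still straddle two lines (at offset 56), but each of the two loads stays within a single line.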

The Intel Optimization Manual recommends 16-byte alignment for the 80-bit long double, but it does not explain why or what the impact is. My quick experiments showed no impact of (mis)alignment, only of crossing cache line boundaries, as expected. – Wicklow 10/7, 2021 at 19:6

Re: the concept of a "word": see Weird data sizes? and Does Word length == number of bits transferred between memory and CPU per access?, and my longish answer at How does the CPU reads a double value? re: how CPUs access memory through cache. Also What's the actual effect of successful unaligned accesses on x86? / How can I accurately benchmark unaligned access speed on x86_64 – Mucky 10/7, 2021 at 21:6

Re: x87 performance on modern CPUs (and AMD K8, which was the relevant microarchitecture when the x86-64 System V ABI was being designed), see Did any compiler fully use Intel x87 80-bit floating point? on retrocomputing.SE – Mucky 10/7, 2021 at 21:9

Aligning each object to a multiple of its size is the easiest way to ensure that no object crosses a cache line boundary. – Magbie 10/7, 2021 at 22:12
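A minimal C sketch of that rule, assuming 64-byte cache lines and a hypothetical helper name: it places an object at every multiple of its own size and checks whether any placement straddles a line. The rule works when the size is a power of two no larger than the line, which is one way to read the padding of the 10-byte value out to 16 bytes.

```c
#include <stdbool.h>
#include <stddef.h>

/* True if an object of `size` bytes (size must be nonzero), placed at
   every multiple of `size` that fits below `limit`, always stays
   within a single cache line of `line` bytes. */
bool size_aligned_never_splits(size_t size, size_t line, size_t limit) {
    for (size_t addr = 0; addr + size <= limit; addr += size) {
        if (addr / line != (addr + size - 1) / line)
            return false;  /* first and last byte are in different lines */
    }
    return true;
}
```

For example, `size_aligned_never_splits(16, 64, 4096)` holds, while a size of 10 fails because the object starting at offset 60 runs into the next line.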
