TL;DR: It allows efficient stack allocation of variables with alignof(T) == 16, such as long double and __m128i, and of local arrays to make SSE2 vectorization efficient.
For details on how to write asm that respects the ABI, see "glibc scanf Segmentation faults when called from a function that doesn't align RSP". (scanf is just one example of a function where compiler-generated asm in the library relies on that ABI guarantee, using movaps to copy 16 bytes at a time to and/or from locals on the stack.)
Note that the current version of the i386 System V ABI used on Linux also requires 16-byte stack alignment (see Footnote 1 below). See https://sourceforge.net/p/fbc/bugs/659/ for some history, and my comment on https://gcc.gnu.org/bugzilla/show_bug.cgi?id=40838#c91 for an attempt at summarizing the unfortunate history of how i386 GNU/Linux + GCC accidentally got into a situation where a backwards-incompatible change to the i386 System V ABI was the lesser of two evils.
Windows x64 also requires 16-byte stack alignment before a call, presumably for similar motivations as x86-64 System V.
Also, semi-related: x86-64 System V requires that global arrays of 16 bytes and larger be aligned by 16. Same for local arrays of >= 16 bytes or variable size, although that detail is only relevant across functions if you know that you're being passed the address of the start of an array, not a pointer into the middle. (See "Different memory alignment for different buffer sizes".) It doesn't let you make any extra assumptions about an arbitrary int *.
SSE2 is baseline for x86-64, and making the ABI efficient for types like __m128, and for compiler auto-vectorization, was one of the design goals, I think. The ABI has to define how such args are passed as function args, or by reference.
16-byte alignment is sometimes useful for local variables on the stack (especially arrays), and guaranteeing 16-byte alignment means compilers can get it for free whenever it's useful, even if the source doesn't explicitly request it.
If the stack alignment relative to a 16-byte boundary weren't known, every function that wanted an aligned local would need an and rsp, -16, and extra instructions to save/restore rsp after an unknown offset to rsp (either 0 or -8), e.g. using up rbp for a frame pointer.
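The arithmetic behind and rsp, -16 can be sketched in C. This is only an illustration of the address math, not of real stack manipulation; the function names are made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Round a stack-pointer value down to a 16-byte boundary:
   the same bit operation that `and rsp, -16` performs. */
uintptr_t align_down_16(uintptr_t sp)
{
    return sp & ~(uintptr_t)15;
}

/* How far the pointer moved. This is the "unknown offset" that a function
   would have to undo later, which is why it needs a saved copy of rsp. */
uintptr_t align_adjustment(uintptr_t sp)
{
    uintptr_t adj = sp - align_down_16(sp);
    assert(adj < 16);
    return adj;
}
```

With a stack pointer that is only known to be 8-byte aligned, align_adjustment returns either 0 or 8, matching the two cases described above.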
Without AVX, memory source operands have to be 16-byte aligned. e.g. paddd xmm0, [rsp+rdi] faults if the memory operand is misaligned. So if alignment isn't known, you'd have to either use movups xmm1, [rsp+rdi] / paddd xmm0, xmm1, or write a loop prologue / epilogue to handle the misaligned elements. For local arrays that the compiler wants to auto-vectorize over, it can simply choose to align them by 16.
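As a rough sketch of what the "loop prologue" option costs, here is the alignment check and prologue-length calculation written in C. The helper names are hypothetical, and the code assumes a 4-byte int and a pointer that is at least int-aligned.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* True if p sits on a 16-byte boundary, i.e. an aligned load
   or a memory source operand would not fault. */
int is_aligned_16(const void *p)
{
    return ((uintptr_t)p & 15) == 0;
}

/* Number of leading ints to handle with scalar code before p
   reaches a 16-byte boundary. Assumes p is at least int-aligned. */
size_t scalar_prologue_len(const int *p)
{
    assert((uintptr_t)p % sizeof(int) == 0);
    size_t bytes = (16 - ((uintptr_t)p & 15)) & 15;
    return bytes / sizeof(int);
}
```

When the array is known to be 16-byte aligned, both the check and the prologue disappear, which is the saving the ABI guarantee buys.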
Also note that older x86 CPUs (before Nehalem / Bulldozer) had a movups instruction that was slower than movaps even when the pointer turned out to be aligned. (I.e. unaligned loads/stores on aligned data were extra slow, as well as preventing folding loads into an ALU instruction.) (See Agner Fog's optimization guides, microarch guide, and instruction tables for more about all of the above.)
These factors are why a guarantee is more useful than just "usually" keeping the stack aligned. Being allowed to make code which actually faults on a misaligned stack allows more optimization opportunities.
Aligned arrays also speed up vectorized memcpy / strcmp / whatever functions that can't assume alignment, but instead check for it and can jump straight to their whole-vector loops.
From a recent version of the x86-64 System V ABI (r252):
An array uses the same alignment as its elements, except that a local or global array variable of length at least 16 bytes or a C99 variable-length array variable always has alignment of at least 16 bytes. [4]
[4] The alignment requirement allows the use of SSE instructions when operating on the array. The compiler cannot in general calculate the size of a variable-length array (VLA), but it is expected that most VLAs will require at least 16 bytes, so it is logical to mandate that VLAs have at least a 16-byte alignment.
This is a bit aggressive, and mostly only helps when functions that auto-vectorize can be inlined, but usually there are other locals the compiler can stuff into any gaps so it doesn't waste stack space. It also doesn't waste instructions as long as there's a known stack alignment. (Obviously the ABI designers could have left this out if they'd decided not to require 16-byte stack alignment.)
Spill/reload of __m128
Of course, it makes it free to do alignas(16) char buf[1024]; or other cases where the source requests 16-byte alignment.
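A minimal C11 sketch of such a request, with a hypothetical function name. On an ABI that already guarantees 16-byte stack alignment, the compiler can satisfy the alignas with a fixed offset from the stack pointer instead of realigning at run time.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdint.h>
#include <string.h>

/* Returns the low 4 address bits of a local buffer declared with alignas(16).
   They should always be zero. */
unsigned local_buffer_misalignment(void)
{
    alignas(16) char buf[1024];
    memset(buf, 0, sizeof buf);   /* touch the buffer so it is a real stack object */
    unsigned low_bits = (unsigned)((uintptr_t)buf & 15);
    assert(low_bits == 0);        /* holds by construction because of alignas(16) */
    return low_bits;
}
```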
And there are also __m128 / __m128d / __m128i locals. The compiler may not be able to keep all vector locals in registers (e.g. spilled across a function call, or not enough registers), so it needs to be able to spill/reload them with movaps, or as a memory source operand for ALU instructions, for efficiency reasons discussed above.
Loads/stores that actually are split across a cache-line boundary (64 bytes) have significant latency penalties, and also minor throughput penalties on modern CPUs. The load needs data from 2 separate cache lines, so it takes two accesses to the cache. (And potentially 2 cache misses, but that's rare for stack memory.)
I think movups already had that cost baked in for vectors on older CPUs where it's expensive, but it still sucks. Spanning a 4k page boundary is much worse (on CPUs before Skylake), with a load or store taking ~100 cycles if it touches bytes on both sides of a 4k boundary. (Also needs 2 TLB checks.) Natural alignment makes splits across any wider boundary impossible, so 16-byte alignment was sufficient for everything you can do with SSE2.
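The claim that natural alignment rules out splits can be checked with plain address arithmetic. This is a small sketch with made-up helper names; it models addresses as integers and does not touch memory.

```c
#include <assert.h>
#include <stdint.h>

/* True if the access [addr, addr + size) touches bytes on both sides of a
   multiple of `boundary` (64 for a cache line, 4096 for a 4k page). */
int crosses_boundary(uintptr_t addr, uintptr_t size, uintptr_t boundary)
{
    assert(size > 0 && boundary > 0);
    return addr / boundary != (addr + size - 1) / boundary;
}

/* Count how many 16-byte accesses at 16-byte-aligned addresses
   in [0, limit) are split by the given boundary. */
unsigned count_aligned_splits(uintptr_t limit, uintptr_t boundary)
{
    unsigned splits = 0;
    for (uintptr_t addr = 0; addr + 16 <= limit; addr += 16)
        splits += (unsigned)crosses_boundary(addr, 16, boundary);
    return splits;
}
```

A 16-byte access at address 56 straddles the cache-line boundary at 64, while no 16-byte-aligned access ever straddles a 64-byte or 4096-byte boundary, so count_aligned_splits returns 0 for both.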
max_align_t has 16-byte alignment in the x86-64 System V ABI, because of long double (10-byte/80-bit x87). It's defined as padded to 16 bytes for some weird reason, unlike in 32-bit i386 System V code where sizeof(long double) == 12 (the 10 data bytes padded only to a multiple of 4). x87 10-byte load/store is quite slow anyway (like 1/3rd the load throughput of double or float on Core2, 1/6th on P4, or 1/8th on K8), but maybe cache-line and page split penalties were so bad on older CPUs that they decided to define it that way. I think on modern CPUs (maybe even Core2) looping over an array of long double would be no slower with packed 10-byte, because the fld m80 would be a bigger bottleneck than a cache-line split every ~6.4 elements.
Actually, the ABI was defined before silicon was available to benchmark on (back in ~2000), but those K8 numbers are the same as K7 (32-bit / 64-bit mode is irrelevant here). Making long double 16-byte does make it possible to copy a single one with movaps, even though you can't do anything with it in XMM registers. (Except manipulate the sign bit with xorps / andps / orps.)
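To see what a given target actually uses, the sizes and alignments can be queried from C11. This is a small sketch with hypothetical helper names. On x86-64 System V the text above implies 16 for all three values, and other targets will differ, so the built-in checks only assert properties that hold everywhere.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>

size_t long_double_size(void)  { return sizeof(long double); }
size_t long_double_align(void) { return alignof(long double); }
size_t max_align(void)         { return alignof(max_align_t); }

/* Sanity checks that are target-independent. */
void check_long_double_layout(void)
{
    /* A type's size is always a multiple of its alignment. */
    assert(long_double_size() % long_double_align() == 0);
    /* max_align_t is at least as aligned as any fundamental type. */
    assert(max_align() >= long_double_align());
}
```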
Related: this max_align_t definition means that malloc always returns 16-byte aligned memory in x86-64 code. This lets you get away with using it for SSE aligned loads like _mm_load_ps, but such code can break when compiled for 32-bit where alignof(max_align_t) is only 8. (Use aligned_alloc or whatever.)
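A portable way to ask for the alignment explicitly is C11 aligned_alloc. The wrapper below is a hedged sketch with a hypothetical name, not a drop-in allocator. It rounds the size up to a multiple of 16 because C11 originally required the size to be a multiple of the alignment.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Allocate room for `count` floats on a 16-byte boundary.
   Returns NULL on failure; release the memory with free(). */
float *alloc_floats_aligned16(size_t count)
{
    size_t bytes = count * sizeof(float);
    bytes = (bytes + 15) & ~(size_t)15;   /* round up to a multiple of 16 */
    if (bytes == 0)
        bytes = 16;                       /* avoid a zero-size request */
    float *p = aligned_alloc(16, bytes);
    assert(p == NULL || ((uintptr_t)p & 15) == 0);
    return p;
}
```

Code written this way does not depend on what alignof(max_align_t) happens to be on the target.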
Other ABI factors include passing __m128 values on the stack (after xmm0-7 have the first 8 float / vector args). It makes sense to require 16-byte alignment for vectors in memory, so they can be used efficiently by the callee, and stored efficiently by the caller. Maintaining 16-byte stack alignment at all times makes it easy for functions that need to align some arg-passing space by 16.
There are types like __m128 that the ABI guarantees have 16-byte alignment. If you define a local and take its address, and pass that pointer to some other function, that local needs to be sufficiently aligned. So maintaining 16-byte stack alignment goes hand in hand with giving some types 16-byte alignment, which is obviously a good idea.
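The pointer-passing case can be sketched without intrinsics. The vec4 type here is a hypothetical stand-in for __m128 (four floats with 16-byte alignment), used only so the example stays plain C11.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdint.h>

/* Hypothetical stand-in for __m128: four floats, 16-byte aligned. */
typedef struct { alignas(16) float lane[4]; } vec4;

/* The callee is entitled to assume the type's alignment,
   e.g. a compiler could load *v with movaps. */
float sum_lanes(const vec4 *v)
{
    assert(((uintptr_t)v & 15) == 0);
    return v->lane[0] + v->lane[1] + v->lane[2] + v->lane[3];
}

/* The caller's local must therefore be placed on a 16-byte boundary. */
float sum_of_local(void)
{
    vec4 local = { { 1.0f, 2.0f, 3.0f, 4.0f } };
    return sum_lanes(&local);
}
```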
These days, it's nice that atomic<struct_of_16_bytes> can cheaply get 16-byte alignment, so lock cmpxchg16b doesn't ever cross a cache line boundary. For the really rare case where you have an atomic local with automatic storage, and you pass pointers to it to multiple threads...
Footnote 1: 32-bit Linux
Not all 32-bit platforms broke backwards compatibility with existing binaries and hand-written asm the way Linux did; some like i386 NetBSD still only use the historical 4-byte stack alignment requirement from the original version of the i386 SysV ABI.
The historical 4-byte stack alignment was also insufficient for efficient 8-byte double on modern CPUs. Unaligned fld / fstp are generally efficient except when they cross a cache-line boundary (like other loads/stores), so it's not horrible, but naturally-aligned is nice.
Even before 16-byte alignment was officially part of the ABI, GCC used to enable -mpreferred-stack-boundary=4 (2^4 = 16 bytes) on 32-bit. This currently assumes the incoming stack alignment is 16 bytes (even for cases that will fault if it's not), as well as preserving that alignment. I'm not sure if historical gcc versions used to try to preserve stack alignment without depending on it for correctness of SSE code-gen or alignas(16) objects.
ffmpeg is one well-known example that depends on the compiler to give it stack alignment: what is "stack alignment"?, e.g. on 32-bit Windows.
Modern gcc still emits code at the top of main to align the stack by 16 (even on Linux where the ABI guarantees that the kernel starts the process with an aligned stack), but not at the top of any other function. You could use -mincoming-stack-boundary to tell gcc how aligned it should assume the stack is when generating code.
Ancient gcc4.1 didn't seem to really respect __attribute__((aligned(16))) or aligned(32) for automatic storage, i.e. it doesn't bother aligning the stack any extra in this example on Godbolt, so old gcc has kind of a checkered past when it comes to stack alignment. I think the change of the official Linux ABI to 16-byte alignment happened as a de-facto change first, not a well-planned change. I haven't turned up anything official on when the change happened, but somewhere between 2005 and 2010 I think, after x86-64 became popular and the x86-64 System V ABI's 16-byte stack alignment proved useful.
At first it was a change to GCC's code-gen to use more alignment than the ABI required (i.e. using a stricter ABI for gcc-compiled code), but later it was written into the version of the i386 System V ABI maintained at https://github.com/hjl-tools/x86-psABI/wiki/X86-psABI (which is official for Linux at least).
@MichaelPetch and @ThomasJager report that gcc4.5 may have been the first version to have -mpreferred-stack-boundary=4 for 32-bit as well as 64-bit. gcc4.1.2 and gcc4.4.7 on Godbolt appear to behave that way, so maybe the change was backported, or Matt Godbolt configured old gcc with a more modern config.