You can see the effect on code-gen from using GNU C / C++ __builtin_assume_aligned.
gcc 7 and earlier targeting x86 (and ICC18) prefer to use a scalar prologue to reach an alignment boundary, then an aligned vector loop, then a scalar epilogue to clean up any leftover elements that weren't a multiple of a full vector.
Consider the case where the total number of elements is known at compile time to be a multiple of the vector width, but the alignment isn't. If the alignment were known, you'd need neither a prologue nor an epilogue; without that knowledge you need both, because the number of leftover elements after the last aligned vector isn't known.
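For reference, the general shape a compiler has to emit when alignment is unknown looks something like this plain-C sketch (mine, not compiler output; set42_generic is a made-up name, and the "vector" body is written as 4 scalar stores for clarity):

#include <stdint.h>
#include <stddef.h>

void set42_generic(int *p, size_t n) {   // assumes p is at least 4-byte aligned
    int *endp = p + n;
    // scalar prologue: step until p reaches a 16-byte boundary
    while (((uintptr_t)p & 15) && p < endp)
        *p++ = 0x42;
    // main loop: one aligned 16-byte store's worth of elements per iteration
    while (endp - p >= 4) {
        p[0] = p[1] = p[2] = p[3] = 0x42;
        p += 4;
    }
    // scalar epilogue: the 0..3 leftover elements
    while (p < endp)
        *p++ = 0x42;
}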
This Godbolt compiler explorer link shows these functions compiled for x86-64 by ICC18, gcc7.3, and clang6.0. clang unrolls very aggressively, but still uses unaligned stores; that seems like a weird way to spend so much code-size on a loop that does nothing but store.
// aligned, and size a multiple of vector width
void set42_aligned(int *p) {
    p = (int*)__builtin_assume_aligned(p, 64);
    for (int i = 0; i < 1024; i++) {
        *p++ = 0x42;
    }
}
# gcc7.3 -O3 (arch=tune=generic for x86-64 System V: p in RDI)
    lea     rax, [rdi+4096]               # end pointer
    movdqa  xmm0, XMMWORD PTR .LC0[rip]   # set1_epi32(0x42)
.L2:                                      # do {
    add     rdi, 16
    movaps  XMMWORD PTR [rdi-16], xmm0
    cmp     rax, rdi
    jne     .L2                           # } while(p != endp);
    rep ret
This is pretty much exactly what I'd do by hand, except maybe unrolling by 2 so OoO exec could discover the loop exit branch being not-taken while still chewing on the stores.
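A hedged sketch of that unrolled-by-2 idea with SSE2 intrinsics (my code, not from the Godbolt link; set42_aligned_unroll2 is a made-up name). Since 1024 is a multiple of 8 ints, no cleanup is needed:

#include <immintrin.h>

void set42_aligned_unroll2(int *p) {
    p = (int*)__builtin_assume_aligned(p, 64);
    __m128i v = _mm_set1_epi32(0x42);
    int *endp = p + 1024;
    do {   // two 16-byte stores per iteration halve the loop overhead
        _mm_store_si128((__m128i*)p,       v);
        _mm_store_si128((__m128i*)(p + 4), v);
        p += 8;
    } while (p != endp);
}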
So the unaligned version includes both a prologue and an epilogue:
// without any alignment guarantee
void set42(int *p) {
    for (int i = 0; i < 1024; i++) {
        *p++ = 0x42;
    }
}
# ~26 instructions of setup (not shown here), vs. 2 for the aligned version
.L8:                              # then a bloated loop with 4 uops instead of 3
    add     eax, 1
    add     rdx, 16
    movaps  XMMWORD PTR [rdx-16], xmm0
    cmp     ecx, eax
    ja      .L8                   # end of main vector loop
    # epilogue:
    mov     eax, esi              # then destroy the counter we spent an extra uop on inside the loop. /facepalm
    and     eax, -4
    mov     edx, eax
    sub     r8d, eax
    cmp     esi, eax
    lea     rdx, [r9+rdx*4]       # recalc a pointer to the last element, maybe to avoid a data dependency on the pointer from the loop.
    je      .L5
    cmp     r8d, 1
    mov     DWORD PTR [rdx], 66   # 66 = 0x42: fully-unrolled final up-to-3 stores
    je      .L5
    cmp     r8d, 2
    mov     DWORD PTR [rdx+4], 66
    je      .L5
    mov     DWORD PTR [rdx+8], 66
.L5:
    rep ret
Even for a more complex loop that would benefit from a little unrolling, gcc leaves the main vectorized loop not unrolled at all, but spends boatloads of code-size on fully-unrolled scalar prologues/epilogues. It's really bad for AVX2 256-bit vectorization with uint16_t elements (up to 15 elements in the prologue/epilogue, rather than 3). That's not a smart tradeoff, so it helps gcc7 and earlier significantly to tell them when pointers are aligned. (The execution speed doesn't change much, but it makes a big difference in reducing code-bloat.)
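For that AVX2 / uint16_t case, making the promise looks like this (my illustrative example; add1_u16 is a made-up name, and the caller really must provide 32-byte-aligned storage, e.g. from C11 aligned_alloc(32, ...) or _Alignas(32) uint16_t buf[1024];):

#include <stdint.h>

void add1_u16(uint16_t *p) {
    // promise gcc a full AVX2 vector of alignment so its vectorizer can
    // drop the up-to-15-element scalar prologue/epilogue
    p = (uint16_t*)__builtin_assume_aligned(p, 32);
    for (int i = 0; i < 1024; i++)   // 1024 is a multiple of 16 uint16_t per YMM vector
        p[i] += 1;
}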
BTW, gcc8 favours using unaligned loads/stores, on the assumption that data often is aligned anyway. Modern hardware has cheap unaligned 16- and 32-byte loads/stores, so letting the hardware eat the cost of the loads/stores that do split across a cache-line boundary is often a good bet. (AVX512 64-byte stores are usually worth aligning, because any misalignment means a cache-line split on every access, not just on every other or every 4th one.)
Another factor is that earlier gcc's fully-unrolled scalar prologues/epilogues are crap compared to smart handling where you do one unaligned, potentially-overlapping vector at the start/end (see the epilogue in this hand-written version of set42). If gcc knew how to do that, it would be worth aligning more often.
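A minimal sketch of that overlapping-vector trick (mine, not the exact code from the linked answer; set42_overlap is a made-up name): the first and last stores are unaligned and may overlap the aligned interior, which is harmless because every byte gets the same value either way:

#include <immintrin.h>
#include <stdint.h>

void set42_overlap(int *p) {   // 1024 elements, alignment unknown
    __m128i v = _mm_set1_epi32(0x42);
    int *endp = p + 1024;
    _mm_storeu_si128((__m128i*)p, v);                   // unaligned first vector
    p = (int*)(((uintptr_t)p + 16) & ~(uintptr_t)15);   // round up to the next 16-byte boundary
    do {                                                // aligned interior stores
        _mm_store_si128((__m128i*)p, v);
        p += 4;
    } while (p < endp - 4);
    _mm_storeu_si128((__m128i*)(endp - 4), v);          // unaligned last vector, may overlap
}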