It is known that GCC and Clang auto-vectorize loops well using SIMD instructions.
It is also known that the standard C++ alignas() specifier, among other uses, allows aligning stack variables, as in the following code:
#include <cstdint>
#include <iostream>

int main() {
    alignas(1024) int x[3] = {1, 2, 3};
    // y is a reference bound to x, declared with its own alignas().
    alignas(1024) int (&y)[3] = *(&x);
    std::cout << uint64_t(&x) % 1024 << " "
              << uint64_t(&x) % 16384 << std::endl;
    std::cout << uint64_t(&y) % 1024 << " "
              << uint64_t(&y) % 16384 << std::endl;
}
Outputs:
0 9216
0 9216
which means that both x and y are aligned on the stack to 1024 bytes, but not to 16384 bytes.
Let's now look at another piece of code:
#include <cstdint>

void f(uint64_t * x, uint64_t * y) {
    for (int i = 0; i < 16; ++i)
        x[i] ^= y[i];
}
When compiled on GCC with the flags -std=c++20 -O3 -mavx512f, it produces the following asm (only the relevant part is shown):
vmovdqu64 zmm1, ZMMWORD PTR [rdi]
vpxorq zmm0, zmm1, ZMMWORD PTR [rsi]
vmovdqu64 ZMMWORD PTR [rdi], zmm0
vmovdqu64 zmm0, ZMMWORD PTR [rsi+64]
vpxorq zmm0, zmm0, ZMMWORD PTR [rdi+64]
vmovdqu64 ZMMWORD PTR [rdi+64], zmm0
which performs an AVX-512 unaligned load + xor + unaligned store twice. So we can see that our 64-bit array-xor operation was auto-vectorized by GCC into AVX-512 registers, and the loop was unrolled as well.
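For reference, those two unrolled iterations correspond roughly to the following intrinsics (this is my reading of the asm, a sketch rather than actual compiler output):

#include <immintrin.h>
#include <cstdint>

void f_manual(uint64_t * x, uint64_t * y) {
    // First 64-byte chunk: unaligned loads, xor, unaligned store back into x.
    __m512i r0 = _mm512_xor_si512(_mm512_loadu_si512(x),
                                  _mm512_loadu_si512(y));
    _mm512_storeu_si512(x, r0);
    // Second 64-byte chunk: x + 8 and y + 8 are 64 bytes further on.
    __m512i r1 = _mm512_xor_si512(_mm512_loadu_si512(x + 8),
                                  _mm512_loadu_si512(y + 8));
    _mm512_storeu_si512(x + 8, r1);
}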
My question is how to tell GCC that the pointers x and y passed to the function are both aligned to 64 bytes, so that instead of the unaligned load (vmovdqu64) seen in the code above, I can force GCC to use the aligned load (vmovdqa64). It is known that aligned loads/stores can be considerably faster.
My first attempt to force GCC to do aligned loads/stores was the following code:
#include <cstdint>

void g(uint64_t (&x_)[16],
       uint64_t const (&y_)[16]) {
    alignas(64) uint64_t (&x)[16] = x_;
    alignas(64) uint64_t const (&y)[16] = y_;
    for (int i = 0; i < 16; ++i)
        x[i] ^= y[i];
}
but this code still produces the same unaligned loads (vmovdqu64) as the asm of the previous snippet. Hence this alignas(64) hint gives GCC nothing useful for improving the generated assembly.
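One variation that at least expresses the alignment through the type system, rather than on a reference declaration, would be to wrap the array in an over-aligned struct (a sketch; Block is a made-up name, and I have not verified that GCC then emits vmovdqa64):

#include <cstdint>

struct alignas(64) Block {  // every complete Block object is 64-byte aligned
    uint64_t v[16];
};

void g2(Block & x, Block const & y) {
    // The alignment of x.v and y.v is now derivable from the type itself.
    for (int i = 0; i < 16; ++i)
        x.v[i] ^= y.v[i];
}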
My question is: how do I force GCC to perform aligned auto-vectorization, other than by manually writing SIMD intrinsics such as _mm512_load_epi64() for every operation?
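A sketch of the kind of per-pointer hint I have in mind, using the __builtin_assume_aligned builtin that GCC and Clang provide (I have not confirmed that it actually switches the generated code to vmovdqa64):

#include <cstdint>

void f2(uint64_t * x, uint64_t * y) {
    // Promise the compiler that both pointers are 64-byte aligned;
    // calling f2() with misaligned pointers would then be undefined behavior.
    x = static_cast<uint64_t *>(__builtin_assume_aligned(x, 64));
    y = static_cast<uint64_t *>(__builtin_assume_aligned(y, 64));
    for (int i = 0; i < 16; ++i)
        x[i] ^= y[i];
}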
If possible, I need solutions for all of GCC/Clang/MSVC.
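For the portable case, C++20 adds std::assume_aligned in <memory>, which all three compilers could in principle honor (again a sketch; I have not checked the generated code on each compiler):

#include <cstdint>
#include <memory>

void f3(uint64_t * x_, uint64_t * y_) {
    // Undefined behavior if the arguments are not really 64-byte aligned.
    uint64_t * x = std::assume_aligned<64>(x_);
    uint64_t * y = std::assume_aligned<64>(y_);
    for (int i = 0; i < 16; ++i)
        x[i] ^= y[i];
}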
Comments:

…vmovdqu64 instruction, and if my pointer is aligned, then this instruction will be decoded inside the CPU as an aligned instruction and will run at the same speed as the aligned one? Does that mean that manually using the aligned vmovdqa64 will not speed anything up at all, not even a bit? Why, then, was an aligned instruction introduced in the CPU at all, if it gives not even a bit of speedup? – Billups

vmovdqa64 has a modest role as a guard against accidental misalignment. Back in the day (Core2 era and earlier), movdqu with an aligned address used to be significantly less efficient than movdqa, so back then it made more sense that they were separate instructions. – Edholm