Alignment attribute to force aligned load/store in auto-vectorization of GCC/CLang

It is well known that GCC and Clang auto-vectorize loops using SIMD instructions.

It is also known that the standard C++ alignas() specifier can, among other uses, align a stack variable, as in the following code:

Try it online!

#include <cstdint>
#include <iostream>

int main() {
    alignas(1024) int x[3] = {1, 2, 3};
    alignas(1024) int (&y)[3] = *(&x);

    std::cout << uint64_t(&x) % 1024 << " "
        << uint64_t(&x) % 16384 << std::endl;
    std::cout << uint64_t(&y) % 1024 << " "
        << uint64_t(&y) % 16384 << std::endl;
}

Outputs:

0 9216
0 9216

which shows that both x and y are aligned on the stack to 1024 bytes, but not to 16384 bytes.
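As an aside, the modulo check used above can be wrapped in a small helper; a sketch (the name is_aligned is mine, not part of the question):

```cpp
#include <cstddef>
#include <cstdint>

// True iff p's address is a multiple of `alignment` (a power of two).
bool is_aligned(const void* p, std::size_t alignment) {
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}
```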

Now let's look at another piece of code:

Try it online!

#include <cstdint>

void f(uint64_t * x, uint64_t * y) {
    for (int i = 0; i < 16; ++i)
        x[i] ^= y[i];
}

When compiled on GCC with the flags -std=c++20 -O3 -mavx512f, it produces the following asm (relevant part shown):

        vmovdqu64       zmm1, ZMMWORD PTR [rdi]
        vpxorq  zmm0, zmm1, ZMMWORD PTR [rsi]
        vmovdqu64       ZMMWORD PTR [rdi], zmm0
        vmovdqu64       zmm0, ZMMWORD PTR [rsi+64]
        vpxorq  zmm0, zmm0, ZMMWORD PTR [rdi+64]
        vmovdqu64       ZMMWORD PTR [rdi+64], zmm0

which performs an AVX-512 unaligned load + xor + unaligned store twice. So the 64-bit array-xor operation was auto-vectorized by GCC into AVX-512 registers, and the loop was unrolled as well.

My question is how to tell GCC that the pointers x and y passed to the function are both aligned to 64 bytes, so that instead of the unaligned load (vmovdqu64) seen in the code above, GCC can be forced to use the aligned load (vmovdqa64). It is said that aligned load/store can be considerably faster.

My first attempt to force GCC to use aligned load/store was the following code:

Try it online!

#include <cstdint>

void  g(uint64_t (&x_)[16],
        uint64_t const (&y_)[16]) {

    alignas(64) uint64_t (&x)[16] = x_;
    alignas(64) uint64_t const (&y)[16] = y_;

    for (int i = 0; i < 16; ++i)
        x[i] ^= y[i];
}

but this code still produces an unaligned load (vmovdqu64), the same as in the asm of the previous snippet. So this alignas(64) hint gives GCC nothing useful for improving the assembly.

My question is: how do I force GCC to emit aligned auto-vectorized code, short of manually writing SIMD intrinsics such as _mm512_load_epi64() for every operation?

If possible, I'd like solutions for all of GCC, Clang, and MSVC.

Billups answered 20/11, 2021 at 12:9 Comment(4)
The aligned-load instruction is not required to get aligned loads: if the address is aligned, the load is aligned. See e.g. choice between aligned vs. unaligned x86 SIMD instructions – Edholm
@Edholm Do you mean that if the assembly code contains the unaligned vmovdqu64 instruction and my pointer is aligned, then this instruction will execute inside the CPU at the same speed as the aligned one? Does that mean manually using the aligned vmovdqa64 will not speed anything up at all, not a bit? Why then was an aligned instruction introduced into the CPU, if it gives not even a bit of speedup? – Billups
vmovdqa64 has a modest role as a guard against accidental misalignment. Back in the day (Core2 era and earlier) movdqu with an aligned address used to be significantly less efficient than movdqa, so back then it made more sense that they were separate instructions. – Edholm
@Billups It appears it was introduced to be faster on older processors, but it is not really useful anymore. The instructions are kept for backward compatibility. So yes, there should be no speedup as long as you do not target old architectures AND you enable AVX so as to use the VEX prefix (AVX is not enabled by default in GCC/Clang/MSVC). The benefit of the VEX prefix should only appear if your code is bound by instruction decoding, which is not very frequent for good SIMD code on newer processors (unless the loops are aggressively unrolled with a lot of loads/stores). – Angie

Though not portable across all compilers, __builtin_assume_aligned tells GCC to assume the pointers are aligned.

I often use a different, more portable strategy based on a helper struct:

#include <array>    // std::array
#include <cstddef>  // size_t
#include <cstdint>  // uint64_t
using std::size_t;
using std::uint64_t;

// Note: mccl::hammingweight() is the author's popcount helper from the
// mccl library; C++20's std::popcount would serve the same purpose.
template<size_t Bits>
struct alignas(Bits/8) uint64_block_t
{
    static const size_t bits = Bits;
    static const size_t size = bits/64;
    
    std::array<uint64_t,size> v;
    
    uint64_block_t& operator&=(const uint64_block_t& v2) { for (size_t i = 0; i < size; ++i) v[i] &= v2.v[i]; return *this; }
    uint64_block_t& operator^=(const uint64_block_t& v2) { for (size_t i = 0; i < size; ++i) v[i] ^= v2.v[i]; return *this; }
    uint64_block_t& operator|=(const uint64_block_t& v2) { for (size_t i = 0; i < size; ++i) v[i] |= v2.v[i]; return *this; }
    uint64_block_t operator&(const uint64_block_t& v2) const { uint64_block_t tmp(*this); return tmp &= v2; }
    uint64_block_t operator^(const uint64_block_t& v2) const { uint64_block_t tmp(*this); return tmp ^= v2; }
    uint64_block_t operator|(const uint64_block_t& v2) const { uint64_block_t tmp(*this); return tmp |= v2; }
    uint64_block_t operator~() const { uint64_block_t tmp; for (size_t i = 0; i < size; ++i) tmp.v[i] = ~v[i]; return tmp; }
    bool operator==(const uint64_block_t& v2) const { for (size_t i = 0; i < size; ++i) if (v[i] != v2.v[i]) return false; return true; }
    bool operator!=(const uint64_block_t& v2) const { for (size_t i = 0; i < size; ++i) if (v[i] != v2.v[i]) return true; return false; }
    
    bool get_bit(size_t c) const   { return (v[c/64]>>(c%64))&1; }
    void set_bit(size_t c)         { v[c/64] |= uint64_t(1)<<(c%64); }
    void flip_bit(size_t c)        { v[c/64] ^= uint64_t(1)<<(c%64); }
    void clear_bit(size_t c)       { v[c/64] &= ~(uint64_t(1)<<(c%64)); }
    void set_bit(size_t c, bool b) { v[c/64] &= ~(uint64_t(1)<<(c%64)); v[c/64] |= uint64_t(b ? 1 : 0)<<(c%64); }
    size_t hammingweight() const   { size_t w = 0; for (size_t i = 0; i < size; ++i) w += mccl::hammingweight(v[i]); return w; }
    bool parity() const            { uint64_t x = 0; for (size_t i = 0; i < size; ++i) x ^= v[i]; return mccl::hammingweight(x)%2; }
};

and then I convert the pointer to uint64_t into a pointer to this struct using reinterpret_cast.

Converting a loop over uint64_t into a loop over these blocks typically auto-vectorizes very well.
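For instance, the question's xor loop over raw uint64_t pointers can be recast over such blocks; a minimal sketch (only operator^= kept, with the 1024-bit instantiation matching the question's 16-element arrays):

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Stripped-down version of the helper struct above.
template<std::size_t Bits>
struct alignas(Bits / 8) uint64_block_t {
    static const std::size_t size = Bits / 64;
    std::array<std::uint64_t, size> v;

    uint64_block_t& operator^=(const uint64_block_t& v2) {
        for (std::size_t i = 0; i < size; ++i)
            v[i] ^= v2.v[i];
        return *this;
    }
};

// 1024 bits == 16 x uint64_t; alignas(128) on the struct lets the
// compiler assume alignment and vectorize with aligned accesses.
void f(uint64_block_t<1024>& x, const uint64_block_t<1024>& y) {
    x ^= y;
}
```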

Thurible answered 20/11, 2021 at 13:45 Comment(10)
std::assume_aligned is the portable way to access __builtin_assume_aligned – Aurelie
reinterpret_casting like that is UB though. – Battaglia
@Yuri Q: is it really UB for a pointer to a contiguous array of uint64_t that is guaranteed to have that alignment? – Thurible
@Marc the alignment is irrelevant. There is no object of type uint64_block_t<N> at the location pointed to by that pointer, so you aren't allowed to dereference it. – Battaglia
@Yuri, would the converse be allowed, i.e. not UB? Meaning, having a pointer to uint64_block_t<512> and casting it back to a pointer to uint64_t? – Thurible
@Marc yes, but then you're only allowed to access the first Bits / 64 elements, i.e. the ones that are within the array within the std::array within the first block. – Battaglia
@Yuri, I think I would disagree with that if the block is guaranteed to have no padding and to be exactly an array of uint64_t. If you're given a pointer to an array of blocks and you're allowed to reinterpret_cast to uint64_t* and access the uint64_t within each, then it stands to reason you can chain those accesses across a contiguous region of uint64_t. At each accessed memory location there is then precisely a uint64_t (as a member of a block). – Thurible
The problem is likely the strict aliasing rule here, and more specifically the fact that the types are likely not "similar". The specification is not very clear on this point, but it appears this case is not explicitly allowed, so it theoretically results in UB. In practice, I think std::array could have stronger alignment requirements than its contents (although I am not aware of any compiler doing that). AFAIK, GCC uses a may_alias attribute to ensure that there is no problem with x86/x86-64 SIMD types. – Angie
Reading up on the strict aliasing rule, there is the notion of pointer-interconvertibility, which allows converting a pointer to a (standard-layout) object into a pointer to its first member. This is actually treated on cppreference under static_cast, so reinterpret_cast might not be needed to go from a block to a uint64_t pointer. But reinterpret_cast would be needed to go back and recover a pointer to a block. – Thurible
@Marc The fact that the memory is laid out exactly the same way as a single big array of ints doesn't matter. C++ abstract-machine pointers are not simply addresses into linear memory (though in practice they are of course implemented like that almost universally), but a distinct concept with specified behavior. There are specific situations in which a pointer can be incremented, and this is not one of them. See #42420616 (actually I was wrong about accessing the ints in the first block, that's not allowed either). – Battaglia

@MarcStevens has just suggested a working solution for my question, using __builtin_assume_aligned:

Try it online!

#include <cstdint>

void f(uint64_t * x_, uint64_t * y_) {
    uint64_t * x = (uint64_t *)__builtin_assume_aligned(x_, 64);
    uint64_t * y = (uint64_t *)__builtin_assume_aligned(y_, 64);

    for (int i = 0; i < 16; ++i)
        x[i] ^= y[i];
}

It indeed produces code with the aligned vmovdqa64 instruction.

But only GCC produces the aligned instruction. Clang still uses the unaligned one (see here); also, Clang uses AVX-512 registers only with more than 16 elements.

So Clang and MSVC solutions are still welcome.

Billups answered 20/11, 2021 at 13:40 Comment(3)
Clang does "understand" __builtin_assume_aligned; for -march=icelake-client (which for now implies -mprefer-vector-width=256) it uses vmovaps. godbolt.org/z/shq9fr6GT. Why are you worried about the asm not using vmovdqa64? Do you want to detect accidental misalignment? __builtin_assume_aligned makes sure future compiler versions won't, for example, emit asm that goes scalar until an alignment boundary, regardless of whether it bothers with different instructions for the aligned case. (Because there's no perf difference at all.) – Supervisory
@PeterCordes The reason I care about strictly aligned load/store comes from the initial question I asked, and I asked it only because I thought that on modern CPUs aligned load/store is faster. But as you and others have said, aligned load/store instructions are exactly the same speed as unaligned ones, so that closes my initial question, since I asked it only with the possibility of extra speed in mind, which turns out not to be the case. But just to leave a clean Question/Answer, even if it is silly, I still posted an answer, now purely out of curiosity. – Billups
It's not silly to post about __builtin_assume_aligned; it's actually important for GCC 10 and earlier: godbolt.org/z/hTeqaxa8v (assume_aligned(16)) / Why doesn't gcc resolve _mm256_loadu_pd as single vmovupd?. So yes, it is useful for the compiler to know the alignment, just not for the exact reason you thought. vmovdqu64 only slows down if the data happens to be misaligned at runtime, instead of faulting. – Supervisory

As I infer from your own answer, you're interested in an MSVC solution too.

MSVC understands proper use of alignas as well as its own __declspec(align); it also understands __builtin_assume_aligned, but it deliberately does nothing with the known alignment.

My report was closed as "Duplicate":

The related reports were closed as "Not a bug":

MSVC does still take advantage of the alignment of global variables, if it can see that the pointer points to a global variable. Even this does not work in every case.
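A sketch of that global-variable case (the names are mine): when the aligned definitions themselves are visible at the use site, the compiler can prove the alignment without any hint, though per this answer MSVC exploits this only sometimes:

```cpp
#include <cstdint>

// Globals whose 64-byte alignment the compiler can observe directly.
alignas(64) std::uint64_t g_x[16];
alignas(64) std::uint64_t g_y[16];

// No assume-aligned hint needed here: the visible definitions above
// already prove that g_x and g_y are 64-byte aligned.
void xor_globals() {
    for (int i = 0; i < 16; ++i)
        g_x[i] ^= g_y[i];
}
```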

Aurelie answered 20/11, 2021 at 14:15 Comment(4)
Thanks for the MSVC info, upvoted. Do you know, then, whether there exist Clang/MSVC solutions for my question? Because Clang also ignores this __builtin_assume_aligned(), as you can see from the link in my answer. – Billups
@Arty, global variables work on Clang, still not on MSVC: godbolt.org/z/8YGjboMYq – Aurelie
Local variables also work on Clang, but only starting from a loop of 1024 iterations instead of 16; see the example here. – Billups
This is a very bad decision by MSVC, since it hides bugs. Even the claimed rationale is unreasonable: to avoid an alignment fault where an aligned instruction accesses misaligned data. IMHO this translates to hiding bugs, and it may not even work all the time. – Insufferable
