GCC fails to optimize aligned std::array like C array

Here's some code which GCC 6 and 7 fail to optimize when using std::array:

#include <array>
#include <cstddef>

static constexpr std::size_t my_elements = 8;

class Foo
{
public:
#ifdef C_ARRAY
    typedef double Vec[my_elements] alignas(32);
#else
    typedef std::array<double, my_elements> Vec alignas(32);
#endif
    void fun1(const Vec&);
    Vec v1{{}};
};

void Foo::fun1(const Vec& __restrict__ v2)
{
    for (unsigned i = 0; i < my_elements; ++i)
    {
        v1[i] += v2[i];
    }
}

Compiling the above with g++ -std=c++14 -O3 -march=haswell -S -DC_ARRAY produces nice code:

    vmovapd ymm0, YMMWORD PTR [rdi]
    vaddpd  ymm0, ymm0, YMMWORD PTR [rsi]
    vmovapd YMMWORD PTR [rdi], ymm0
    vmovapd ymm0, YMMWORD PTR [rdi+32]
    vaddpd  ymm0, ymm0, YMMWORD PTR [rsi+32]
    vmovapd YMMWORD PTR [rdi+32], ymm0
    vzeroupper

That's basically two unrolled iterations of adding four doubles at a time via 256-bit registers. But if you compile without -DC_ARRAY, you get a huge mess starting with this:

    mov     rax, rdi
    shr     rax, 3
    neg     rax
    and     eax, 3
    je      .L7

The code generated in this case (using std::array instead of a plain C array) appears to check the alignment of the input array at run time, even though the typedef specifies 32-byte alignment.
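One way to read that prologue, offered as a hedged sketch rather than a definitive decoding: the three arithmetic instructions appear to compute how many doubles must be handled one at a time before the pointer in rdi reaches a 32-byte boundary, and the je skips that scalar part when the count is zero. The helper name peel_count is hypothetical, and the sketch assumes the address is already a multiple of 8, as the address of a double normally is.

```cpp
#include <cstdint>

// Hypothetical helper mirroring the prologue above:
//   shr rax, 3   -> address expressed in units of 8-byte doubles
//   neg rax      -> negate (unsigned arithmetic, wraps around)
//   and eax, 3   -> keep the value modulo 4 (4 doubles = 32 bytes)
// The result is the number of doubles before the next 32-byte boundary.
// Assumes addr is a multiple of 8.
constexpr unsigned peel_count(std::uintptr_t addr)
{
    return static_cast<unsigned>((std::uintptr_t{0} - (addr >> 3)) & 3u);
}

static_assert(peel_count(0x1000) == 0, "already on a 32-byte boundary");
static_assert(peel_count(0x1008) == 3, "three doubles until 0x1020");
```

If the compiler could prove the 32-byte alignment, this count would always be zero and the whole prologue could be dropped, which is what happens in the C array case.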

It seems that GCC doesn't understand that the contents of a std::array are aligned the same as the std::array itself. This breaks the assumption that using std::array instead of C arrays does not incur a runtime cost.

Is there something simple I'm missing that would fix this? So far I came up with an ugly hack:

void Foo::fun2(const Vec& __restrict__ v2)
{
    typedef double V2 alignas(Foo::Vec);
    const V2* v2a = static_cast<const V2*>(&v2[0]);

    for (unsigned i = 0; i < my_elements; ++i)
    {
        v1[i] += v2a[i];
    }
}

Also note: if my_elements is 4 instead of 8, the problem does not occur. If you use Clang, the problem does not occur.

You can see it live here: https://godbolt.org/g/IXIOst

Interestingly, if you replace v1[i] += v2[i]; with v1._M_elems[i] += v2._M_elems[i]; (which is obviously not portable), GCC manages to optimize the std::array case as well as the C array case.

Possible interpretation: in the gcc dumps (-fdump-tree-all-all), one can see MEM[(struct FooD.25826 *)this_7(D) clique 1 base 0].v1D.25832[i_15] in the C array case, and MEM[(const value_typeD.25834 &)v2_7(D) clique 1 base 1][_1] for std::array. That is, in the second case, gcc may have forgotten that this is part of type Foo and only remembers that it is accessing a double.

This is an abstraction penalty that comes from all the inline functions one has to go through to finally see the array access. Clang still manages to vectorize nicely (even after removing alignas!). This likely means that clang vectorizes without caring about alignment, and indeed it uses instructions like vmovupd that do not require an aligned address.

The hack you found, casting to a pointer to a double type that carries the alignment of Vec (your V2), is another way to let the compiler see, at the point where it handles the memory access, that the data is aligned. With a regular std::array::operator[], the memory access happens inside a member function of std::array, which has no way of knowing that *this happens to be aligned.
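To make that last point concrete, here is a minimal sketch with a hypothetical stand-in for std::array (MiniArray is not the libstdc++ implementation, just the same shape). It assumes a common 64-bit ABI where a struct holding only doubles has the alignment of double. The class type itself only asks for the alignment of its element type; the 32-byte alignment is a property of one particular object, and that is the information that is no longer visible inside operator[].

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for std::array, reduced to what matters here.
template <typename T, std::size_t N>
struct MiniArray
{
    T elems[N];

    // Inside these functions the compiler only sees a T[N] reached through
    // `this`; nothing in the type says the object is over-aligned.
    const T& operator[](std::size_t i) const { return elems[i]; }
    T& operator[](std::size_t i) { return elems[i]; }
};

// The type itself only requires the alignment of its element type
// (assumption: typical 64-bit ABI).
static_assert(alignof(MiniArray<double, 8>) == alignof(double),
              "the class type carries no extra alignment");

// True when p sits on a 32-byte boundary.
inline bool is_32_byte_aligned(const void* p)
{
    return reinterpret_cast<std::uintptr_t>(p) % 32 == 0;
}

// The 32-byte alignment belongs to this particular object, not to its type.
inline bool over_aligned_object_demo()
{
    alignas(32) MiniArray<double, 8> a{};
    assert(is_32_byte_aligned(&a));    // requested by alignas on the object
    return is_32_byte_aligned(&a[0]);  // elems is the first member, same address
}
```

The elements really are aligned at run time; the difficulty described above is only that the optimizer has to recover that fact after inlining.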

GCC also has a builtin to let the compiler know about alignment:

const double* v2a = static_cast<const double*>(__builtin_assume_aligned(v2.data(), 32));
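Put into a complete function, this could look roughly like the following. It is a sketch under stated assumptions, not a drop-in replacement for Foo::fun1: the names n_elements, add_assume_aligned and demo_sum are hypothetical, the alignment is placed on the objects with standard alignas rather than on a typedef, and the debug asserts are there because promising an alignment that does not hold is undefined behavior. __builtin_assume_aligned is a GCC builtin that Clang also accepts.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t n_elements = 8;        // mirrors my_elements (hypothetical name)
using Vec = std::array<double, n_elements>;  // plain type; alignment goes on the objects

// Hypothetical free function: dst[i] += src[i], promising 32-byte alignment.
// Passing a misaligned array is undefined behavior, hence the debug asserts.
inline void add_assume_aligned(Vec& dst, const Vec& src)
{
    assert(reinterpret_cast<std::uintptr_t>(dst.data()) % 32 == 0);
    assert(reinterpret_cast<std::uintptr_t>(src.data()) % 32 == 0);

    double* d = static_cast<double*>(__builtin_assume_aligned(dst.data(), 32));
    const double* s = static_cast<const double*>(__builtin_assume_aligned(src.data(), 32));

    for (std::size_t i = 0; i < n_elements; ++i)
    {
        d[i] += s[i];
    }
}

// Small usage helper: returns the sum of all elements after one addition.
inline double demo_sum()
{
    alignas(32) Vec a{{1, 2, 3, 4, 5, 6, 7, 8}};
    alignas(32) Vec b{{10, 20, 30, 40, 50, 60, 70, 80}};
    add_assume_aligned(a, b);

    double total = 0;
    for (double x : a)
    {
        total += x;
    }
    return total;
}
```

Whether GCC then emits the short aligned sequence shown for the C array still depends on the compiler version and flags, so it is worth checking the generated assembly.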
