GCC fails to optimize aligned std::array like C array

Here's some code which GCC 6 and 7 fail to optimize when using std::array:

#include <array>
#include <cstddef>

static constexpr std::size_t my_elements = 8;

class Foo
{
public:
#ifdef C_ARRAY
    typedef double Vec[my_elements] alignas(32);
#else
    typedef std::array<double, my_elements> Vec alignas(32);
#endif
    void fun1(const Vec&);
    Vec v1{{}};
};

void Foo::fun1(const Vec& __restrict__ v2)
{
    for (unsigned i = 0; i < my_elements; ++i)
    {
        v1[i] += v2[i];
    }
}

Compiling the above with g++ -std=c++14 -O3 -march=haswell -S -DC_ARRAY produces nice code:

    vmovapd ymm0, YMMWORD PTR [rdi]
    vaddpd  ymm0, ymm0, YMMWORD PTR [rsi]
    vmovapd YMMWORD PTR [rdi], ymm0
    vmovapd ymm0, YMMWORD PTR [rdi+32]
    vaddpd  ymm0, ymm0, YMMWORD PTR [rsi+32]
    vmovapd YMMWORD PTR [rdi+32], ymm0
    vzeroupper

That's basically two unrolled iterations of adding four doubles at a time via 256-bit registers. But if you compile without -DC_ARRAY, you get a huge mess starting with this:

    mov     rax, rdi
    shr     rax, 3
    neg     rax
    and     eax, 3
    je      .L7

The code generated in this case (using std::array instead of a plain C array) seems to check for alignment of the input array--even though it is specified in the typedef as aligned to 32 bytes.

It seems that GCC doesn't understand that the contents of an std::array are aligned the same as the std::array itself. This breaks the assumption that using std::array instead of C arrays does not incur a runtime cost.
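The layout claim itself can be checked directly. The following is a minimal sketch, not part of the original code: `AlignedVec`, `first_element_is_aligned`, `data_starts_at_object`, and `check_layout` are hypothetical names introduced for illustration, and the alignment is requested on a data member (the form Clang also accepts) rather than on the typedef. It assumes a standard library in which `std::array<double, N>` is standard layout, which holds for libstdc++ and libc++.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <type_traits>

constexpr std::size_t my_elements = 8;

// Hypothetical wrapper (assumption): alignas is placed on a data member.
struct AlignedVec
{
    alignas(32) std::array<double, my_elements> v;
};

static_assert(alignof(AlignedVec) == 32, "wrapper should be 32-byte aligned");
static_assert(std::is_standard_layout<std::array<double, my_elements>>::value,
              "assumes std::array<double, N> is standard layout");

// True if the first element sits on a 32-byte boundary.
inline bool first_element_is_aligned(const AlignedVec& a)
{
    return reinterpret_cast<std::uintptr_t>(a.v.data()) % 32 == 0;
}

// True if the elements start at the same address as the std::array object.
inline bool data_starts_at_object(const AlignedVec& a)
{
    return static_cast<const void*>(a.v.data()) == static_cast<const void*>(&a.v);
}

// Runtime check that the elements inherit the alignment of the enclosing object.
inline void check_layout()
{
    AlignedVec a{};
    assert(first_element_is_aligned(a));
    assert(data_starts_at_object(a));
}
```

If both checks pass, the elements really are 32-byte aligned at run time, so the extra alignment prologue reflects information lost by the optimizer rather than an actual layout difference.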

Is there something simple I'm missing that would fix this? So far I came up with an ugly hack:

void Foo::fun2(const Vec& __restrict__ v2)
{
    typedef double V2 alignas(Foo::Vec);
    const V2* v2a = static_cast<const V2*>(&v2[0]);

    for (unsigned i = 0; i < my_elements; ++i)
    {
        v1[i] += v2a[i];
    }
}

Also note: if my_elements is 4 instead of 8, the problem does not occur. If you use Clang, the problem does not occur.

You can see it live here: https://godbolt.org/g/IXIOst

Respire answered 27/4, 2017 at 8:0 Comment(10)
FWIW, clang complains that alignas needs to be on a data member, not on a typedef, but if you change Vec to a nested class holding std::array<...> as an aligned data member and give it operator[] overloads, then clang does manage to optimise this. GCC still doesn't. - Ratha
Does the array underlying std::array have the same alignment as the std::array? - Superannuated
And if Vec is implemented as a class holding double data[my_elements] alignas(32);, with custom operator[], then GCC does manage to optimise this. I suspect the problem is that array::operator[] returns an unaligned double & which comes from its unaligned array::_M_elems member, and the fact that it's part of an aligned array is just a tad too far for the optimiser to be able to see it. - Ratha
@NickyC No, but yes. It doesn't get an implicit alignas, but it does of course get aligned, since it's stored at 0 bytes into the std::array. - Ratha
Surprising because you can verify easily that std::array<double, 8> is standard layout. I would think that with a standard layout class these things would propagate in a straightforward way. - Engage
So, obviously a compiler bug. If you want it solved you should report it via bugzilla. - Mirilla
@RustyX: While I would love for GCC to someday fix this, my question here is stated: Is there something simple I'm missing that would fix this? In other words, I would like a relatively unobtrusive workaround that would enable optimum performance for std::array on GCC 6. I don't want to simply hold my breath for GCC 8. - Respire
@RustyX: I've reported it here: gcc.gnu.org/bugzilla/show_bug.cgi?id=80561 - Respire
I am not sure this is valid C++ (which is why clang can't compile it). eel.is/c++draft/dcl.align#1 "An alignment-specifier may be applied to a variable or to a class data member, but it shall not be applied to a bit-field, a function parameter, or an exception-declaration (18.3)." - Prank
@eleanora: You can read the earlier comment from "hvd" that explains how to make an equivalent version that Clang will compile and GCC still fails to optimize. - Respire

Interestingly, if you replace v1[i] += v2[i]; with v1._M_elems[i] += v2._M_elems[i]; (which is obviously not portable), gcc manages to optimize the std::array case as well as the case of the C array.

Possible interpretation: in the gcc dumps (-fdump-tree-all-all), one can see MEM[(struct FooD.25826 *)this_7(D) clique 1 base 0].v1D.25832[i_15] in the C array case, and MEM[(const value_typeD.25834 &)v2_7(D) clique 1 base 1][_1] for std::array. That is, in the second case, gcc may have forgotten that this is part of type Foo and only remembers that it is accessing a double.

This is an abstraction penalty that comes from all the inline functions one has to go through to finally see the array access. Clang still manages to vectorize nicely (even after removing alignas!). This likely means that clang vectorizes without caring about alignment, and indeed it uses instructions like vmovupd that do not require an aligned address.

The hack you found, casting to a pointer to a double type that carries the alignment of Vec, is another way to let the compiler see, when it handles the memory access, that the type being handled is aligned. For a regular std::array::operator[], the memory access happens inside a member function of std::array, which doesn't have any clue that *this happens to be aligned.
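One of the comments on the question describes a wrapper that keeps the aligned C array as a data member and supplies its own operator[], so the access no longer goes through std::array's member functions. A minimal sketch of that idea follows. The layout of this `Vec` is an assumption based on that comment, not code from the question, and it writes alignas before the declaration rather than after the declarator. The comment reports that GCC vectorizes this form; the sketch does not verify the generated code.

```cpp
#include <cstddef>

constexpr std::size_t my_elements = 8;

// Hypothetical replacement for the Vec typedef (assumption, based on the comment):
// the alignment is attached to the C array member that is actually indexed.
struct Vec
{
    alignas(32) double data[my_elements];

    double& operator[](std::size_t i) { return data[i]; }
    const double& operator[](std::size_t i) const { return data[i]; }
};

class Foo
{
public:
    void fun1(const Vec&);
    Vec v1{};  // value-initialized, so all elements start at 0.0
};

// Same loop as in the question; __restrict__ is a GCC/Clang extension.
void Foo::fun1(const Vec& __restrict__ v2)
{
    for (std::size_t i = 0; i < my_elements; ++i)
    {
        v1[i] += v2[i];
    }
}
```

Compiling this with the flags from the question and inspecting the output of -S is the way to confirm whether the alignment prologue disappears on a given GCC version.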

Gcc also has a builtin to let the compiler know about alignment:

const double* v2a = static_cast<const double*>(__builtin_assume_aligned(v2.data(), 32));
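In context, the builtin could be used roughly as follows. This is a sketch under assumptions: `Vec` here is a hypothetical wrapper with the alignas on a data member, and `add_into` is a free function invented for illustration instead of the member function from the question. The promise made to `__builtin_assume_aligned` must actually hold, otherwise the behavior is undefined; the aligned member is what guarantees it here.

```cpp
#include <array>
#include <cstddef>

constexpr std::size_t my_elements = 8;

// Hypothetical wrapper (assumption): guarantees the 32-byte alignment
// that is promised to the compiler below.
struct Vec
{
    alignas(32) std::array<double, my_elements> a;
};

// Hypothetical free function: adds src into dst element by element.
// dst and src must be distinct objects because of __restrict__.
void add_into(Vec& __restrict__ dst, const Vec& __restrict__ src)
{
    // __builtin_assume_aligned returns void*, so the result is cast back.
    double* d = static_cast<double*>(
        __builtin_assume_aligned(dst.a.data(), 32));
    const double* s = static_cast<const double*>(
        __builtin_assume_aligned(src.a.data(), 32));

    for (std::size_t i = 0; i < my_elements; ++i)
    {
        d[i] += s[i];
    }
}
```

The builtin is available in both GCC and Clang, so the same source should build with either compiler.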
Eudemonia answered 27/4, 2017 at 20:20 Comment(1)
Thanks a lot for filing the bug report :-) - Eudemonia
