Here's some code which GCC 6 and 7 fail to optimize when using std::array:
#include <array>
#include <cstddef>
static constexpr std::size_t my_elements = 8;
class Foo
{
public:
#ifdef C_ARRAY
typedef double Vec[my_elements] alignas(32);
#else
typedef std::array<double, my_elements> Vec alignas(32);
#endif
void fun1(const Vec&);
Vec v1{{}};
};
void Foo::fun1(const Vec& __restrict__ v2)
{
for (unsigned i = 0; i < my_elements; ++i)
{
v1[i] += v2[i];
}
}
Compiling the above with g++ -std=c++14 -O3 -march=haswell -S -DC_ARRAY produces nice code:
vmovapd ymm0, YMMWORD PTR [rdi]
vaddpd ymm0, ymm0, YMMWORD PTR [rsi]
vmovapd YMMWORD PTR [rdi], ymm0
vmovapd ymm0, YMMWORD PTR [rdi+32]
vaddpd ymm0, ymm0, YMMWORD PTR [rsi+32]
vmovapd YMMWORD PTR [rdi+32], ymm0
vzeroupper
That's basically two unrolled iterations of adding four doubles at a time via 256-bit registers. But if you compile without -DC_ARRAY, you get a huge mess starting with this:
mov rax, rdi
shr rax, 3
neg rax
and eax, 3
je .L7
The code generated in this case (using std::array instead of a plain C array) seems to check the alignment of the input array, even though the typedef specifies it as aligned to 32 bytes.
It seems that GCC doesn't understand that the contents of an std::array are aligned the same as the std::array itself. This breaks the assumption that using std::array instead of C arrays does not incur a runtime cost.
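The alignment claim itself can be checked separately from the optimizer. Below is a minimal sketch, assuming a mainstream standard library where the element storage is the first and only member of std::array. The names is_aligned, AlignedHolder and elements_share_alignment are made up for illustration, and the alignas is placed on a data member rather than on a typedef so that the example stays within standard C++.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not from the code above): true if p is a multiple of `alignment`.
inline bool is_aligned(const void* p, std::size_t alignment)
{
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}

// Hypothetical holder: the std::array member is over-aligned to 32 bytes.
struct AlignedHolder
{
    alignas(32) std::array<double, 8> values{};
};

static_assert(alignof(AlignedHolder) >= 32, "holder must be over-aligned");

// True when the first element lives at the array object's own address
// and that address is 32-byte aligned.
inline bool elements_share_alignment(const AlignedHolder& h)
{
    const void* object_address = &h.values;
    const void* first_element = h.values.data();
    return object_address == first_element && is_aligned(first_element, 32);
}
```

If elements_share_alignment returns true for every AlignedHolder, the elements really are aligned at run time, so the alignment check GCC emits guards against a case that cannot happen here.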
Is there something simple I'm missing that would fix this? So far I came up with an ugly hack:
void Foo::fun2(const Vec& __restrict__ v2)
{
typedef double V2 alignas(Foo::Vec);
const V2* v2a = static_cast<const V2*>(&v2[0]);
for (unsigned i = 0; i < my_elements; ++i)
{
v1[i] += v2a[i];
}
}
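The same hint can probably be expressed without the aligned typedef by using __builtin_assume_aligned, a GCC builtin that Clang also supports. The following is only a sketch: add_assuming_aligned, kElements and Array are hypothetical names, the function is a free function rather than a member of Foo, the __restrict__ qualifier from fun1 is left out, and whether it yields the same vmovapd sequence as the C array version has not been verified here.

```cpp
#include <array>
#include <cstddef>

constexpr std::size_t kElements = 8;            // stands in for my_elements
using Array = std::array<double, kElements>;    // plain std::array, no alignas on the type

// Hypothetical helper: adds src into dst element-wise.
// Precondition: both arrays start on a 32-byte boundary. The builtin is a
// promise to the compiler, not a check, so breaking it is undefined behaviour.
inline void add_assuming_aligned(Array& dst, const Array& src)
{
    double* d = static_cast<double*>(__builtin_assume_aligned(dst.data(), 32));
    const double* s = static_cast<const double*>(__builtin_assume_aligned(src.data(), 32));
    for (std::size_t i = 0; i < kElements; ++i)
    {
        d[i] += s[i];
    }
}
```

Callers have to guarantee the alignment themselves, for example by declaring the arguments as alignas(32) Array x{};.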
Also note: if my_elements is 4 instead of 8, the problem does not occur. If you use Clang, the problem does not occur.
You can see it live here: https://godbolt.org/g/IXIOst
alignas needs to be on a data member, not on a typedef, but if changing Vec to a nested class holding std::array<...> as an aligned data member, and giving it operator[] overloads, then clang does manage to optimise this. GCC still doesn't. – Ratha
std::array has the same alignment as the std::array? – Superannuated
If Vec is implemented as a class holding double data[my_elements] alignas(32);, with custom operator[], then GCC does manage to optimise this. I suspect the problem is that array::operator[] returns an unaligned double & which comes from its unaligned array::_M_elems member, and the fact that it's part of an aligned array is just a tad too far for the optimiser to be able to see it. – Ratha
alignas, but it does of course get aligned, since it's stored at 0 bytes into the std::array. – Ratha
std::array<double, 8> is standard layout. I would think that with a standard layout class these things would propagate in a straightforward way. – Engage
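The workaround described in the comments, a class holding an aligned C array with custom operator[], might look roughly like the sketch below. AlignedVec, add_into and kWrappedElements are hypothetical names, and the alignas is written before the declaration rather than after the declarator as in the comment. The claim that GCC then vectorizes the loop cleanly comes from the comment and has not been re-checked here.

```cpp
#include <cstddef>

constexpr std::size_t kWrappedElements = 8;   // stands in for my_elements

// Hypothetical replacement for Vec: the alignment is attached directly to the
// data member, so operator[] indexes into storage the compiler knows is aligned.
class AlignedVec
{
public:
    double& operator[](std::size_t i) { return data_[i]; }
    const double& operator[](std::size_t i) const { return data_[i]; }
    static constexpr std::size_t size() { return kWrappedElements; }

private:
    alignas(32) double data_[kWrappedElements] = {};
};

// Same loop as Foo::fun1, written as a free function over the wrapper.
inline void add_into(AlignedVec& dst, const AlignedVec& src)
{
    for (std::size_t i = 0; i < AlignedVec::size(); ++i)
    {
        dst[i] += src[i];
    }
}
```

The trade-off is that the wrapper gives up the rest of the std::array interface (iterators, data(), fill() and so on) unless each piece is forwarded by hand.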