I recently wrote some vector code and a matching godbolt example.
typedef float v8f __attribute__((vector_size(32)));
typedef unsigned v8u __attribute__((vector_size(32)));
v8f f(v8f x)
{
    return __builtin_shuffle(x, (v8f){0}, (v8u){1, 2, 3, 4, 5, 6, 7, 8});
}
f:
    vmovaps ymm1, ymm0
    vxorps xmm0, xmm0, xmm0
    vperm2f128 ymm0, ymm1, ymm0, 33
    vpalignr ymm0, ymm0, ymm1, 4
    ret
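For readers unfamiliar with the two-operand form of __builtin_shuffle, here is a rough scalar sketch of what f computes. It assumes the semantics described in the GCC vector-extension documentation: lanes are numbered 0..7 in the first operand and 8..15 in the second, and mask values are taken modulo 16. The names LANES, shuffle2_model and f_model are made up for illustration and appear nowhere in the question's code.

```c
#include <stddef.h>

#define LANES 8 /* v8f holds eight floats (hypothetical helper macro) */

/* Scalar model of __builtin_shuffle(a, b, mask) for 8-lane vectors:
   result lane i is element mask[i] of the 16-element concatenation
   a[0..7], b[0..7]; mask values are reduced modulo 2 * LANES. */
void shuffle2_model(const float a[LANES], const float b[LANES],
                    const unsigned mask[LANES], float out[LANES])
{
    for (size_t i = 0; i < LANES; ++i) {
        unsigned idx = mask[i] % (2 * LANES);
        out[i] = (idx < LANES) ? a[idx] : b[idx - LANES];
    }
}

/* Scalar equivalent of f(): same mask and all-zero second operand
   as in the question. */
void f_model(const float x[LANES], float out[LANES])
{
    static const float zero[LANES] = {0};
    static const unsigned mask[LANES] = {1, 2, 3, 4, 5, 6, 7, 8};
    shuffle2_model(x, zero, mask, out);
}
```

In other words, f shifts the vector down by one lane and fills the top lane with zero, which is why the generated code only needs a zeroed register, one cross-lane permute and one vpalignr.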
I wanted to see how different optimization (-O0/O1/O2/O3) settings affected the code, and all but -O0 gave identical code. -O0 gave the predictable frame-pointer garbage, and also copies the argument x to a stack local variable for no good reason. To fix this, I added the register storage class specifier:
typedef float v8f __attribute__((vector_size(32)));
typedef unsigned v8u __attribute__((vector_size(32)));
v8f f(register v8f x)
{
    return __builtin_shuffle(x, (v8f){0}, (v8u){1, 2, 3, 4, 5, 6, 7, 8});
}
For -O1/O2/O3, the generated code is identical, but at -O0:
f:
    vxorps xmm1, xmm1, xmm1
    vperm2f128 ymm1, ymm0, ymm1, 33
    vpalignr ymm0, ymm1, ymm0, 4
    ret
gcc figured out how to avoid a redundant register-copy. While such a copy might be move-eliminated, this still increases code size for no benefit (-Os is bigger than -O0?).
How/why does gcc generate better code for this at -O0 than -O3?
vmovaps ymm2, ymm0 vxorps xmm0, xmm0, xmm0 remains. Now that I've seen it, it really bugs me. WTH is gcc copying a register only to zero the source immediately? – Novokuznetsk
-O0 generates perfect code. – Novokuznetsk
register doesn't help, unfortunately: gcc.godbolt.org/z/j7ns57 – Betony
register helps than that -O0 (and partially -O1, no vmovaps here, but also no vfnmadd231ps, so a wash) seem to avoid the redundant register copies. Unfortunately, they obviously don't produce anywhere near as good code otherwise, and register can't compensate completely (also, being deprecated in C++?). – Novokuznetsk
vmovdqa for no reason. – Lenoralenore
-O1-optimization to MaximEgorushkin shows the redundant vmovaps not as the first instruction of the function, so I'm not convinced this is as rare as claimed. (Also, seems the -O1-code has a regression in gcc 10.1 on godbolt.) – Novokuznetsk