gcc optimization better at -O0 than -O3

I recently made some vector-code and an appropriate godbolt example.

typedef float v8f __attribute__((vector_size(32)));
typedef unsigned v8u __attribute__((vector_size(32)));

v8f f(v8f x)
{
  return __builtin_shuffle(x, (v8f){0}, (v8u){1, 2, 3, 4, 5, 6, 7, 8});
}

f:
        vmovaps ymm1, ymm0
        vxorps  xmm0, xmm0, xmm0
        vperm2f128      ymm0, ymm1, ymm0, 33
        vpalignr        ymm0, ymm0, ymm1, 4
        ret
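
For readers unfamiliar with GCC's __builtin_shuffle: in the two-input form, mask indices 0 to 7 pick elements from the first vector and 8 to 15 from the second, so this mask moves every element down one lane and pulls a zero into the top lane. Below is a minimal sketch that checks this against a scalar loop. The names shuffle_shift and shuffle_shift_ref are hypothetical, and plain arrays are used at the function boundary (an assumption of this sketch, not how the question's f is declared) so that it does not depend on how vectors are passed.

```c
#include <string.h>

typedef float v8f __attribute__((vector_size(32)));
typedef unsigned v8u __attribute__((vector_size(32)));

/* Hypothetical wrapper around the shuffle from the question.
   Arrays cross the function boundary; the vector lives only in locals. */
void shuffle_shift(const float in[8], float out[8])
{
  v8f x;
  memcpy(&x, in, sizeof x);
  /* Indices 1..7 select x[1]..x[7]; index 8 selects element 0 of the zero vector. */
  v8f y = __builtin_shuffle(x, (v8f){0}, (v8u){1, 2, 3, 4, 5, 6, 7, 8});
  memcpy(out, &y, sizeof y);
}

/* Scalar reference: shift down by one lane, zero in the last lane. */
void shuffle_shift_ref(const float in[8], float out[8])
{
  for (int i = 0; i < 7; i++)
    out[i] = in[i + 1];
  out[7] = 0.0f;
}
```

Both functions should produce the same eight floats for any input. __builtin_shuffle is a GCC extension, so this sketch assumes gcc as the compiler.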

I wanted to see how the different optimization settings (-O0/-O1/-O2/-O3) affected the code, and all but -O0 gave identical code. -O0 gave the predictable frame-pointer garbage and also copied the argument x to a stack local variable for no good reason. To fix this, I added the register storage class specifier:

typedef float v8f __attribute__((vector_size(32)));
typedef unsigned v8u __attribute__((vector_size(32)));

v8f f(register v8f x)
{
  return __builtin_shuffle(x, (v8f){0}, (v8u){1, 2, 3, 4, 5, 6, 7, 8});
}

For -O1/-O2/-O3, the generated code is the same as before, but at -O0:

f:
        vxorps  xmm1, xmm1, xmm1
        vperm2f128      ymm1, ymm0, ymm1, 33
        vpalignr        ymm0, ymm1, ymm0, 4
        ret

At -O0, gcc figured out how to avoid the redundant register copy. While such a copy might be move-eliminated by the CPU, it still increases code size for no benefit (so -Os output is bigger than -O0?).
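
To see that the three-instruction -O0 sequence really computes the shuffle without needing the extra vmovaps, the two AVX instructions can be modelled in scalar C. This is only a sketch: 8-float arrays stand in for YMM registers, the helper names are hypothetical, and the vpalignr model assumes the byte count is a multiple of 4 so that it can work on whole floats.

```c
#include <string.h>

/* vperm2f128 dst, a, b, imm8: each 128-bit half of dst is picked by a
   2-bit selector (0 = a.lo, 1 = a.hi, 2 = b.lo, 3 = b.hi); selector bit 3
   zeroes that half instead. Low nibble controls dst.lo, high nibble dst.hi. */
static void model_vperm2f128(float dst[8], const float a[8],
                             const float b[8], unsigned imm)
{
  float tmp[8];
  for (int half = 0; half < 2; half++) {
    unsigned ctl = (imm >> (4 * half)) & 0xF;
    const float *src = (ctl & 2) ? b : a;
    unsigned off = (ctl & 1) ? 4 : 0;
    for (int i = 0; i < 4; i++)
      tmp[4 * half + i] = (ctl & 8) ? 0.0f : src[off + i];
  }
  memcpy(dst, tmp, sizeof tmp);   /* tmp makes dst == a or dst == b safe */
}

/* vpalignr dst, a, b, imm8 (256-bit): per 128-bit lane, concatenate
   a_lane (high) with b_lane (low), shift right by imm8 bytes, keep the
   low 16 bytes. Modelled here only for imm8 that is a multiple of 4. */
static void model_vpalignr(float dst[8], const float a[8],
                           const float b[8], unsigned imm_bytes)
{
  float tmp[8];
  unsigned shift = imm_bytes / 4;
  for (int lane = 0; lane < 2; lane++) {
    for (unsigned i = 0; i < 4; i++) {
      unsigned idx = i + shift;
      float v = 0.0f;
      if (idx < 4)
        v = b[4 * lane + idx];
      else if (idx < 8)
        v = a[4 * lane + idx - 4];
      tmp[4 * lane + i] = v;
    }
  }
  memcpy(dst, tmp, sizeof tmp);
}

/* The -O0 listing: x arrives in ymm0, the result is returned in ymm0. */
void model_O0_sequence(const float x[8], float out[8])
{
  float ymm0[8], ymm1[8] = {0};            /* vxorps xmm1, xmm1, xmm1 */
  memcpy(ymm0, x, sizeof ymm0);
  model_vperm2f128(ymm1, ymm0, ymm1, 33);  /* ymm1 = {x4..x7, 0, 0, 0, 0} */
  model_vpalignr(ymm0, ymm1, ymm0, 4);     /* ymm0 = {x1..x7, 0} */
  memcpy(out, ymm0, sizeof ymm0);
}
```

The zero vector is built in a scratch register while x stays in ymm0, so nothing has to be copied out of the way first. The optimized listing instead zeroes ymm0 itself, which is what forces the preceding vmovaps.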

How/why does gcc generate better code for this at -O0 than -O3?

Novokuznetsk answered 23/5, 2020 at 16:54 Comment(8)
Of course in real life you'd be inlining this function. Do you still get redundant moves if the function is inlined into a more realistic context? - Hospitable
Looks to be similar to gcc bug Sub-optimal YMM register allocation. - Betony
@NateEldredge Good point! godbolt says that even inlined, the redundant vmovaps ymm2, ymm0 / vxorps xmm0, xmm0, xmm0 pair remains. Now that I've seen it, it really bugs me. WTH is gcc copying a register only to zero the source immediately? - Novokuznetsk
@MaximEgorushkin Looks very similar, but 1) my repro may be smaller (in terms of assembly generated) and 2) in my repro -O0 generates perfect code. - Novokuznetsk
In my example register doesn't help, unfortunately: gcc.godbolt.org/z/j7ns57 - Betony
@MaximEgorushkin Yeah, it's not so much that register helps as that -O0 (and partially -O1: no vmovaps here, but also no vfnmadd231ps, so a wash) seems to avoid the redundant register copies. Unfortunately, they obviously don't produce anywhere near as good code otherwise, and register can't compensate completely (also, being deprecated in C++?). - Novokuznetsk
@EOF: your "inlining" test is still inlining into a relatively tiny function; as commented on Maxim's GCC bug, these wasted MOV instructions are more common in tiny functions due to hard constraints from calling convention boundaries. They do sometimes happen for real after inlining into a loop or something non-trivial, but in my experience usually only when you want the 128-bit low and high halves of a vector and GCC decides to zero-extend the low half to 256 with an XMM vmovdqa for no reason. - Lenoralenore
@PeterCordes Well, maybe. Of course you also wouldn't care so much about this if it only happens at the start of a function and the (inlined) function does a lot more work, proportionally. But a variant of the link I gave MaximEgorushkin for the borderline better -O1 optimization shows the redundant vmovaps not as the first instruction of the function, so I'm not convinced this is as rare as claimed. (Also, it seems the -O1 code has a regression in gcc 10.1 on godbolt.) - Novokuznetsk
