I was playing around with Compiler Explorer and ran into an anomaly (I think). If I want to make the compiler vectorize a sin
calculation using libmvec, I would write:
#include <cmath>
#define NN 512
typedef float T;
typedef T __attribute__((aligned(NN))) AT;
inline T s(const T x)
{
return sinf(x);
}
void func(AT* __restrict x, AT* __restrict y, int length)
{
if (length & NN-1) __builtin_unreachable();
for (int i = 0; i < length; i++)
{
y[i] = s(x[i]);
}
}
compile with gcc 6.2 and -O3 -march=native -ffast-math
and get
func(float*, float*, int):
testl %edx, %edx
jle .L10
leaq 8(%rsp), %r10
andq $-32, %rsp
pushq -8(%r10)
pushq %rbp
movq %rsp, %rbp
pushq %r14
xorl %r14d, %r14d
pushq %r13
leal -8(%rdx), %r13d
pushq %r12
shrl $3, %r13d
movq %rsi, %r12
pushq %r10
addl $1, %r13d
pushq %rbx
movq %rdi, %rbx
subq $8, %rsp
.L4:
vmovaps (%rbx), %ymm0
addl $1, %r14d
addq $32, %r12
addq $32, %rbx
call _ZGVcN8v_sinf // YAY! Vectorized trig!
vmovaps %ymm0, -32(%r12)
cmpl %r13d, %r14d
jb .L4
vzeroupper
addq $8, %rsp
popq %rbx
popq %r10
popq %r12
popq %r13
popq %r14
popq %rbp
leaq -8(%r10), %rsp
.L10:
ret
But when I add a cosine
to the function, there is no vectorization:
#include <cmath>
#define NN 512
typedef float T;
typedef T __attribute__((aligned(NN))) AT;
inline T f(const T x)
{
return cosf(x)+sinf(x);
}
void func(AT* __restrict x, AT* __restrict y, int length)
{
if (length & NN-1) __builtin_unreachable();
for (int i = 0; i < length; i++)
{
y[i] = f(x[i]);
}
}
which gives:
func(float*, float*, int):
testl %edx, %edx
jle .L10
pushq %r12
leal -1(%rdx), %eax
pushq %rbp
leaq 4(%rdi,%rax,4), %r12
movq %rsi, %rbp
pushq %rbx
movq %rdi, %rbx
subq $16, %rsp
.L4:
vmovss (%rbx), %xmm0
leaq 8(%rsp), %rsi
addq $4, %rbx
addq $4, %rbp
leaq 12(%rsp), %rdi
call sincosf // No vectorization
vmovss 12(%rsp), %xmm0
vaddss 8(%rsp), %xmm0, %xmm0
vmovss %xmm0, -4(%rbp)
cmpq %rbx, %r12
jne .L4
addq $16, %rsp
popq %rbx
popq %rbp
popq %r12
.L10:
ret
I see two good alternatives. Either call a vectorized version of sincosf
or call the vectorized sin
and cos
sequentially. I tried adding -fno-builtin-sincos
to no avail. -fopt-info-vec-missed
complains about complex float, which there is none.
Is this a known issue with gcc? Either way, is there a way I can convince gcc to vectorize the latter example?
(As an aside, is there any way to get gcc < 6 to vectorize trigonometric functions automatically?)
sin(x+c)
, I just don't want to. The code remains easier to read when we say what we mean. In the end, that's the real crux of the issue. – Eugine_ZGVcN8v_sinf
is a vectorized function andsincosf
is not? – Leatherbacksincosf
is the standard libm function, and_ZGVcN8v_sinf
is how libmvec mangles SIMD type names: sourceware.org/glibc/wiki/libmvec mentions_ZGVdN4v_cos
as AVX2cos()
(of 4xdouble
). – Cerousymm0
as a return value vs. the low element ofxmm0
. Anyway yeah, still a missed optimization with current gcc9.1. Using eithersinf()
orcosf()
alone vectorizes, so it would certainly be possible to spill a vector ofsinf
results across a call to a vectorizedcosf
. IDK if maybe there's no vectorizedsincosf
and that optimization (of combining sin and cos into sincos) happens first, defeating vectorization. – Ceroussincosf
returning by reference is a stupid limitation imposed by C only allowing functions to return 1 value; a vectorizedsincosf
(if/when one exists) should hopefully just return in ymm0 and ymm1. There should be a scalar one that returns in registers, too. ISO C'ssincos()
's design of returning both values by reference is also dumb; at least return one of them by value. – Cerous__svml_sincosf8_l9
. gcc.godbolt.org/z/lvKJsJ. IIRC, you can tell gcc that SVML is available, if you have it installed. Clang vectorizes around unpacking to scalar forsincosf
, which is terrible. (And also doesn't properly vectorize plainsinf()
, so maybe it needs options to tell it that libmvec is available.) Probably worth investigating. – Cerous