Vectorization of sin and cos
Asked Answered
E

1

10

I was playing around with Compiler Explorer and ran into an anomaly (I think). If I want to make the compiler vectorize a sin calculation using libmvec, I would write:

#include <cmath>

#define NN 512
typedef float T;
typedef T __attribute__((aligned(NN))) AT;

inline T s(const T x)
{
  return sinf(x);
}

void func(AT* __restrict x, AT* __restrict y, int length)
{
  if (length & NN-1) __builtin_unreachable();
  for (int i = 0; i < length; i++)
  {
    y[i] = s(x[i]);
  }
}

compile with gcc 6.2 and -O3 -march=native -ffast-math and get

func(float*, float*, int):
        testl   %edx, %edx
        jle     .L10
        leaq    8(%rsp), %r10
        andq    $-32, %rsp
        pushq   -8(%r10)
        pushq   %rbp
        movq    %rsp, %rbp
        pushq   %r14
        xorl    %r14d, %r14d
        pushq   %r13
        leal    -8(%rdx), %r13d
        pushq   %r12
        shrl    $3, %r13d
        movq    %rsi, %r12
        pushq   %r10
        addl    $1, %r13d
        pushq   %rbx
        movq    %rdi, %rbx
        subq    $8, %rsp
.L4:
        vmovaps (%rbx), %ymm0
        addl    $1, %r14d
        addq    $32, %r12
        addq    $32, %rbx
        call    _ZGVcN8v_sinf      // YAY! Vectorized trig!
        vmovaps %ymm0, -32(%r12)
        cmpl    %r13d, %r14d
        jb      .L4
        vzeroupper
        addq    $8, %rsp
        popq    %rbx
        popq    %r10
        popq    %r12
        popq    %r13
        popq    %r14
        popq    %rbp
        leaq    -8(%r10), %rsp
.L10:
        ret

But when I add a cosine to the function, there is no vectorization:

#include <cmath>

#define NN 512
typedef float T;
typedef T __attribute__((aligned(NN))) AT;

inline T f(const T x)
{
  return cosf(x)+sinf(x);
}

void func(AT* __restrict x, AT* __restrict y, int length)
{
  if (length & NN-1) __builtin_unreachable();
  for (int i = 0; i < length; i++)
  {
    y[i] = f(x[i]);
  }
}

which gives:

func(float*, float*, int):
        testl   %edx, %edx
        jle     .L10
        pushq   %r12
        leal    -1(%rdx), %eax
        pushq   %rbp
        leaq    4(%rdi,%rax,4), %r12
        movq    %rsi, %rbp
        pushq   %rbx
        movq    %rdi, %rbx
        subq    $16, %rsp
.L4:
        vmovss  (%rbx), %xmm0
        leaq    8(%rsp), %rsi
        addq    $4, %rbx
        addq    $4, %rbp
        leaq    12(%rsp), %rdi
        call    sincosf               // No vectorization
        vmovss  12(%rsp), %xmm0
        vaddss  8(%rsp), %xmm0, %xmm0
        vmovss  %xmm0, -4(%rbp)
        cmpq    %rbx, %r12
        jne     .L4
        addq    $16, %rsp
        popq    %rbx
        popq    %rbp
        popq    %r12
.L10:
        ret

I see two good alternatives. Either call a vectorized version of sincosf or call the vectorized sin and cos sequentially. I tried adding -fno-builtin-sincos to no avail. -fopt-info-vec-missed complains about complex float, which there is none.

Is this a known issue with gcc? Either way, is there a way I can convince gcc to vectorize the latter example?

(As an aside, is there any way to get gcc < 6 to vectorize trigonometric functions automatically?)

Eugine answered 20/9, 2016 at 9:54 Comment(12)
If you have the function for sin, could you not write the one for cos by changing x to pi/2 -x, expanding pi/2 to a vector if needed?Fazio
Also if you are actually after sin plus cos, you could use sqrt(2)sin(x plus pi/4)Fazio
I'm not actually interested in sin+cos. I am kind of interested in exp(complex). But the question was meant to be more general than a single use case. In the end, I would like to be able to rely on multiple compilers (nvcc, gcc, cl) to compile the same code efficiently without writing the optimizations manually.Eugine
@JeremyKahan And yes, I could write sin(x+c), I just don't want to. The code remains easier to read when we say what we mean. In the end, that's the real crux of the issue.Eugine
hello, I always wondered what was the name of "__" in code, need to know the name to search for that.Silken
how do you know that _ZGVcN8v_sinf is a vectorized function and sincosf is not?Leatherback
@phuclv: because sincosf is the standard libm function, and _ZGVcN8v_sinf is how libmvec mangles SIMD type names: sourceware.org/glibc/wiki/libmvec mentions _ZGVdN4v_cos as AVX2 cos() (of 4x double).Cerous
@phuclv: more obviously, look at the surrounding asm using ymm0 as a return value vs. the low element of xmm0. Anyway yeah, still a missed optimization with current gcc9.1. Using either sinf() or cosf() alone vectorizes, so it would certainly be possible to spill a vector of sinf results across a call to a vectorized cosf. IDK if maybe there's no vectorized sincosf and that optimization (of combining sin and cos into sincos) happens first, defeating vectorization.Cerous
And sincosf returning by reference is a stupid limitation imposed by C only allowing functions to return 1 value; a vectorized sincosf (if/when one exists) should hopefully just return in ymm0 and ymm1. There should be a scalar one that returns in registers, too. ISO C's sincos()'s design of returning both values by reference is also dumb; at least return one of them by value.Cerous
ICC vectorizes this with SVML __svml_sincosf8_l9. gcc.godbolt.org/z/lvKJsJ. IIRC, you can tell gcc that SVML is available, if you have it installed. Clang vectorizes around unpacking to scalar for sincosf, which is terrible. (And also doesn't properly vectorize plain sinf(), so maybe it needs options to tell it that libmvec is available.) Probably worth investigating.Cerous
gcc.gnu.org/bugzilla/show_bug.cgi?id=70901 (gcc.gnu.org/bugzilla/show_bug.cgi?id=40770)Munger
@MarcGlisse this should be a proper answer.Pacha
T
0

This seems to be limitation of GCC, as noted in the comments. The linked bugs are still open at the time of writting:

The second bug was created in 2009 and is still open, I guess this is non-trivial to implement.

Trembles answered 29/8, 2024 at 12:0 Comment(0)

© 2022 - 2025 — McMap. All rights reserved.