Calling fsincos instruction in LLVM slower than calling libc sin/cos functions?

2

17

I am working on a language that is compiled with LLVM. Just for fun, I wanted to do some microbenchmarks. In one, I run a hundred million sin/cos computations in a loop. In pseudocode, it looks like this:

var x: Double = 0.0
for (i <- 0 to 100 000 000)
  x = sin(x)^2 + cos(x)^2
return x.toInteger

If I'm computing sin/cos using LLVM IR inline assembly in the form:

%sc = call { double, double } asm "fsincos", "={st(1)},={st},1,~{dirflag},~{fpsr},~{flags}" (double %"res") nounwind

this is faster than using fsin and fcos separately. However, it is slower than calling the llvm.sin.f64 and llvm.cos.f64 intrinsics separately, which compile to calls to the C math library functions, at least with the target settings I'm using (x86_64 with SSE enabled).
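
The same single-instruction idea can be sketched in C with GCC/Clang extended inline assembly, which may be easier to experiment with than raw LLVM IR. The name `x87_sincos` is hypothetical, and the libm fallback on non-x86 targets is an assumption added so the snippet builds everywhere.

```c
#include <math.h>

/* Sketch: sin(x) and cos(x) from one x87 fsincos on x86 with GCC/Clang.
   On other targets or compilers it falls back to libm (an assumption
   made only to keep the example portable). */
void x87_sincos(double x, double *s, double *c)
{
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__i386__) || defined(__x86_64__))
    long double sn, cs;
    /* fsincos replaces ST(0) with the sine, then pushes the cosine,
       so afterwards ST(0) = cos and ST(1) = sin.
       Constraint "t" is ST(0), "u" is ST(1); "0" ties the input to ST(0). */
    __asm__ ("fsincos" : "=t"(cs), "=u"(sn) : "0"((long double)x));
    *s = (double)sn;
    *c = (double)cs;
#else
    *s = sin(x);
    *c = cos(x);
#endif
}
```

The output order mirrors the `={st(1)},={st}` constraints in the IR line above: one result comes from ST(1) and the other from the top of the stack.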

It seems LLVM inserts some conversions between single/double precision FP -- that might be the culprit. Why is that? Sorry, I'm a relative newbie at assembly:

    .globl  main
    .align  16, 0x90
    .type   main,@function
main:                                   # @main
    .cfi_startproc
# BB#0:                                 # %loopEntry1
    xorps   %xmm0, %xmm0
    movl    $-1, %eax
    jmp     .LBB44_1
    .align  16, 0x90
.LBB44_2:                               # %then4
                                    #   in Loop: Header=BB44_1 Depth=1
    movss   %xmm0, -4(%rsp)
    flds    -4(%rsp)
    #APP
    fsincos
    #NO_APP
    fstpl   -16(%rsp)
    fstpl   -24(%rsp)
    movsd   -16(%rsp), %xmm0
    mulsd   %xmm0, %xmm0
    cvtsd2ss        %xmm0, %xmm1
    movsd   -24(%rsp), %xmm0
    mulsd   %xmm0, %xmm0
    cvtsd2ss        %xmm0, %xmm0
    addss   %xmm1, %xmm0
.LBB44_1:                               # %loop2
                                    # =>This Inner Loop Header: Depth=1
    incl    %eax
    cmpl    $99999999, %eax         # imm = 0x5F5E0FF
    jle     .LBB44_2
# BB#3:                                 # %break3
    cvttss2si       %xmm0, %eax
    ret
.Ltmp160:
    .size   main, .Ltmp160-main
    .cfi_endproc

Same test with calls to llvm sin/cos intrinsics:

    .globl  main
    .align  16, 0x90
    .type   main,@function
main:                                   # @main
    .cfi_startproc
# BB#0:                                 # %loopEntry1
    pushq   %rbx
.Ltmp162:
    .cfi_def_cfa_offset 16
    subq    $16, %rsp
.Ltmp163:
    .cfi_def_cfa_offset 32
.Ltmp164:
    .cfi_offset %rbx, -16
    xorps   %xmm0, %xmm0
    movl    $-1, %ebx
    jmp     .LBB44_1
    .align  16, 0x90
.LBB44_2:                               # %then4
                                    #   in Loop: Header=BB44_1 Depth=1
    movsd   %xmm0, (%rsp)           # 8-byte Spill
    callq   cos
    mulsd   %xmm0, %xmm0
    movsd   %xmm0, 8(%rsp)          # 8-byte Spill
    movsd   (%rsp), %xmm0           # 8-byte Reload
    callq   sin
    mulsd   %xmm0, %xmm0
    addsd   8(%rsp), %xmm0          # 8-byte Folded Reload
.LBB44_1:                               # %loop2
                                    # =>This Inner Loop Header: Depth=1
    incl    %ebx
    cmpl    $99999999, %ebx         # imm = 0x5F5E0FF
    jle     .LBB44_2
# BB#3:                                 # %break3
    cvttsd2si       %xmm0, %eax
    addq    $16, %rsp
    popq    %rbx
    ret
.Ltmp165:
    .size   main, .Ltmp165-main
    .cfi_endproc

Can you suggest what the ideal assembly would look like with fsincos? PS. Adding -enable-unsafe-fp-math to llc makes the conversions disappear and switches to doubles (fldl etc.), but the speed remains the same.

    .globl  main
    .align  16, 0x90
    .type   main,@function
main:                                   # @main
    .cfi_startproc
# BB#0:                                 # %loopEntry1
    xorps   %xmm0, %xmm0
    movl    $-1, %eax
    jmp     .LBB44_1
    .align  16, 0x90
.LBB44_2:                               # %then4
                                    #   in Loop: Header=BB44_1 Depth=1
    movsd   %xmm0, -8(%rsp)
    fldl    -8(%rsp)
    #APP
    fsincos
    #NO_APP
    fstpl   -24(%rsp)
    fstpl   -16(%rsp)
    movsd   -24(%rsp), %xmm1
    mulsd   %xmm1, %xmm1
    movsd   -16(%rsp), %xmm0
    mulsd   %xmm0, %xmm0
    addsd   %xmm1, %xmm0
.LBB44_1:                               # %loop2
                                    # =>This Inner Loop Header: Depth=1
    incl    %eax
    cmpl    $99999999, %eax         # imm = 0x5F5E0FF
    jle     .LBB44_2
# BB#3:                                 # %break3
    cvttsd2si       %xmm0, %eax
    ret
.Ltmp160:
    .size   main, .Ltmp160-main
    .cfi_endproc
Helio answered 18/9, 2012 at 21:18 Comment(4)
Hmm.. I think I'm starting to get it. fsin/fcos/fsincos use x87 registers and mulsd addsd use MMX / SSE. So the overhead is from moving the data between them probably? - Helio
No, cvtsd2ss is a conversion from double to float. But stay away from legacy coprocessor instructions, they are slower and more imprecise than library routines nowadays. See for instance gcc.gnu.org/ml/gcc/2012-02/msg00188.html - Truc
And yes, there is additional overhead from moving, but it doesn't amount to much compared to the 200-300 cycles fsincos uses. - Truc
Thanks, I guess I'll stick with the llvm sin/cos intrinsics then. - Helio
23

Hardware trig is slow.

Too many documents claim that x87 instructions like fsin or fsincos are the fastest way to do trigonometric functions. Those claims are often wrong.

The fastest way depends on your CPU. As CPUs become faster, old hardware trig instructions like fsin have not kept pace. With some CPUs, a software function, using a polynomial approximation for sine or another trig function, is now faster than a hardware instruction.

In short, fsincos is too slow.

Hardware trig is obsolete.

There is enough evidence that the x86-64 platform has moved away from hardware trig.

  • amd64 prefers SSE over x87 for floats. Yet, SSE has no equivalents for x87 instructions like fsin.
  • For amd64, libm in both FreeBSD and glibc implement sin() and such functions in software, not with x87 trig. glibc has optimized x86-64 assembly for sinf() (the single-precision sine) with a polynomial approximation, not with x87 fsin. NetBSD and OpenBSD made the opposite choice: their libm for amd64 does use x87 instructions.
  • Steel Bank Common Lisp uses fsin in its x86 backend but not in its x86-64 backend. For x86-64, SBCL compiles code that calls sin() in libm.

Hardware trig loses the race.

I timed hardware and software sine on an AMD Phenom II X2 560 (3.3 GHz) from 2010. I wrote a C program with this loop:

volatile double a, s;
/* ... */
for (i = 0; i < 100000000; i++)
        s = sin(a);

I compiled this program twice, with two different implementations of sin(). The hard sin() uses x87 fsin. The soft sin() uses a polynomial approximation. My C compiler, gcc -O2, did not replace my sin() call with an inline fsin.

Here are results for sin(0.5):

$ time race-hard 0.5
    0m3.40s real     0m3.40s user     0m0.00s system
$ time race-soft 0.5
    0m1.13s real     0m1.15s user     0m0.00s system

Here soft sin(0.5) is so fast that this CPU could compute both soft sin(0.5) and soft cos(0.5) in less time than one x87 fsin.

And for sin(123):

$ time race-hard 123
    0m3.61s real     0m3.62s user     0m0.00s system
$ time race-soft 123
    0m3.08s real     0m3.07s user     0m0.01s system

Soft sin(123) is slower than soft sin(0.5) because 123 is too large for the polynomial, so the function must subtract some multiple of 2π. If I also want cos(123), there is a chance that x87 fsincos would be faster than soft sin(123) and soft cos(123), for this CPU from 2010.

Fullblown answered 28/6, 2014 at 20:44 Comment(1)
I confirm: even on my ageing Intel Xeon E5420, a million fsincos assembly instructions take 644 ms, against 101 ms for System.Math.Sin + System.Math.Cos. - Cullum
1

fsincos is an x87 FPU instruction that operates on 80-bit extended-precision floats. It cannot be autovectorized, but it works at a higher precision than 64-bit double instructions.

sin and cos operate on 64-bit doubles, so the lower precision alone already makes them faster. Code that runs on the x87 FPU (the 80-bit long double type) is never autovectorized because that is not supported, whereas regular 64-bit code (up to the double type) is, which can make it several times faster with SSE, AVX, NEON, etc.

The FPU should be used only when you actually need 80-bit precision. Calling it obsolete is not completely accurate: it is obsolete in 99% of cases and still needed in the remaining 1%.

To see the compiler generate fsin and fcos, use the long double type (80-bit float) with the sinl and cosl functions.

Ritz answered 28/7, 2023 at 17:44 Comment(0)
