What is the fastest way to compute sin and cos together?

Asked 21/4, 2010 at 14:5 Answered 30/6, 2019 at 10:26

112

I would like to compute both the sine and co-sine of a value together (for example to create a rotation matrix). Of course I could compute them separately one after another like a = cos(x); b = sin(x);, but I wonder if there is a faster way when needing both values.

Edit: To summarize the answers so far:

Vlad said, that there is the asm command FSINCOS computing both of them (in almost the same time as a call to FSIN alone)
Like Chi noticed, this optimization is sometimes already done by the compiler (when using optimization flags).
caf pointed out, that functions sincos and sincosf are probably available and can be called directly by just including math.h
tanascius approach of using a look-up table is discussed controversial. (However on my computer and in a benchmark scenario it runs 3x faster than sincos with almost the same accuracy for 32-bit floating points.)
Joel Goodwin linked to an interesting approach of an extremly fast approximation technique with quite good accuray (for me, this is even faster then the table look-up)

Lifesize answered 21/4, 2010 at 14:5 Comment(4)

See also this question about native implementation of sin/cos: stackoverflow.com/questions/1640595 – Heretical 21/4, 2010 at 15:29

try sinx ~ x-x^3/6 and cosx~1-x^2/4 as approximations if you care about speed more than accuracy. You can add on terms in either series as you put more weight on accuracy (en.wikipedia.org/wiki/Taylor_series scroll down to trig taylor series.) Note this is a general way to approximate any function you want that is differntiable n times. So if you have some bigger function that that sine's and cosine's belong to you will get a much bigger speed up if you approximate it instead of the sin,cos's independently. – Trinitytrinket 23/4, 2010 at 3:43

This is poor technique with very poor accuracy. See post by Joel Goodwin. Taylor series have been posted below. Please post it as an answer. – Lifesize 23/4, 2010 at 9:50

Well it depends on your requirements, if you want accuracy Taylor series will be a good approximation only if you need values of x close to some point x_0, then expand your Taylor series around x_0 instead of 0. This will give you excellent accuracy near x_0 but the farther you go the worse the results. You probably thought the accuracy sucks cause as you looked at the given asnwer and tried it for values far from 0. That answer is with sin,cos expanded around 0. – Trinitytrinket 23/4, 2010 at 16:45

Modern Intel/AMD processors have instruction FSINCOS for calculating sine and cosine functions simultaneously. If you need strong optimization, perhaps you should use it.

Here is a small example: http://home.broadpark.no/~alein/fsincos.html

Here is another example (for MSVC): http://www.codeguru.com/forum/showthread.php?t=328669

Here is yet another example (with gcc): http://www.allegro.cc/forums/thread/588470

Hope one of them helps. (I didn't use this instruction myself, sorry.)

As they are supported on processor level, I expect them to be way much faster than table lookups.

Edit:
Wikipedia suggests that FSINCOS was added at 387 processors, so you can hardly find a processor which doesn't support it.

Edit:
Intel's documentation states that FSINCOS is just about 5 times slower than FDIV (i.e., floating point division).

Edit:
Please note that not all modern compilers optimize calculation of sine and cosine into a call to FSINCOS. In particular, my VS 2008 didn't do it that way.

Edit:
The first example link is dead, but there is still a version at the Wayback Machine.

Barrios answered 21/4, 2010 at 14:17 Comment(13)

Do you have any reference for this or a snippet how to call it in C or C#? – Lifesize 21/4, 2010 at 14:20

Are compilers smart enough to use this instruction in case you write: x=cos(a);y=sin(a); That would be great. – Raasch 21/4, 2010 at 14:41

@phkahler: That would be great. Don't know if such optimization is used by the modern compilers. – Barrios 21/4, 2010 at 14:59

Do note that if you don't need a full 80 bit double extended result (just double or single precision, for example), it is possible to compute cos and sin much faster in software than with the fsincos hardware instruction. On the other hand, it's also (much) more work to implement that way. – Scammon 21/4, 2010 at 15:8

@Stephen: which way is it faster to do in software? As Intel's manual states, the FSINCOS instruction itself is quite fast. – Barrios 21/4, 2010 at 15:11

The fsincos instruction is not "quite fast". Intel's own optimization manual quotes it as requiring between 119 and 250 cycles on recent micro-architectures. Intel's math library (distributed with ICC), by comparison, can separately compute sin and cos in less than 100 cycles, using a software implementation that uses SSE instead of the x87 unit. A similar software implementation that computed both simultaneously could be faster still. – Scammon 21/4, 2010 at 15:50

@Stephen: could you please post the assembler code? I suppose the code uses some version of built-in sin computation (perhaps SSE-specific), and that sincos is available as well. – Barrios 21/4, 2010 at 16:8

@Stephen: and ~200 cycles is really fast: FDIV requires 40. – Barrios 21/4, 2010 at 16:10

@Vlad: The ICC math libraries are not open-source, and I don't have a license to redistribute them, so I can't post the assembly. I can tell you that there is no built-in sin computation for them to take advantage of, however; they use the same SSE instructions as everyone else. To your second comment, the speed relative to fdiv is immaterial; if there are two ways to do something and one is twice as fast as the other, it doesn't make sense to call the slower one "fast", regardless of how long it takes relative to some completely unrelated task. – Scammon 21/4, 2010 at 16:23

@Stephen: I've looked through the Intel's manual; they mention the precision of the functions separately. Without the assembly code I would assume that they achieve speed by sacrificing the accuracy of computation. Could you confirm or disprove this assumption? – Barrios 21/4, 2010 at 16:29

The software sin function in their library delivers full double-precision accuracy. The fsincos instruction delivers somewhat more accuracy (double extended), but that extra accuracy gets thrown away in most programs that call the sin function, as its result is usually rounded to double precision by later arithmetic operations or a store to memory. In most situations, they deliver the same accuracy for practical use. – Scammon 21/4, 2010 at 17:19

Note also that fsincos is not a complete implementation on its own; you need an additional range reduction step to put the argument into the valid input range for the fsincos instruction. The library sin and cos functions include this reduction as well as the core computation, so they are even faster (by comparison) than the cycle timings that I listed might indicate. – Scammon 21/4, 2010 at 17:22

You don't have to perform argument reduction on sane fsincos operands, but using out of range values will make it take more cycles, and will unavoidably reduce accuracy. The input has to be in the range −(2^63) to +(2^63), pretty far outside +/- PI. But anyway, you should use SSE, not fsincos. – Ferryboat 7/7, 2022 at 17:10

Modern x86 processors have a fsincos instruction which will do exactly what you're asking - calculate sin and cos at the same time. A good optimizing compiler should detect code which calculates sin and cos for the same value and use the fsincos command to execute this.

It took some twiddling of compiler flags for this to work, but:

$ gcc --version
i686-apple-darwin9-gcc-4.0.1 (GCC) 4.0.1 (Apple Inc. build 5488)
Copyright (C) 2005 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

$ cat main.c
#include <math.h> 

struct Sin_cos {double sin; double cos;};

struct Sin_cos fsincos(double val) {
  struct Sin_cos r;
  r.sin = sin(val);
  r.cos = cos(val);
  return r;
}

$ gcc -c -S -O3 -ffast-math -mfpmath=387 main.c -o main.s

$ cat main.s
    .text
    .align 4,0x90
.globl _fsincos
_fsincos:
    pushl   %ebp
    movl    %esp, %ebp
    fldl    12(%ebp)
    fsincos
    movl    8(%ebp), %eax
    fstpl   8(%eax)
    fstpl   (%eax)
    leave
    ret $4
    .subsections_via_symbols

Tada, it uses the fsincos instruction!

Almost answered 21/4, 2010 at 14:46 Comment(5)

This is cool! Could you explain what -mfpmath=387 is doing? And does it also work with MSVC? – Lifesize 21/4, 2010 at 14:51

Note that -ffast-math and -mfpmath lead to different results in some cases. – Tache 21/4, 2010 at 15:6

mfpmath=387 will force gcc to use x87 instructions instead of SSE instructions. I suspect MSVC has similar optimizations and flags, but i don't have MSVC handy to be sure. Using x87 instructions will likely be a detriment to performance in other code though, you should also look at my other answer, to use Intel's MKL. – Almost 21/4, 2010 at 15:18

My old gcc 3.4.4 from cygwin produces 2 separate calls to fsin and fcos. :-( – Barrios 21/4, 2010 at 15:19

Tried with Visual Studio 2008 with highest optimizations enabled. It calls 2 library functions __CIsin and __CIcos. – Barrios 21/4, 2010 at 15:24

When you need performance, you could use a precalculated sin/cos table (one table will do, stored as a Dictionary). Well, it depends on the accuracy you need (maybe the table would be too big), but it should be really fast.

Millhon answered 21/4, 2010 at 14:8 Comment(5)

Then the input value needs to be mapped to [0,2*pi] (or smaller with additional checks) and this call to fmod eats away performance. In my (propably suboptimal) implementation I could not gain performance with the look-up table. Would you have any advice here? – Lifesize 21/4, 2010 at 14:25

A precomputed table will almost certainly be slower than just calling sin because the precomputed table will trash the cache. – Thermopile 21/4, 2010 at 14:25

It depends how big the table is. A 256-entry table is often quite accurate enough and uses only 1Kb... if you use it a lot wouldn't it get stuck in the cache without adversely affecting the rest of the app's performance? – Pairoar 21/4, 2010 at 15:16

@Danvil: Here is an example of a sine lookup table en.wikipedia.org/wiki/Lookup_table#Computing_sines. However it assumes that you already mapped your input to [0;2pi], too. – Millhon 21/4, 2010 at 15:48

@AndreasBrinck I wouldn't go that far. It Depends(TM). Modern caches are huge and lookup tables are small. Quite often if you take a bit of care in memory layout your lookup table need not make any difference to the cache utilisation of the rest of your computation. The fact that the lookup table fits inside the cache is one of the reasons it is so fast. Even in Java where it's difficult to control mem layout precisely, I've had massive performance wins with lookup tables. – Foiled 23/10, 2012 at 7:25

Technically, you’d achieve this by using complex numbers and Euler’s Formula. Thus, something like (C++)

complex<double> res = exp(complex<double>(0, x));
// or equivalent
complex<double> res = polar<double>(1, x);
double sin_x = res.imag();
double cos_x = res.real();

should give you sine and cosine in one step. How this is done internally is a question of the compiler and library being used. It could (and might) well take longer to do it this way (just because Euler’s Formula is mostly used to compute the complex exp using sin and cos – and not the other way round) but there might be some theoretical optimisation possible.

Edit

The headers in <complex> for GNU C++ 4.2 are using explicit calculations of sin and cos inside polar, so it doesn’t look too good for optimisations there unless the compiler does some magic (see the -ffast-math and -mfpmath switches as written in Chi’s answer).

Tache answered 21/4, 2010 at 14:22 Comment(1)

sorry, but Euler's Formula doesn't actually tell you how to compute something, it's just an identity (albeit a very useful one) that relates complex exponentials to real trigonometric functions. There are benefits of calculating sine and cosine together, but they involve common subexpressions and your answer does not discuss this. – Castellan 29/6, 2017 at 2:42

You could compute either and then use the identity:

cos(x)² = 1 - sin(x)²

but as @tanascius says, a precomputed table is the way to go.

Supplejack answered 21/4, 2010 at 14:8 Comment(10)

And be aware that using this method involves computing a power and a square root, so if performance is important, make sure to verify that this is actually faster than computing the other trig function directly. – Roby 21/4, 2010 at 14:11

sqrt() is often optimized in hardware, so it may very well be faster then sin() or cos(). The power is just self multiplication, so don't use pow(). There are some tricks to get reasonably accurate square-roots very quickly without hardware support. Lastly, be sure to profile before doing any of this. – Cajolery 21/4, 2010 at 14:23

This is an interesting approach, though the additional square root and checks for the right sign, could reduce the performance. Any concrete advice here? – Lifesize 21/4, 2010 at 14:26

Note that √(1 - cos^2 x) is less accurate than computing sin x directly, in particular when x ~ 0. – Evocator 21/4, 2010 at 14:37

For small x, the Taylor series for y=sqrt(1-x*x) is very nice. You can get good accuracy with the first 3 terms and it only requires a few multiplies and one shift. I've used it in fixed point code. – Raasch 21/4, 2010 at 14:46

@phkahler: Your Taylor series doesn't apply because when x ~ 0, cos x ~ 1. – Evocator 21/4, 2010 at 15:38

@Kenny - I use it for sin values up to about 0.4 where cos is not really 1. – Raasch 21/4, 2010 at 15:44

"compute either and then use the identity" --> better to compute sin() first to retain precision. – Acquaint 2/7, 2019 at 13:29

Would it be easier to compute tan(x/2), then find cos(x) = (1-tan^2(x/2))/(1+tan^2(x/2)) and sin(x) = 2tan(x/2)/(1+tan^2(x/2)) ? – Wilser 8/4, 2022 at 0:42

easier where? ..... – Supplejack 8/4, 2022 at 1:48

If you use the GNU C library, then you can do:

#define _GNU_SOURCE
#include <math.h>

and you will get declarations of the sincos(), sincosf() and sincosl() functions that calculate both values together - presumably in the fastest way for your target architecture.

Stow answered 21/4, 2010 at 23:54 Comment(0)

There is very interesting stuff on this forum page, which is focused on finding good approximations that are fast: http://www.devmaster.net/forums/showthread.php?t=5784

Disclaimer: Not used any of this stuff myself.

Update 22 Feb 2018: Wayback Machine is the only way to visit the original page now: https://web.archive.org/web/20130927121234/http://devmaster.net/posts/9648/fast-and-accurate-sine-cosine

Heretical answered 21/4, 2010 at 14:33 Comment(6)

I tried this one as well, and it gave me quite good performance. But sin and cos are computed independently. – Lifesize 21/4, 2010 at 14:44

My feeling is this sine/cosine calculation will be faster than getting sine and using a square root approximation to get cosine, but a test will verify that. The primary relationship between sine and cosine is one of phase; is it possible to code so you could re-use the sine values you calculate for the phase-shifted cosine calls by taking this into account? (This may be a stretch, but had to ask) – Heretical 21/4, 2010 at 14:52

Not directly (despite the question asking exactly this). I need sin and cos of a value x and there is no way to know if at some other place I coincidentally computed x+pi/2 ... – Lifesize 21/4, 2010 at 15:6

I used it in my game to draw a circle of particles. Since it's just a visual effect, the result is close enough, and perfomance is really impressive. – Wastage 9/1, 2015 at 14:2

I'm not impressed; Chebyshev approximations usually give you the most accuracy for a given performance. – Castellan 29/6, 2017 at 2:45

@EvanBenn looks indeed like the whole forum has been scrubbed so I've added a Wayback Machine link to the original – Heretical 22/2, 2018 at 9:17

Many C math libraries, as caf indicates, already have sincos(). The notable exception is MSVC.

Sun has had sincos() since at least 1987 (twenty-three years; I have a hard-copy man page)
HPUX 11 had it in 1997 (but isn't in HPUX 10.20)
Added to glibc in version 2.1 (Feb 1999)
Became a built-in in gcc 3.4 (2004), __builtin_sincos().

And regarding look-up, Eric S. Raymond in the Art of Unix Programming (2004) (Chapter 12) says explicitly this a Bad Idea (at the present moment in time):

"Another example is precomputing small tables--for example, a table of sin(x) by degree for optimizing rotations in a 3D graphics engine will take 365 × 4 bytes on a modern machine. Before processors got enough faster than memory to demand caching, this was an obvious speed optimization. Nowadays it may be faster to recompute each time rather than pay for the percentage of additional cache misses caused by the table.

"But in the future, this might turn around again as caches grow larger. More generally, many optimizations are temporary and can easily turn into pessimizations as cost ratios change. The only way to know is to measure and see." (from the Art of Unix Programming)

But, judging from the discussion above, not everyone agrees.

Platas answered 30/4, 2010 at 7:9 Comment(3)

"365 x 4 bytes". You need to account for leap years, so that should actually be 365.25 x 4 bytes. Or maybe he meant to use the number of degrees in a circle instead of the number of days in an earth year. – Smyth 15/5, 2012 at 3:51

@Wallacoloo: Nice observation. I missed it. But the error is in the original. – Platas 16/5, 2012 at 14:3

LOL. Plus, he neglects the fact that in many of the computer games of that area, you will only need a finite number of angles. There are no cache misses then, if you know the possible angles. I'd use tables exactly in this case, and give fsincos (CPU instruction!) a try for the others. It's often as fast as interpolating sin and cos from a large table. – Atypical 5/1, 2013 at 15:32

I don't believe that lookup tables are necessarily a good idea for this problem. Unless your accuracy requirements are very low the table needs to be very large. And modern CPUs can do a lot of computation while a value is fetched from main memory. This is not one of those questions which can be properly answered by argument (not even mine), test and measure and consider the data.

But I'd look to the fast implementations of SinCos that you find in libraries such as AMD's ACML and Intel's MKL.

Arrester answered 21/4, 2010 at 14:27 Comment(0)

If you are willing to use a commercial product, and are calculating a number of sin/cos calculations at the same time (so you can use vectored functions), you should check out Intel's Math Kernel Library.

(dead link) It has a sincos function

According to that documentation, it averages 13.08 clocks/element on core 2 duo in high accuracy mode, which i think will be even faster than fsincos.

Almost answered 21/4, 2010 at 15:10 Comment(1)

Similarly, on OSX one can use vvsincos or vvsincosf from the Accelerate.framework. I believe that AMD has similar functions in their vector library as well. – Scammon 21/4, 2010 at 21:28

This article shows how to construct a parabolic algorithm that generates both the sine and the cosine:

DSP Trick: Simultaneous Parabolic Approximation of Sin and Cos

http://www.dspguru.com/dsp/tricks/parabolic-approximation-of-sin-and-cos

Canarese answered 13/5, 2010 at 9:35 Comment(1)

hmmm... I need to do a shootout between this and Chebyshev approximation which I think would win. – Castellan 29/6, 2017 at 2:47

When performance is critical for this kind of thing it is not unusual to introduce a lookup table.

Zigzagger answered 21/4, 2010 at 14:9 Comment(0)

For a creative approach, how about expanding the Taylor series? Since they have similar terms, you could do something like the following pseudo:

numerator = x
denominator = 1
sine = x
cosine = 1
op = -1
fact = 1

while (not enough precision) {
    fact++
    denominator *= fact
    numerator *= x

    cosine += op * numerator / denominator

    fact++
    denominator *= fact
    numerator *= x

    sine += op * numerator / denominator

    op *= -1
}

This means you do something like this: starting at x and 1 for sin and cosine, follow the pattern - subtract x^2 / 2! from cosine, subtract x^3 / 3! from sine, add x^4 / 4! to cosine, add x^5 / 5! to sine...

I have no idea whether this would be performant. If you need less precision than the built in sin() and cos() give you, it may be an option.

Jefferson answered 21/4, 2010 at 14:44 Comment(3)

Actually the i-the sine extension factor is x/i times the i-the cosine extension factor. But I would doubt that using the Taylor series is really fast ... – Lifesize 21/4, 2010 at 14:48

Chebyshev is much better than Taylor for polynomial function approximation. Don't use Taylor approximation. – Egghead 28/3, 2017 at 8:28

There are a bunch of numerical faux pas here; numerator and denominator both quickly become large and that leads to floating-point errors. Not to mention how do you decide what "not enough precision" is and how to calculate it? Taylor approximation is good in the neighborhood around a single point; away from that point they quickly become inaccurate and require a large number of terms, which is why Timmmm's suggestion about Chebyshev approximation (which creates good approximations over a given interval) is a good one. – Castellan 29/6, 2017 at 2:52

There is a nice solution in the CEPHES library which can be pretty fast and you can add/remove accuracy quite flexibly for a bit more/less CPU time.

Remember that cos(x) and sin(x) are the real and imaginary parts of exp(ix). So we want to calculate exp(ix) to get both. We precalculate exp(iy) for some discrete values of y between 0 and 2pi. We shift x to the interval [0, 2pi). Then we select the y that is closest to x and write
exp(ix)=exp(iy+(ix-iy))=exp(iy)exp(i(x-y)).

We get exp(iy) from the lookup table. And since |x-y| is small (at most half the distance between the y-values), the Taylor series will converge nicely in just a few terms, so we use that for exp(i(x-y)). And then we just need a complex multiplication to get exp(ix).

Another nice property of this is that you can vectorize it using SSE.

Kattegat answered 4/11, 2011 at 15:7 Comment(0)

You may want to have a look at http://gruntthepeon.free.fr/ssemath/, which offers an SSE vectorized implementation inspired from CEPHES library. It has good accuracy (maximum deviation from sin/cos on the order of 5e-8) and speed (slightly outperforms fsincos on a single call basis, and a clear winner over multiple values).

Capillary answered 24/5, 2014 at 0:40 Comment(0)

I have posted a solution involving inline ARM assembly capable of computing both the sine and cosine of two angles at a time here: Fast sine/cosine for ARMv7+NEON

Shivery answered 9/5, 2010 at 13:14 Comment(0)

An accurate yet fast approximation of sin and cos function simultaneously, in javascript, can be found here: http://danisraelmalta.github.io/Fmath/ (easily imported to c/c++)

Pudens answered 15/9, 2013 at 20:14 Comment(0)

Have you thought of declaring lookup tables for the two functions? You'd still have to "calculate" sin(x) and cos(x), but it'd be decidedly faster, if you don't need a high degree of accuracy.

Pb answered 21/4, 2010 at 14:11 Comment(0)

The MSVC compiler may use the (internal) SSE2 functions

 ___libm_sse2_sincos_ (for x86)
 __libm_sse2_sincos_  (for x64)

in optimized builds if appropriate compiler flags are specified (at minimum /O2 /arch:SSE2 /fp:fast). The names of these functions seem to imply that they do not compute separate sin and cos, but both "in one step".

For example:

void sincos(double const x, double & s, double & c)
{
  s = std::sin(x);
  c = std::cos(x);
}

Assembly (for x86) with /fp:fast:

movsd   xmm0, QWORD PTR _x$[esp-4]
call    ___libm_sse2_sincos_
mov     eax, DWORD PTR _s$[esp-4]
movsd   QWORD PTR [eax], xmm0
mov     eax, DWORD PTR _c$[esp-4]
shufpd  xmm0, xmm0, 1
movsd   QWORD PTR [eax], xmm0
ret     0

Assembly (for x86) without /fp:fast but with /fp:precise instead (which is the default) calls separate sin and cos:

movsd   xmm0, QWORD PTR _x$[esp-4]
call    __libm_sse2_sin_precise
mov     eax, DWORD PTR _s$[esp-4]
movsd   QWORD PTR [eax], xmm0
movsd   xmm0, QWORD PTR _x$[esp-4]
call    __libm_sse2_cos_precise
mov     eax, DWORD PTR _c$[esp-4]
movsd   QWORD PTR [eax], xmm0
ret     0

So /fp:fast is mandatory for the sincos optimization.

But please note that

___libm_sse2_sincos_

is maybe not as precise as

__libm_sse2_sin_precise
__libm_sse2_cos_precise

due to the missing "precise" at the end of its name.

On my "slightly" older system (Intel Core 2 Duo E6750) with the latest MSVC 2019 compiler and appropriate optimizations, my benchmark shows that the sincos call is about 2.4 times faster than separate sin and cos calls.

Adelaadelaida answered 30/6, 2019 at 10:26 Comment(0)

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Recommended topics

Hot tags