I had the same issue, and so I benchmarked it. The result was that, depending upon what you are doing, your compiler may be better at optimising than you are, even if it doesn't use the intrinsic `sincos` function.
I wrote a small test program to compare three approaches: the `sincos` intrinsic, `std::sin` with `std::cos`, and `std::sin` with cos calculated from `sqrt(1-sin*sin)`.
The test involved generating 1e8 random numbers from 0 to `2*M_PI`. Each test calculated sin and cos for each random number, summing the values and then outputting the sum to stdout - this ensured the whole program wasn't optimised away. I compiled with `O3` and `fp:fast`.
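For reference, the harness looked roughly like this (a minimal sketch under my own naming; the timing code and the other two variants are elided):

```cpp
#define _USE_MATH_DEFINES // for M_PI on MSVC
#include <cmath>
#include <cstdio>
#include <random>
#include <vector>

int main() {
    const std::size_t N = 100000000; // 1e8 samples
    std::vector<double> inputs(N);

    // Fill with uniform random angles in [0, 2*pi).
    std::mt19937_64 rng(42);
    std::uniform_real_distribution<double> dist(0.0, 2.0 * M_PI);
    for (auto& x : inputs) x = dist(rng);

    // Naive version: sum sin and cos for every input.
    double sum = 0.0;
    for (std::size_t i = 0; i < N; ++i)
        sum += std::sin(inputs[i]) + std::cos(inputs[i]);

    std::printf("%f\n", sum); // print the sum to defeat dead-code elimination
}
```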
Using `sqrt(1-sin*sin)` was by far the slowest. This was because I needed an `if` statement to check the sign of the result, which meant the loop could not be vectorised.
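The branchy version looked something like this (a sketch; `inputs` and `N` as in the harness above). It is the sign fix-up that defeats vectorisation:

```cpp
double sum1 = 0.0;
for (std::size_t i = 0; i < N; ++i) {
    double s = std::sin(inputs[i]);
    double c = std::sqrt(1.0 - s * s);  // gives |cos| only; sqrt is never negative
    if (inputs[i] > M_PI / 2.0 && inputs[i] < 3.0 * M_PI / 2.0)
        c = -c;                         // cos < 0 in quadrants 2 and 3
    sum1 += s + c;
}
```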
The other options were similar speeds. Initially I had created a `fastSinCos` function that accepted 4 doubles and returned 4 doubles, and I then added the 4 doubles to the sum. This was slower than just using `sum += std::sin(inputs[i]) + std::cos(inputs[i])`. It turned out the compiler had vectorised the sum in the naive implementation, so it was beating me this way.
When I modified my code to create a `fastSinCosSum` function where the sums were vectorised, I managed to beat the naive version, but only by 10%.
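The vectorised-sum version looked roughly like this (a sketch of the idea, not the exact code; it assumes an SVML-style `_mm256_sincos_pd`, which Intel's compiler provides, and other toolchains would need their own 4-wide sincos):

```cpp
#include <immintrin.h>

// Sketch: process 4 doubles at a time and keep 4 partial sums in a
// ymm register. _mm256_sincos_pd returns the 4 sines and writes the
// 4 cosines through the pointer.
inline void fastSinCosSum(const double* in, __m256d& sum) {
    __m256d x = _mm256_loadu_pd(in);      // load 4 angles
    __m256d c;
    __m256d s = _mm256_sincos_pd(&c, x);  // 4 sines and 4 cosines at once
    sum = _mm256_add_pd(sum, _mm256_add_pd(s, c));
}
```

In the loop it was called as `fastSinCosSum(&inputs[i], r_sum3)` with `i` stepping by 4, as in the first listing below; the four partial sums are only reduced to a scalar once, after the loop.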
If I restricted the range of the inputs to `M_PI/2.0` to `3.0*M_PI/2.0`, so that I knew the result of cos was always negative, then the speed was identical to the naive version.
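With the inputs so restricted, the sign fix-up disappears and the loop body becomes the straight-line code the compiler can vectorise (this is the `sum2a` line in the third listing below):

```cpp
for (std::size_t i = 0; i < N; ++i) {
    double sin = std::sin(inputs[i]);        // shadows ::sin, as in the listing
    sum2a += sin - std::sqrt(1 - sin * sin); // cos known negative, so no branch
}
```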
As 1e8 doubles is bigger than my cache, I suspect cache misses could be the actual bottleneck. However, even then, the test only took around a second to run, so it seems daft to worry about it.
So in the end, unless a 10% gain in the most idealised setting matters to you, I suspect you are better off ensuring the compiler can vectorise rather than trying to use the intrinsic functions. The generated assembly for the three options is shown below.
```
fastSinCosSum(&inputs[i], r_sum3); // pass in the address of the first of the 4 elements to use and a register to store 4 sums
00007FF7FC321460 vmovupd ymm0,ymmword ptr [r14+rdi*8]
00007FF7FC321466 call __vdecl_sincos4 (07FF7FC322AD0h)
00007FF7FC32146B vaddpd ymm0,ymm0,ymmword ptr [r_sum3]
00007FF7FC321470 vaddpd ymm0,ymm0,ymm1
00007FF7FC321474 vmovupd ymmword ptr [r_sum3],ymm0
```
```
sum2 += std::sin(inputs[i]) + std::cos(inputs[i]); // just calculate naively
00007FF7FC321305 vmovupd ymm0,ymmword ptr [r14+rbx*8]
00007FF7FC32130B call __vdecl_cos4 (07FF7FC322A80h)
00007FF7FC321310 vmovupd ymmword ptr [rbp+60h],ymm0
00007FF7FC321315 vmovupd ymm0,ymmword ptr [r14+rbx*8]
00007FF7FC32131B call __vdecl_sin4 (07FF7FC322AA0h)
00007FF7FC321320 vaddpd ymm1,ymm0,ymmword ptr [rbp+60h]
00007FF7FC321325 vaddpd ymm0,ymm1,ymmword ptr [rbp+20h]
00007FF7FC32132A vmovupd ymmword ptr [rbp+20h],ymm0
```
```
double sin = std::sin(inputs[i]);
00007FF607CD1363 vmovupd ymm0,ymmword ptr [r14+rbx*8]
00007FF607CD1369 call __vdecl_sin4 (07FF607CD2A70h)
sum2a += sin - std::sqrt(1 - sin * sin); // calculate cos using sqrt. The angles are limited so we know the sign of the result is negative
00007FF607CD136E vmovupd ymm1,ymmword ptr [__ymm@3ff00000000000003ff00000000000003ff00000000000003ff0000000000000 (07FF607CD7480h)]
00007FF607CD1376 vfnmadd231pd ymm1,ymm0,ymm0
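; fused negative multiply-add: ymm1 = ymm1 - ymm0*ymm0, i.e. 1.0 - sin*sin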
00007FF607CD137B vsqrtpd ymm1,ymm1
00007FF607CD137F vsubpd ymm0,ymm0,ymm1
00007FF607CD1383 vaddpd ymm1,ymm0,ymmword ptr [rbp+60h]
00007FF607CD1388 vmovupd ymmword ptr [rbp+60h],ymm1
```
Comments:

Calculating `sin()` and then `cos()` via `sqrt(1-sin^2)`, as OP proposes, is numerically more stable than calculating `cos()` and then `sin()` via `sqrt(1-cos^2)`. Good that it is not suggested. For small angles the difference is apparent. – Rhapsodic

For small x, doing Sine(x) as `y=sin(x)` and Cosine(x) as `sqrt(1-y*y)` yields sin, cos of `x` and `1.0`. Doing Cosine(x) as `y=cos(x)` and Sine(x) as `sqrt(1-y*y)` yields sin, cos of **0.0** and `1.0`, a total loss of precision in sine. As x grows the issue becomes less, until x is about 1.0. Number theory could follow, but this deserves its own question. – Rhapsodic

`sqrt(1-sin^2)` is a poor approximation to `cos` near pi/2. In gnuplot: `plot [0:1e-7] sqrt(1-sin(pi/2-x)**2), cos(pi/2-x)`. No? You seem to be asserting otherwise, but I didn't follow your argument. – Sorosis

You said `sqrt(1-sin^2)` isn't a good way to compute cos near pi/2, and I have that same impression. So I made this plot, which seems to confirm our impression that `sqrt(1-sin^2)` isn't very good for arguments near pi/2, and I'm asking you if you agree. – Sorosis

`1-pow(sin(pi/2-x),2)` suffers severe loss of precision for `x` near pi/2. Certainly OK for coarse answers, yet not of high quality. – Rhapsodic

`sin(near pi)` problems are like `cos(near pi/2)`. – Rhapsodic

Comparing `sin` with `sqrt(1-sin^2)` vs `cos` with `sqrt(1-cos^2)`: if I understand correctly, what TQP proposed earlier was exactly correct: the former is better when closer to zero, and the latter is better when closer to pi/2. – Sorosis
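To make the precision point concrete, here is a quick check (my own sketch, not from the thread) comparing `sqrt(1 - sin*sin)` with `std::cos` near 0 and near pi/2:

```cpp
#define _USE_MATH_DEFINES // for M_PI on MSVC
#include <cmath>
#include <cstdio>

int main() {
    const double xs[] = { 1e-8, M_PI / 2.0 - 1e-8 };
    for (double x : xs) {
        double s = std::sin(x);
        double viaSqrt = std::sqrt(1.0 - s * s); // cos recovered from sin
        // Near 0 this matches std::cos to full precision. Near pi/2,
        // sin(x) rounds to exactly 1.0, so 1 - s*s cancels to 0 and
        // every significant digit of cos is lost.
        std::printf("x=%.10g  cos=%.17g  sqrt(1-sin^2)=%.17g\n",
                    x, std::cos(x), viaSqrt);
    }
}
```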