Fast Arc Cos algorithm?

Asked 1/8, 2010 at 3:31 Answered 20/3 at 2:10

I have my own, very fast cos function:

float sine(float x)
{
    const float B = 4/pi;
    const float C = -4/(pi*pi);

    float y = B * x + C * x * abs(x);

    //  const float Q = 0.775;
    const float P = 0.225;

    y = P * (y * abs(y) - y) + y;   // Q * y + P * y * abs(y)


    return y;
}

float cosine(float x)
{
    return sine(x + (pi / 2));
}

But now when I profile, I see that acos() is killing the processor. I don't need intense precision.

What is a fast way to calculate acos(x) ?

Epicalyx answered 1/8, 2010 at 3:31 Comment(3)

Your very fast function has a mean error of 16% in [-pi,pi] and is entirely unusable outside that interval. The standard sinf from math.h on my system takes only about 2.5x as much time as your approximation. Considering your function is inlined and the lib call is not, this is really not much difference. My guess is if you added range reduction so it was usuable in the same way as the standard function, you would have exactly the same speed. – Holmgren 5/7, 2012 at 14:50

No, the maximum error is 0.001 (1/10th %). Did you forget to apply the correction? (y = P * bla...) Look at the original source and discussion here: devmaster.net/forums/topic/4648-fast-and-accurate-sinecosine Second, sin and cos pre-bounded by +-pi is a VERY common case, especially in graphics and simulation, both of which often require a fast approximate sin/cos. – Runesmith 18/10, 2012 at 20:15

This is a really intriguing problem, thanks for asking! – Graaf 30/7, 2019 at 19:56

A simple cubic approximation, the Lagrange polynomial for x ∈ {-1, -½, 0, ½, 1}, is:

double acos(x) {
   return (-0.69813170079773212 * x * x - 0.87266462599716477) * x + 1.5707963267948966;
}

It has a maximum error of about 0.18 rad.

Baryta answered 1/8, 2010 at 4:17 Comment(2)

Maximum error is 10.31 in degrees. Rather big, but in some solutions may be enough. Suitable where computational speed is more important than precision. May be quartic approximation would produce more precision and still be faster than native acos? – Companionway 2/2, 2013 at 17:22

Sure there isn't a mistake in this formular? Just tried it with Wolfram Alpha and it doesn't look right: wolframalpha.com/input/?i=y%3D%282%2F9*pixx-5*pi%2F18%29*x%2Bpi%2F2 – Crotchet 26/4, 2013 at 14:29

Got spare memory? A lookup table (with interpolation, if required) is gonna be fastest.

Blowtorch answered 1/8, 2010 at 3:36 Comment(4)

How could I implement this as a C function? – Epicalyx 1/8, 2010 at 3:42

@Jex: bounds-check your argument (it must be between -1 and 1). Then multiply by a nice power of 2, say 64, yielding the range (-64, 64). Add 64 to make it non-negative (0, 128). Use the integer part to index a lookup table, if desired use the fractional part for interpolation between the two closest entries. If you don't want interpolation, try adding 64.5 and take the floor, this is the same as round-to-nearest. – Singleminded 1/8, 2010 at 5:42

Lookup tables require an index, which is going to require a float to int conversion, which will probably kill performance. – Carthy 2/8, 2010 at 13:15

@phkahler: float-to-int conversion is very cheap on x86, almost as cheap as an FP add, as you can see in Agner Fog's latency/throughput/uop tables. Range-checking the index to make sure it doesn't index outside the table is probably about as expensive. int idx = x * 4096.0 will have ~ 9 cycle latency on Intel Haswell. By far the most expensive part of this is the cache miss from a decent-sized table. If there isn't a bunch of parallel computation that doesn't depend on the acos result, a large table will probably be slower (esp. with cache competition). – Present 27/3, 2016 at 11:58

nVidia has some great resources that show how to approximate otherwise very expensive math functions, such as: acos asin atan2 etc etc...

These algorithms produce good results when speed of execution is more important (within reason) than precision. Here's their acos function:

// Absolute error <= 6.7e-5
float acos(float x) {
  float negate = float(x < 0);
  x = abs(x);
  float ret = -0.0187293;
  ret = ret * x;
  ret = ret + 0.0742610;
  ret = ret * x;
  ret = ret - 0.2121144;
  ret = ret * x;
  ret = ret + 1.5707288;
  ret = ret * sqrt(1.0-x);
  ret = ret - 2 * negate * ret;
  return negate * 3.14159265358979 + ret;
}

And here are the results for when calculating acos(0.5):

nVidia:   result: 1.0471513828611643
math.h:   result: 1.0471975511965976

That's pretty close! Depending on your required degree of precision, this might be a good option for you.

Delao answered 25/9, 2014 at 4:19 Comment(7)

In contrast to what the comment in the "reference implementation" on the nvidia site says, the absolute error is not <= 6.7e-5 but I was able to observe an error of 6.759167e-05. Also, how sure are you that this function is actually faster than just plain acos? On a 4th generation core i5 (Haswell) the nvidia function was consistenly 25% slower than acos from math.h. – Wharton 2/10, 2016 at 9:30

Any reason to use 3.14159265358979 instead of math.pi ? – Reimpression 27/6, 2017 at 5:31

@Reimpression Python was irrelevant to my answer. My goal was just to point out to nVidia's docs for approximation methods as a resource. So I edited my answer for clarity. But to answer your question: my guess is that these numbers were chosen to work well together. So using math.pi might not seem to make much of a difference in most cases until you hit the edge case where the error threshold would worsen. – Delao 29/6, 2017 at 4:44

The reason for asking is there is a very small difference leading to the question: When porting this to any other language - should the constant for pi be used? Or is this an intentionally slightly modified pi, tuned to work better with the approximation? (can be tested without too much trouble of course) – Reimpression 30/6, 2017 at 0:6

what does float(x < 0) mean? How can I turn it to scala? – Ardra 1/7, 2017 at 10:5

@Ardra it coverts the true/false result of the x<0 expression into a float. So if x<0 is true it converts it to 1.0, if false: 0.0 – Delao 1/7, 2017 at 18:33

This approximation is within 10e-5 of the built-in acos() in a desktop opengl implementation I tried. Quite good! But it's also basically the same speed as the built-in acos() on a low end Android device I tested on. – Berber 4/12, 2017 at 23:40

I have my own. It's pretty accurate and sort of fast. It works off of a theorem I built around quartic convergence. It's really interesting, and you can see the equation and how fast it can make my natural log approximation converge here: https://www.desmos.com/calculator/yb04qt8jx4

Here's my arccos code:

function acos(x)
    local a=1.43+0.59*x a=(a+(2+2*x)/a)/2
    local b=1.65-1.41*x b=(b+(2-2*x)/b)/2
    local c=0.88-0.77*x c=(c+(2-a)/c)/2
    return (8*(c+(2-a)/c)-(b+(2-2*x)/b))/6
end

A lot of that is just square root approximation. It works really well, too, unless you get too close to taking a square root of 0. It has an average error (excluding x=0.99 to 1) of 0.0003. The problem, though, is that at 0.99 it starts going to shit, and at x=1, the difference in accuracy becomes 0.05. Of course, this could be solved by doing more iterations on the square roots (lol nope) or, just a little thing like, if x>0.99 then use a different set of square root linearizations, but that makes the code all long and ugly.

If you don't care about accuracy so much, you could just do one iteration per square root, which should still keep you somewhere in the range of 0.0162 or something as far as accuracy goes:

function acos(x)
    local a=1.43+0.59*x a=(a+(2+2*x)/a)/2
    local b=1.65-1.41*x b=(b+(2-2*x)/b)/2
    local c=0.88-0.77*x c=(c+(2-a)/c)/2
    return 8/3*c-b/3
end

If you're okay with it, you can use pre-existing square root code. It will get rid of the the equation going a bit crazy at x=1:

function acos(x)
    local a = math.sqrt(2+2*x)
    local b = math.sqrt(2-2*x)
    local c = math.sqrt(2-a)
    return 8/3*d-b/3
end

Frankly, though, if you're really pressed for time, remember that you could linearize arccos into 3.14159-1.57079x and just do:

function acos(x)
    return 1.57079-1.57079*x
end

Anyway, if you want to see a list of my arccos approximation equations, you can go to https://www.desmos.com/calculator/tcaty2sv8l I know that my approximations aren't the best for certain things, but if you're doing something where my approximations would be useful, please use them, but try to give me credit.

Classic answered 3/1, 2014 at 23:31 Comment(4)

Is this what you meant for the last one? 1.57079-1.57079*x. – Exostosis 30/6, 2015 at 16:26

Also for anyone using c# this might be a good firstline: if (x < -1D || x > 1D || Double.IsNaN(x)) return Double.NaN; to be consistent with the .net framework acos function: msdn.microsoft.com/en-us/library/… – Exostosis 30/6, 2015 at 19:26

Your first implementation is way off at x = -1, like by 0.5rad. – Berber 1/12, 2017 at 3:56

The question is tagged C as well as C++. You don't even hint syntax highlighting the code in this answer may be Lua. – Small 20/3 at 14:27

You can approximate the inverse cosine with a polynomial as suggested by dan04, but a polynomial is a pretty bad approximation near -1 and 1 where the derivative of the inverse cosine goes to infinity. When you increase the degree of the polynomial you hit diminishing returns quickly, and it is still hard to get a good approximation around the endpoints. A rational function (the quotient of two polynomials) can give a much better approximation in this case.

acos(x) ≈ π/2 + (ax + bx³) / (1 + cx² + dx⁴)

where

a = -0.939115566365855
b =  0.9217841528914573
c = -1.2845906244690837
d =  0.295624144969963174

has a maximum absolute error of 0.017 radians (0.96 degrees) on the interval (-1, 1). Here is a plot (the inverse cosine in black, cubic polynomial approximation in red, the above function in blue) for comparison:

The coefficients above have been chosen to minimise the maximum absolute error over the entire domain. If you are willing to allow a larger error at the endpoints, the error on the interval (-0.98, 0.98) can be made much smaller. A numerator of degree 5 and a denominator of degree 2 is about as fast as the above function, but slightly less accurate. At the expense of performance you can increase accuracy by using higher degree polynomials.

A note about performance: computing the two polynomials is still very cheap, and you can use fused multiply-add instructions. The division is not so bad, because you can use the hardware reciprocal approximation and a multiply. The error in the reciprocal approximation is negligible in comparison with the error in the acos approximation. On a 2.6 GHz Skylake i7, this approximation can do about 8 inverse cosines every 6 cycles using AVX. (That is throughput, the latency is longer than 6 cycles.)

Paronomasia answered 3/4, 2016 at 16:25 Comment(2)

Is there a source for these coefficients? – Polydeuces 30/6, 2021 at 9:20

@Polydeuces they are computed by this script: github.com/ruuda/convector/blob/… – Paronomasia 1/7, 2021 at 18:39

Another approach you could take is to use complex numbers. From de Moivre's formula,

ⅈ^x = cos(π/2*x) + ⅈ*sin(π/2*x)

Let θ = π/2*x. Then x = 2θ/π, so

sin(θ) = ℑ(ⅈ^^2θ/π)
cos(θ) = ℜ(ⅈ^^2θ/π)

How can you calculate powers of ⅈ without sin and cos? Start with a precomputed table for powers of 2:

ⅈ⁴ = 1
ⅈ² = -1
ⅈ¹ = ⅈ
ⅈ^1/2 = 0.7071067811865476 + 0.7071067811865475*ⅈ
ⅈ^1/4 = 0.9238795325112867 + 0.3826834323650898*ⅈ
ⅈ^1/8 = 0.9807852804032304 + 0.19509032201612825*ⅈ
ⅈ^1/16 = 0.9951847266721969 + 0.0980171403295606*ⅈ
ⅈ^1/32 = 0.9987954562051724 + 0.049067674327418015*ⅈ
ⅈ^1/64 = 0.9996988186962042 + 0.024541228522912288*ⅈ
ⅈ^1/128 = 0.9999247018391445 + 0.012271538285719925*ⅈ
ⅈ^1/256 = 0.9999811752826011 + 0.006135884649154475*ⅈ

To calculate arbitrary values of ⅈ^x, approximate the exponent as a binary fraction, and then multiply together the corresponding values from the table.

For example, to find sin and cos of 72° = 0.8π/2:

ⅈ^0.8 ≈ ⅈ^205/256 = ⅈ^0b11001101 = ⅈ^1/2 * ⅈ^1/4 * ⅈ^1/32 * ⅈ^1/64 * ⅈ^1/256
= 0.3078496400415349 + 0.9514350209690084*ⅈ

sin(72°) ≈ 0.9514350209690084 ("exact" value is 0.9510565162951535)
cos(72°) ≈ 0.3078496400415349 ("exact" value is 0.30901699437494745).

To find asin and acos, you can use this table with the Bisection Method:

For example, to find asin(0.6) (the smallest angle in a 3-4-5 triangle):

ⅈ⁰ = 1 + 0*ⅈ. The sin is too small, so increase x by 1/2.
ⅈ^1/2 = 0.7071067811865476 + 0.7071067811865475*ⅈ . The sin is too big, so decrease x by 1/4.
ⅈ^1/4 = 0.9238795325112867 + 0.3826834323650898*ⅈ. The sin is too small, so increase x by 1/8.
ⅈ^3/8 = 0.8314696123025452 + 0.5555702330196022*ⅈ. The sin is still too small, so increase x by 1/16.
ⅈ^7/16 = 0.773010453362737 + 0.6343932841636455*ⅈ. The sin is too big, so decrease x by 1/32.
ⅈ^13/32 = 0.8032075314806449 + 0.5956993044924334*ⅈ.

Each time you increase x, multiply by the corresponding power of ⅈ. Each time you decrease x, divide by the corresponding power of ⅈ.

If we stop here, we obtain acos(0.6) ≈ 13/32*π/2 = 0.6381360077604268 (The "exact" value is 0.6435011087932844.)

The accuracy, of course, depends on the number of iterations. For a quick-and-dirty approximation, use 10 iterations. For "intense precision", use 50-60 iterations.

Baryta answered 31/1, 2012 at 2:32 Comment(0)

A fast arccosine implementation, accurate to about 0.5 degrees, can be based on the observation that for x in [0,1], acos(x) ≈ √(2*(1-x)). An additional scale factor improves accuracy near zero. The optimal factor can be found by a simple binary search. Negative arguments are handled according to acos (-x) = π - acos (x).

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <math.h>

// Approximate acos(a) with relative error < 5.15e-3
// This uses an idea from Robert Harley's posting in comp.arch.arithmetic on 1996/07/12
// https://groups.google.com/forum/#!original/comp.arch.arithmetic/wqCPkCCXqWs/T9qCkHtGE2YJ
float fast_acos (float a)
{
    const float PI = 3.14159265f;
    const float C  = 0.10501094f;
    float r, s, t, u;
    t = (a < 0) ? (-a) : a;  // handle negative arguments
    u = 1.0f - t;
    s = sqrtf (u + u);
    r = C * u * s + s;  // or fmaf (C * u, s, s) if FMA support in hardware
    if (a < 0) r = PI - r;  // handle negative arguments
    return r;
}

float uint_as_float (uint32_t a)
{
    float r;
    memcpy (&r, &a, sizeof(r));
    return r;
}

int main (void)
{
    double maxrelerr = 0.0;
    uint32_t a = 0;
    do {
        float x = uint_as_float (a);
        float r = fast_acos (x);
        double xx = (double)x;
        double res = (double)r;
        double ref = acos (xx);
        double relerr = (res - ref) / ref;
        if (fabs (relerr) > maxrelerr) {
            maxrelerr = fabs (relerr);
            printf ("xx=% 15.8e  res=% 15.8e  ref=% 15.8e  rel.err=% 15.8e\n",
                    xx, res, ref, relerr);
        }
        a++;
    } while (a);
    printf ("maximum relative error = %15.8e\n", maxrelerr);
    return EXIT_SUCCESS;
}

The output of the above test scaffold should look similar to this:

xx= 0.00000000e+000  res= 1.56272149e+000  ref= 1.57079633e+000  rel.err=-5.14060021e-003
xx= 2.98023259e-008  res= 1.56272137e+000  ref= 1.57079630e+000  rel.err=-5.14065723e-003
xx= 8.94069672e-008  res= 1.56272125e+000  ref= 1.57079624e+000  rel.err=-5.14069537e-003
xx=-2.98023259e-008  res= 1.57887137e+000  ref= 1.57079636e+000  rel.err= 5.14071269e-003
xx=-8.94069672e-008  res= 1.57887149e+000  ref= 1.57079642e+000  rel.err= 5.14075044e-003
maximum relative error = 5.14075044e-003

Univalence answered 8/1, 2018 at 20:24 Comment(0)

Here is a great website with many options: https://www.ecse.rpi.edu/Homepages/wrf/Research/Short_Notes/arcsin/onlyelem.html

Personally I went the Chebyshev-Pade quotient approximation with with the following code:

double arccos(double x) {
const double pi = 3.141592653;
    return pi / 2 - (.5689111419 - .2644381021*x - .4212611542*(2*x - 1)*(2*x - 1)
         + .1475622352*(2*x - 1)*(2*x - 1)*(2*x - 1))
         / (2.006022274 - 2.343685222*x + .3316406750*(2*x - 1)*(2*x - 1) +
             .02607135626*(2*x - 1)*(2*x - 1)*(2*x - 1));
}

Bolger answered 20/11, 2016 at 10:35 Comment(1)

This is way off at x = -1, like by 0.5rad. Not usable. – Berber 1/12, 2017 at 3:29

If you're using Microsoft VC++, here's an inline __asm x87 FPU code version without all the CRT filler, error checks, etc. and unlike the earliest classic ASM code you can find, it uses a FMUL instead of the slower FDIV. It compiles/works with Microsoft VC++ 2005 Express/Pro what I always stick with for various reasons.

It's a little tricky to setup a function with "__declspec(naked)/__fastcall", pull parameters correctly, handle stack, so not for the faint of heart. If it fails to compile with errors on your version, don't bother unless you're experienced. Or ask me, I can rewrite it in a slightly friendlier __asm{} block. I would manually inline this if it's a critical part of a function in a loop for further performance gains if need be.

extern float __fastcall fs_acos(float x);
extern double __fastcall fs_Acos(double x);

// ACOS(x)- Computes the arccosine of ST(0)
// Allowable range: -1<=x<=+1
// Derivative Formulas: acos(x) = atan(sqrt((1 - x * x)/(x * x))) OR
// acos(x) = atan2(sqrt(1 - x * x), x)
// e.g. acos(-1.0) = 3.1415927

__declspec(naked) float __fastcall fs_acos(float x) { __asm {
    FLD   DWORD PTR [ESP+4] ;// Load/Push parameter 'x' to FPU stack
    FLD1            ;// Load 1.0
    FADD  ST, ST(1) ;// Compute 1.0 + 'x'
    FLD1            ;// Load 1.0
    FSUB  ST, ST(2) ;// Compute 1.0 - 'x'
    FMULP ST(1), ST ;// Compute (1-x) * (1+x)
    FSQRT           ;// Compute sqrt(result)
    FXCH  ST(1)
    FPATAN          ;// Compute arctangent of result / 'x' (ST1/ST0)
    RET 4
}}

__declspec(naked) double __fastcall fs_Acos(double x) { __asm { //
    FLD   QWORD PTR [ESP+4] ;// Load/Push parameter 'x' to FPU stack
    FLD1            ;// Load 1.0
    FADD  ST, ST(1) ;// Compute (1.0 + 'x')
    FLD1            ;// Load 1.0
    FSUB  ST, ST(2) ;// Compute (1.0 - 'x')
    FMULP ST(1), ST ;// Compute (1-x) * (1+x)
    FSQRT           ;// Compute sqrt((1-x) * (1+x))
    FXCH  ST(1) 
    FPATAN          ;// Compute arctangent of result / 'x' (ST1/ST0)
    RET 8
}}

Loftis answered 30/7, 2019 at 19:6 Comment(2)

I doubt that the FPU will be faster than SSE instructions, also, it's unusable for x64 targets since MSVC won't allow inlined asm blocks for such targets – Axiomatic 16/11, 2020 at 23:44

It is advisable to use the _fastcall directive with a parameter declared as const float &x or const double &x. Then only address of x in the ecx register is passed to the function. This is more efficient than creating a copy of the argument on the stack. And, of course, this code is quite slow, so it is advisable to use it only if high precision is needed (extended or at least double) or the compactness of the code is needed. – Leitman 20/3 at 14:48

Unfortunately I do not have enough reputation to comment. Here is a small modification of Nvidia's function, that deals with the fact that numbers that should be <= 1 while preserving performance as much as possible.

It may be important since rounding errors can lead number that should be 1.0 to be (oh so slightly) larger than 1.0.


double safer_acos(double x) {
  double negate = double(x < 0);
  x = abs(x);
  x -= double(x>1.0)*(x-1.0); // <- equivalent to min(1.0,x), but faster
  double ret = -0.0187293;
  ret = ret * x;
  ret = ret + 0.0742610;
  ret = ret * x;
  ret = ret - 0.2121144;
  ret = ret * x;
  ret = ret + 1.5707288;
  ret = ret * sqrt(1.0-x);
  ret = ret - 2 * negate * ret;
  return negate * 3.14159265358979 + ret;

  // In a single line (no gain using gcc)
  //return negate * 3.14159265358979 + (((((-0.0187293*x)+ 0.0742610)*x - 0.2121144)*x + 1.5707288)* sqrt(1.0-x))*(1.0-2.0*negate);

}

Evaevacuant answered 23/1, 2020 at 14:40 Comment(0)

The code is intended for Visual Studio C++ and is based on my proposed fast approximation of the arctangent using the Padé-Chebyshev method (see Fast & accurate atan/arctan approximation algorithm). The function provides almost complete single-precision format accuracy, only slightly inferior in accuracy to the standard acosf() function, but significantly faster (more than fourfold in my system). The function calculates arccos x = 2 arctg ((1-x)/(1+x))^0.5. The maximum absolute error is about 4.7e-7 radians. The root mean square relative error is about 3.8e-8. It should be noted that the approximation error does not have a systematic component; the error is almost entirely due to rounding errors of intermediate calculation results.

_declspec(naked) float _vectorcall arccos(float x)
{
  static const float ct[11] =   // Constants table
  {
    2.26527548f,                // a0*(a1+b0)=a0*c1
    3.50788045f,                // 2*a0*a1
    0.0810196251f,              // a0*(a3+b2)=a0*c3
    0.0289801192f,              // 2*a0*a3
    1.53400576f,                // a0
    3.06801152f,                // 2*a0
    0.901376426f,               // a0*(a2+b1)=a0*c2
    0.906169057f,               // 2*a0*a2
    0.141592654f,               // pi-3
    3.0f,                       // 3
    1.0f                        // 1
  };
  _asm
  {
    mov edx,offset ct           // edx contains the address of constants table
    vmovss xmm1,[edx+40]        // xmm1 = 1
    vsubss xmm2,xmm1,xmm0       // xmm2 = 1-x
    vaddss xmm0,xmm0,xmm1       // xmm0 = 1+x
    vdivss xmm2,xmm2,xmm0       // xmm2 = (1-x)/(1+x) = y
    vmovups xmm3,[edx+16]       // xmm3 = 2*a0*a2 # a0*c2 : 2*a0 # a0
    vsqrtss xmm0,xmm2,xmm2      // xmm0 = y^0.5 = t
    vshufps xmm2,xmm2,xmm2,0    // xmm2 = y # y : y # y
    vmulps xmm4,xmm2,xmm2       // xmm4 = y^2 # y^2 : y^2 # y^2
    vucomiss xmm2,xmm1          // Compare y with 1
    ja arccos_m                 // Goto if x<0 (y>1) and x<>NaN
    vfmadd231ps xmm3,xmm2,[edx] // xmm3 ~ 2*(a3*y+a2) # c3*y+c2 : 2*(a1*y+1) # c1*y+1
    vmovhlps xmm1,xmm1,xmm3     // xmm1 ~ 2*(a3*y+a2) # c3*y+c2
    vfmadd231ps xmm3,xmm4,xmm1  // xmm3 ~ 2*(a3*y^3+a2*y^2+a1*y+1) # c3*y^3+c2*y^2+c1*y+1
    vmovshdup xmm2,xmm3         // xmm2 = 2*P; xmm3 = Q
    vdivss xmm2,xmm2,xmm3       // xmm2 = 2*P/Q
    vmulss xmm0,xmm0,xmm2       // xmm0 = 2*t*P/Q = 2*arctg(t) = arccos(x)
    ret                         // Return
      arccos_m:                 // For t>1 use the formula pi-2*arctg(1/t)
    vmovss xmm1,[edx+32]        // xmm1 = pi-3
    vucomiss xmm0,xmm2          // Compare t with y to reveal infinity (case x=-1)
    jnb arccos_end              // If x=-1 the required data is already in xmm1
    vfmadd213ps xmm3,xmm2,[edx] // xmm3 ~ 2*(a2*y+a3) # c2*y+c3 : 2*(y+a1) # y+c1
    vmovhlps xmm2,xmm2,xmm3     // xmm2 ~ 2*(a2*y+a3) # c2*y+c3
    vfmadd213ps xmm3,xmm4,xmm2  // xmm3 ~ 2*(y^3+a1*y^2+a2*y+a3) # y^3+c1*y^2+c2*y+c3
    vmovshdup xmm4,xmm3         // xmm4 = 2*P; xmm3 = Q
    vmulss xmm0,xmm0,xmm3       // xmm0 = t*Q
    vfmsub132ss xmm1,xmm4,xmm0  // xmm1 = (pi-3)*t*Q-2*P
    vdivss xmm1,xmm1,xmm0       // xmm1 = (pi-3)-2*P/(t*Q)
      arccos_end:               // Add 3 to result
    vaddss xmm0,xmm1,[edx+36]   // xmm0 = pi-2*P/(x*Q) = 2*arctg(t) = arccos(x)
    ret                         // Return
  }
}

Leitman answered 20/3 at 2:10 Comment(0)

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Recommended topics

Hot tags