Is there an accurate approximation of the acos() function?
I need an acos() function with double precision within a compute shader. Since GLSL offers no built-in double-precision acos(), I tried to implement my own.

At first, I implemented a Taylor series like the equation from Wiki - Taylor series, with precalculated factorial values. But that seems to be inaccurate around 1: the maximum error was about 0.08 with 40 iterations.

I also implemented this method, which works very well on the CPU with a maximum error of -2.22045e-16, but I have had trouble implementing it within the shader.

Currently, I am using an acos() approximation function from here, where somebody posted their approximation functions. I am using the most accurate function from that site and now get a maximum error of -7.60454e-08, but even that error is too high.

My code of this function is:

double myACOS(double x)
{
    double part[4];
    part[0] = 32768.0/2835.0*sqrt(2.0-sqrt(2.0+sqrt(2.0+sqrt(2.0+2.0*x))));
    part[1] = 256.0/135.0*sqrt(2.0-sqrt(2.0+sqrt(2.0+2.0*x)));
    part[2] = 8.0/135.0*sqrt(2.0-sqrt(2.0+2.0*x));
    part[3] = 1.0/2835.0*sqrt(2.0-2.0*x);
    return (part[0]-part[1]+part[2]-part[3]);
}

Does anybody know another implementation of acos() which is very accurate and, if possible, easy to implement in a shader?

Some system information:

  • Nvidia GT 555M
  • running OpenGL 4.3 with optirun
Mahau answered 10/3, 2015 at 16:33 Comment(19)
Why do you need acos? If it is for slerp, you can divide and conquer with repeated lerps. – Turnout
There is a standard acos in <cmath>. – Edgardo
Holy crap, use a lookup table if you need that many sqrts. – Pressmark
@Edgardo cmath is not available in a GLSL shader. – Turnout
@AndonM.Coleman I would use one, but this was just for the proof of concept. And as I said, accuracy is more important than performance. – Mahau
I'm wondering whether these approximations are actually faster... Did you run any benchmarks? – Letti
Why is this tagged C++ when the language is GLSL? – Marxist
@CommuSoft I am not interested in the performance, so I didn't. I just tested the accuracy on the CPU against the implementation of acos() from math.h. – Mahau
What are you using the returned angle for? 9 times out of 10 you can get to the result you want without acos. – Turnout
What's the range of arguments you need to support? [-1.0, 1.0]? – Aphis
@RetoKoradi Yes. From -1.0 to 1.0. – Mahau
@ratchetfreak To solve a complex equation in a kind of MLS algorithm. – Mahau
What's your target hardware then? GPUs with Compute 1.3 or higher (NVIDIA cards GT200 and up) support double-precision arithmetic. I assume these functions have double precision too. – Handmaiden
@Handmaiden I added the information to the question. In OpenCL and CUDA there are double-precision functions for acos(), but not within a shader. Unfortunately I am a bit restricted to compute shaders. – Mahau
"I have some troubles to implement this within the shader."... such as? – Mcglothlin
Perhaps: (1) asin(x) = x + (1/2)(x^3/3) + (1/2)(3/4)(x^5/5) + (1/2)(3/4)(5/6)(x^7/7) + ... (2) acos(x) = pi/2 - asin(x). As a quick test, I got asin = 0.89726889283942934 with the stdlib giving 0.90384998229123237. For acos(x) I got 0.67415967858914205 with the stdlib giving 0.66694634450366430. Not sure how many iterations it takes to converge to the kind of precision you want. – Handmaiden
To add, there's a different way of calculating it around 1, because as you say it's inaccurate there. There's a question here about it: #20196500 – Handmaiden
@Handmaiden Your first comment is exactly the Taylor series I already implemented. But I will have a look at the other question. – Mahau
A different route is to express acos in terms of atan; not sure that's better. – Acidic
My current accurate shader implementation of acos() is a mix of the usual Taylor series and the answer from Bence. With 40 iterations I get agreement to within 4.44089e-16 of the acos() implementation from math.h. Maybe it is not the best, but it works for me. Here it is:

// Maclaurin series of asin(x): x + x^3/6 + 3x^5/40 + ...
double myASIN2(double x)
{
    double sum, tempExp;
    tempExp = x;
    double factor = 1.0;
    double divisor = 1.0;
    sum = x;
    for(int i = 0; i < 40; i++)
    {
        tempExp *= x*x;
        divisor += 2.0;
        factor *= (2.0*double(i) + 1.0)/((double(i)+1.0)*2.0);
        sum += factor*tempExp/divisor;
    }
    return sum;
}

double myASIN(double x)
{
    if (abs(x) <= 0.71)
        return myASIN2(x);
    else if (x > 0)
        return (PI/2.0 - myASIN2(sqrt(1.0 - (x*x))));
    else // x < 0 or x is NaN
        return (myASIN2(sqrt(1.0 - (x*x))) - PI/2.0);
}

double myACOS(double x)
{
    return (PI/2.0 - myASIN(x));
}

Any comments on what could be done better? For example, using a LUT for the values of factor; but in my shader acos() is called just once, so there is no need for it.

Mahau answered 11/3, 2015 at 13:43 Comment(0)
The NVIDIA GT 555M is a device with compute capability 2.1, so there is native hardware support for basic double-precision operations, including fused multiply-add (FMA). As with all NVIDIA GPUs, the square-root operation is emulated. I am familiar with CUDA, but not GLSL. According to version 4.3 of the GLSL specification, it exposes double-precision FMA as a function fma() and provides a double-precision square root, sqrt(). It is not clear whether the sqrt() implementation is correctly rounded according to IEEE-754 rules; I will assume it is, by analogy with CUDA.

Instead of using a Taylor series, one would want to use a polynomial minimax approximation, thereby reducing the number of terms required. Minimax approximations are typically generated using a variant of the Remez algorithm. To optimize both speed and accuracy, the use of FMA is essential. Evaluating the polynomial with a Horner scheme is conducive to high accuracy; in the code below, a second-order Horner scheme is used. As in DanceIgel's answer, acos is conveniently computed using an asin approximation as the basic building block, in conjunction with standard mathematical identities.

With 400M test vectors, the maximum relative error seen with the code below was 2.67e-16, while the maximum ulp error observed is 1.442 ulp.

/* compute arcsin (a) for a in [-9/16, 9/16] */
double asin_core (double a)
{
    double q, r, s, t;

    s = a * a;
    q = s * s;
    r =             5.5579749017470502e-2;
    t =            -6.2027913464120114e-2;
    r = fma (r, q,  5.4224464349245036e-2);
    t = fma (t, q, -1.1326992890324464e-2);
    r = fma (r, q,  1.5268872539397656e-2);
    t = fma (t, q,  1.0493798473372081e-2);
    r = fma (r, q,  1.4106045900607047e-2);
    t = fma (t, q,  1.7339776384962050e-2);
    r = fma (r, q,  2.2372961589651054e-2);
    t = fma (t, q,  3.0381912707941005e-2);
    r = fma (r, q,  4.4642857881094775e-2);
    t = fma (t, q,  7.4999999991367292e-2);
    r = fma (r, s, t);
    r = fma (r, s,  1.6666666666670193e-1);
    t = a * s;
    r = fma (r, t, a);

    return r;
}

/* Compute arccosine (a), maximum error observed: 1.4316 ulp
   Double-precision factorization of π courtesy of Tor Myklebust
*/
double my_acos (double a)
{
    double r;

    r = (a > 0.0) ? -a : a; // avoid modifying the "sign" of NaNs
    if (r > -0.5625) {
        /* arccos(x) = pi/2 - arcsin(x) */
        r = fma (9.3282184640716537e-1, 1.6839188885261840e+0, asin_core (r));
    } else {
        /* arccos(x) = 2 * arcsin (sqrt ((1-x) / 2)) */
        r = 2.0 * asin_core (sqrt (fma (0.5, r, 0.5)));
    }
    if (!(a > 0.0) && (a >= -1.0)) { // avoid modifying the "sign" of NaNs
        /* arccos (-x) = pi - arccos(x) */
        r = fma (1.8656436928143307e+0, 1.6839188885261840e+0, -r);
    }
    return r;
}
Unification answered 8/3, 2017 at 22:51 Comment(0)
The "problem" with acos is that it has singularities (branch points) at ±1, and for that reason Taylor or Padé won't approximate well. The same applies to MiniMax approximations that try to minimize the max-norm when the domain contains a singularity.

The singularities are basically square roots, so they can be factored out; the remaining function is smooth and approximates well. This remaining function is asinQ in the C99 code that follows: a MiniMax rational approximation computed with the Remez algorithm. (The code is actually GNU C99, so M_PI is available.)

Notes on the code:

  • The approximation minimizes the relative error because it's about floating point.

  • It uses sqrt (square root), fma (fused multiply-add), and ldexp (scale by a power of 2)

  • It uses if-else and conditional assignment. I am not familiar with shaders and whether that ruins the performance; if so, maybe the code can be reorganized. Each path evaluates asinQ for some value 0 ≤ x ≤ 1/2.

#include <math.h>
#include <stdio.h>

static const double P[] =
{
    -5254.7508920534192751630665,
    +6236.4128689522053163220058,
    -2597.2384260994230261835544,
    +445.05537923135581032176275,
    -27.342673770163272135013850,
    +0.30325426599090297962612977
};

static const double Q[] =
{
    -5254.7508920534192747096309,
    +6674.3087766233233966852037,
    -3054.9042449253514128522165,
    +603.81083008747677276795370,
    -47.647718147243950851054008
    // 1.0
};

// Approximate arcsin(sqrt(x/2)) / sqrt(x/2) as MiniMax [5/5] over [0, 1/2].
static inline double asinQ (double x)
{
    double p = fma (P[5], x, P[4]);
    p = fma (p, x, P[3]);
    p = fma (p, x, P[2]);
    p = fma (p, x, P[1]);
    p = fma (p, x, P[0]);

    double q = Q[4] + x; // Q[5] = 1.0
    q = fma (q, x, Q[3]);
    q = fma (q, x, Q[2]);
    q = fma (q, x, Q[1]);
    q = fma (q, x, Q[0]);

    return p / q;
}

double my_acos (double x)
{
    double x_abs = x > 0.0 ? x : -x;

    if (x_abs <= 0.5)
    {
        return M_PI/2 - x * asinQ (ldexp (x*x, 1));
    }
    else
    {
        double x1 = 1.0 - x_abs;
        double ret = sqrt (ldexp (x1, 1)) * asinQ (x1);
        return x > 0.0 ? ret : M_PI - ret;
    }
}

int main (int argc, char *argv[])
{
    double max_err = -1.0;
    double min_bits = 0;
    double max_x = 0;
    int ex = 8;
    if (argc >= 2)
        sscanf (argv[1], "%i", &ex);

    for (double x = -1.0; x <= 1.0; x += ldexp (1, -ex))
    {
        double err = (my_acos (x) - acos(x)) / acos (x);
        double bits = - log2 (fabs (err));
        //printf ("%+.6f: % 6e (%.2f bits)\n", x, err, bits);
        if (fabs (err) > max_err)
            max_err = fabs (err), min_bits = bits, max_x = x;
    }
    printf ("worst (%ld values): x=%+.6f, err=%6e, bits=%.2f\n",
            1 + (2L << ex), max_x, max_err, min_bits);
    return 0;
}

Running this for 16 million values prints

worst (16777217 values): x=+0.839781, err=5.803408e-16, bits=50.61

So the maximal relative error is around 6·10⁻¹⁶, which is a precision of around 50.5 bits. (The absolute error for that value is 3.4·10⁻¹⁶.) Assuming acos and sqrt introduce an error of 0.5 ULPs, the precision of my_acos is around 1.5 ULPs.

Ting answered 29/8, 2023 at 17:46 Comment(0)
Maybe this solution helps. It is within 1% of the correct angle for x from 0.2 to 1.

acos(x) ~= sqrt(2) * ( sqrt(1-x) + (1/11) * (sqrt(1-x))^3 )

This started with a Taylor series provided by Wolfram, which required too many terms for even a rough value below 0.8. This method uses the general form of the first two terms but changes the coefficients to improve the match. Interestingly, an integer coefficient of 11 worked.

Cuneiform answered 17/1, 2023 at 0:34 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.