Payne Hanek algorithm implementation in C
I'm struggling to understand how to implement the range-reduction algorithm published by Payne and Hanek (range reduction for trigonometric functions).

I've seen there's this library: http://www.netlib.org/fdlibm/

But it looks very convoluted to me, and all the theoretical explanations I've found are too simple to serve as the basis for an implementation.

Is there a good explanation of it somewhere?

Puccoon asked 26/5, 2015 at 16:4 Comment(2)
It seems to me like you are expecting us to do your implementation for you?Tbilisi
No, no. But if you know of a PhD thesis or anything else useful for better understanding how to implement it, please share.Puccoon

Performing argument reduction for trigonometric functions via the Payne-Hanek algorithm is actually pretty straightforward. As with other argument reduction schemes, compute n = round_nearest (x / (π/2)), then compute the remainder via x - n * π/2. Better efficiency is achieved by computing n = round_nearest (x * (2/π)).
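
For arguments of modest magnitude, this basic scheme can be sketched directly in double precision (a minimal illustration of my own, not code from any particular library; reduce_basic is a hypothetical name, and nearbyint requires <math.h>):

/* illustrative sketch only: accuracy degrades as |x| grows */
static double reduce_basic (double x, int *quadrant)
{
    const double two_over_pi = 6.366197723675814e-1;  /* 2/pi rounded to double */
    const double pi_over_2   = 1.5707963267948966e+0; /* pi/2 rounded to double */
    double n = nearbyint (x * two_over_pi); /* round quotient to nearest */
    *quadrant = (int)n & 3;                 /* only n mod 4 matters for sin/cos */
    return x - n * pi_over_2;               /* remainder */
}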

The key observation in Payne-Hanek is that when computing the remainder x - n * π/2 using the full unrounded product, the leading bits cancel during the subtraction, so we do not need to compute them. We are left with the problem of finding the right starting point (non-zero bits) based on the magnitude of x. If x is close to a multiple of π/2, there may be additional cancellation, but the amount is bounded; one can consult the literature for an upper bound on the number of additional bits that cancel in such cases. Due to its relatively high computational cost, Payne-Hanek is usually only used for arguments large in magnitude, which has the additional benefit that during the subtraction the bits of the original argument x are zero in the relevant bit positions.
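
To get a feel for why the unrounded product is needed, here is a small demonstration of my own (not part of the answer): reducing a large argument naively in double precision. For x = 2^100 the quotient n and the rounded product n * π/2 each carry errors on the order of 2^46, so after the subtraction the computed remainder bears no relation to the true reduced argument, whose magnitude is at most π/4.

#include <math.h>
#include <stdio.h>

int main (void)
{
    double x = 0x1.0p+100;
    double n = nearbyint (x * 6.366197723675814e-1); /* already inexact by ~2**46 */
    double r = x - n * 1.5707963267948966e+0;        /* all digits are bogus */
    printf ("naive remainder = % .17e\n", r);
    return 0;
}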

Below I show exhaustively tested C99 code for a single-precision sinf() that I wrote recently; it incorporates Payne-Hanek reduction in the slow path of the argument reduction, see trig_red_slowpath_f(). Note that in order to achieve a faithfully rounded sinf(), one would have to augment the argument reduction to return the reduced argument as two float operands in head/tail fashion.

Various design choices are possible; below, I opted for largely integer-based computation in order to minimize the storage for the required bits of 2/π. Implementations that use floating-point computation and store the bits of 2/π as overlapped pairs or triples of floating-point numbers are also common.

#include <math.h>  /* fabsf, frexpf, fmaf */

/* 190 bits of 2/pi for Payne-Hanek style argument reduction. */
static const unsigned int two_over_pi_f [] = 
{
    0x00000000,
    0x28be60db,
    0x9391054a,
    0x7f09d5f4,
    0x7d4d3770,
    0x36d8a566,
    0x4f10e410
};

float trig_red_slowpath_f (float a, int *quadrant)
{
    unsigned long long int p;
    unsigned int ia, hi, mid, lo, i;
    int e, q;
    float r;

    ia = (unsigned int)(fabsf (frexpf (a, &e)) * 0x1.0p32f);
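    /* ia holds the mantissa of |a| scaled to 32 bits: |a| = ia * 2**(e-32) */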

    /* extract 96 relevant bits of 2/pi based on magnitude of argument */ 
    i = (unsigned int)e >> 5;
    e = (unsigned int)e & 31;

    if (e) {
        hi  = (two_over_pi_f [i+0] << e) | (two_over_pi_f [i+1] >> (32 - e));
        mid = (two_over_pi_f [i+1] << e) | (two_over_pi_f [i+2] >> (32 - e));
        lo  = (two_over_pi_f [i+2] << e) | (two_over_pi_f [i+3] >> (32 - e));
    } else {
        hi  = two_over_pi_f [i+0];
        mid = two_over_pi_f [i+1];
        lo  = two_over_pi_f [i+2];
    }

    /* compute product x * 2/pi in 2.62 fixed-point format */
    p = (unsigned long long int)ia * lo;
    p = (unsigned long long int)ia * mid + (p >> 32);
    p = ((unsigned long long int)(ia * hi) << 32) + p;

    /* round quotient to nearest */
    q = (int)(p >> 62);                // integral portion = quadrant<1:0>
    p = p & 0x3fffffffffffffffULL;     // fraction
    if (p & 0x2000000000000000ULL) {   // fraction >= 0.5
        p = p - 0x4000000000000000ULL; // fraction - 1.0
        q = q + 1;
    }

    /* compute remainder of x / (pi/2) */
    double d;

    d = (double)(long long int)p;
    d = d * 0x1.921fb54442d18p-62; // 1.5707963267948966 * 0x1.0p-62
    r = (float)d;
    if (a < 0.0f) {
        r = -r;
        q = -q;
    }

    *quadrant = q;
    return r;
}

/* Like rintf(), but -0.0f -> +0.0f, and |a| must be <= 0x1.0p+22 */
float quick_and_dirty_rintf (float a)
{
    float cvt_magic = 0x1.800000p+23f;
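    /* adding 1.5 * 2**23 places the rounding position at the ones place for
       any |a| <= 2**22; the subsequent subtraction is then exact */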
    return (a + cvt_magic) - cvt_magic;
}

/* Argument reduction for trigonometric functions that reduces the argument
   to the interval [-PI/4, +PI/4] and also returns the quadrant. It returns 
   -0.0f for an input of -0.0f 
*/
float trig_red_f (float a, float switch_over, int *q)
{    
    float j, r;

    if (fabsf (a) > switch_over) {
        /* Payne-Hanek style reduction. M. Payne and R. Hanek, Radian reduction
           for trigonometric functions. SIGNUM Newsletter, 18:19-24, 1983
        */
        r = trig_red_slowpath_f (a, q);
    } else {
        /* FMA-enhanced Cody-Waite style reduction. W. J. Cody and W. Waite, 
           "Software Manual for the Elementary Functions", Prentice-Hall 1980
        */
        j = 0x1.45f306p-1f * a;             // 2/pi
        j = quick_and_dirty_rintf (j);
        r = fmaf (j, -0x1.921fb0p+00f, a);  // pio2_high
        r = fmaf (j, -0x1.5110b4p-22f, r);  // pio2_mid
        r = fmaf (j, -0x1.846988p-48f, r);  // pio2_low
        *q = (int)j;
    }
    return r;
}

/* Approximate sine on [-PI/4,+PI/4]. Maximum ulp error = 0.64721
   Returns -0.0f for an argument of -0.0f
   Polynomial approximation based on unpublished work by T. Myklebust
*/
float sinf_poly (float a, float s)
{
    float r;

    r =              0x1.7d3bbcp-19f;
    r = fmaf (r, s, -0x1.a06bbap-13f);
    r = fmaf (r, s,  0x1.11119ap-07f);
    r = fmaf (r, s, -0x1.555556p-03f);
    r = r * s + 0.0f; // ensure -0 is passed through
    r = fmaf (r, a, a);
    return r;
}

/* Approximate cosine on [-PI/4,+PI/4]. Maximum ulp error = 0.87531 */
float cosf_poly (float s)
{
    float r;

    r =              0x1.98e616p-16f;
    r = fmaf (r, s, -0x1.6c06dcp-10f);
    r = fmaf (r, s,  0x1.55553cp-05f);
    r = fmaf (r, s, -0x1.000000p-01f);
    r = fmaf (r, s,  0x1.000000p+00f);
    return r;
}

/* Map sine or cosine value based on quadrant */
float sinf_cosf_core (float a, int i)
{
    float r, s;

    s = a * a;
    r = (i & 1) ? cosf_poly (s) : sinf_poly (a, s);
    if (i & 2) {
        r = 0.0f - r; // don't change "sign" of NaNs
    }
    return r;
}

/* maximum ulp error = 1.49241 */
float my_sinf (float a)
{
    float r;
    int i;

    a = a * 0.0f + a; // inf -> NaN
    r = trig_red_f (a, 117435.992f, &i);
    r = sinf_cosf_core (r, i);
    return r;
}

/* maximum ulp error = 1.49510 */
float my_cosf (float a)
{
    float r;
    int i;

    a = a * 0.0f + a; // inf -> NaN
    r = trig_red_f (a, 71476.0625f, &i);
    r = sinf_cosf_core (r, i + 1);
    return r;
}
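
For completeness, a small test harness of my own (not part of the original answer) that exercises the slow path for a few large-magnitude arguments, using the platform's double-precision sin() as a reference:

#include <stdio.h>

int main (void)
{
    /* arbitrary large-magnitude test arguments, all taking the slow path */
    float test [] = { 0x1.0p+25f, 0x1.0p+80f, -0x1.0p+100f, 0x1.fffffep+127f };
    for (int i = 0; i < 4; i++) {
        float x = test [i];
        printf ("x = % 14.6e   my_sinf = % .8e   ref = % .8e\n",
                x, my_sinf (x), (float)sin ((double)x));
    }
    return 0;
}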
Trochee answered 26/5, 2015 at 18:4 Comment(25)
Why choose π/2 instead of π/4? (I can't find the original paper on the Payne-Hanek algorithm.)Puccoon
Note that the rounding of the quotient here is to-nearest, causing the magnitude of the arguments passed to the core approximations to be <= π/4. So only two core approximations are required, and they can be used for both cosf() and sinf(). I tried to keep the code as simple as possible in hopes of making it easy to read and easy to understand the mechanics of the algorithm. For the same reason I did not target a faithfully rounded function, which would require a double-float output from the reduction process.Trochee
I'm studying your code. By the way, by "core approximation" I guess you mean the approximation of sin and cos using the polynomial approximations, don't you?Puccoon
By "core approximations" I meant the the two polynomial minimax approximations on the primary approximation interval [- π/4, + π/4], correct. I have now added the cosf() implementation to demonstrate that it simply re-uses the code and approximations used for sinf().Trochee
I'm trying to follow your code using the original paper, which I've just found. If you have the same paper, would it be a problem if I annotated your code following that paper? Then you could confirm whether or not I understood it correctly.Puccoon
OK, so anyway it looks to me like there is some degree of freedom in implementing this algorithm, probably related to the required precision. So first of all, why are you using exactly 190 bits? Why do you extract exactly 96 bits from them? And why do you use fixed-point arithmetic for the product?Puccoon
@Lukkio My code does not slavishly follow the original paper in every detail, but essentially it is the same algorithm. I found the original paper very hard to understand when I first worked through it many years ago. I boldly claim my code above is easier to follow than the paper. The number of bits of 2/π (or equivalent constant) required is addressed in the paper; very roughly we need 128 (powers of 2 covered by float when using fixed point) + 26 (to produce a rounded final result) + 30 (maximum additional bits that cancel near multiples of π/2 when the argument is a float).Trochee
Why extract 96 bits of 2/π? Because 64 bits (two chunks) are insufficient to produce an accurate result. Leading bits can cancel near multiples of π/2, and we need trailing bits to avoid excessive round-off. Some implementations distinguish two cases depending on how much cancellation there is; for simplicity I am not doing that here. I have done a lot of work on SIMD and SIMT, and "branches are evil" informs my coding style. I have not examined exactly how many bits we really need. It is more than 64 but less than 96. Since the code operates in chunks of 32 bits, 96 it is.Trochee
Some empirical data with regard to the above code: The maximum number of additional bits cancelling near multiples of π/2 is 29. The number of extracted bits of 2/π must be >= 86 to maintain the stated ulp error boundaries for sinf() and cosf(). Note that the fixed-point arithmetic does not round, but simply truncates, which increases round-off error.Trochee
Hi again, sorry to bother you again. I was studying your code following the documentation I have, and I was wondering how to choose the optimal number of bits for computing the product x*(2/π). As you said, your code is a "basic version" of the whole algorithm, but if I wanted to implement a "robust version", should I implement an inline error analysis? The error of this range reduction is evaluated using continued-fraction theory, so basically when I pack your variables "hi, mid, lo", maybe I should use a precision parameter "p" derived from the continued-fraction error analysis, shouldn't I?Puccoon
I am not sure what you mean by "robust". The code I posted was tested exhaustively. For the sake of clarity I simplified as much as possible to make it easier to understand the constituent steps. If you want to return the reduced argument as a double-float for the purpose of achieving a faithfully rounded sinf(), you can certainly do so, and the 96 bits extracted from 2/π are sufficient for that. If you want to construct a double-precision version, you would want to use three chunks of 64 bits instead. For double precision, near multiples of π/2 at most 62 additional bits cancel, as I recall.Trochee
OK, I'll try to work on it. Thank you, I'll let you know.Puccoon
Hi again, may I ask how you derived the inverse of pi?Puccoon
@Lukkio: Arbitrary precision arithmetic package. In particular, R.P. Brent's MP from 1978.Trochee
I have a question: in your multiword representation of 2/π, did you use a fixed-point representation? You said you used your representation in order to minimize the storage requirement for the constant, but I wonder whether there are advantages to using a floating-point representation instead (like fdlibm does, for example).Puccoon
Which approach works best depends on your individual circumstances, so you will need to make your own decision. I chose fixed-point arithmetic here for the most compact representation of the long constant, as already stated.Trochee
My circumstances are a bandwidth/accuracy trade-off. For example, assuming I want to implement the algorithm for a 16-bit IEEE system, with specific emphasis on accuracy, which choice would fit better? I still don't understand very well how I should choose the size of the LUT for the inverse of pi, the precision of each entry, etc.Puccoon
Trade-offs presumably also include other aspects of the CPU, e.g. relative throughput of integer and floating-point units? I think I explained before that the number of bits required for the constant is roughly 2**[num bits in exponent-1] + [length of mantissa + 2] + [max. bits that cancel during subtraction], where [max. bits that cancel during subtraction] is typically on the order of 1.25*[length of mantissa]. So for the IEEE-754 fp16 format that gives us an estimate of 2**4+(11+2)+(1.25*11) = 43 bits. For fp16, you should be able to establish the exact number of bits required by a simple exhaustive test.Trochee
But if I wanted to store floating-point values instead of fixed-point values, is the formula you showed still valid (the one for the bits required, I mean)? The original paper basically talks about fixed-point storage (as an example) rather than floating-point.Puccoon
Let us continue this discussion in chat.Puccoon
Hi again. At last I succeeded in implementing the algorithm; I tried to follow the original paper as much as I could, I've done some tests, and apparently it works fine. I have a final question: you said you tested your implementation (the one you posted here). My question is, against what did you test your implementation exhaustively? Did you use a multiple-precision library? Which criteria did you use for the test, etc.? Thank you again; this algorithm has been a serious pain...Puccoon
@Lukkio I tested these single-precision functions against third-party double-precision implementations. With modern hardware, this can be done exhaustively in a few minutes. For validation, I tracked maximum ulp error. For a double-precision implementation, one can use a multi-precision library as the reference, but the checks cannot be exhaustive unless one dedicates a cluster of machines to it. But the worst case arguments for argument reduction in trigonometric functions have been published in the literature and can certainly be covered, along with many random test vectors.Trochee
@Trochee Why not use the Payne-Hanek algorithm directly for reduction to π/4? Shouldn't dividing the array values and the π/2 * 2^-62 constant by 2 result in a π/4 reduction?Fosque
@Fosque The code above is a Payne-Hanek reduction, and the reduced argument is in [-π/4, +π/4].Trochee
@Trochee I beg your pardon, you are right! I missed the lines where you check the quotient.Fosque

As someone who once tried to implement it, I feel your pain (no pun intended).

Before attempting it, make sure you have a good understanding of when floating point operations are exact, and know how double-double arithmetic works.
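
If you need a refresher on the latter, the basic building block is the error-free transformation of a sum. A minimal sketch of my own in C (Fast2Sum, which assumes |a| >= |b|; not code from any of the libraries below):

/* Fast2Sum: for |a| >= |b|, s + t == a + b exactly, with s = fl(a + b) */
static void fast_two_sum (double a, double b, double *s, double *t)
{
    *s = a + b;
    *t = b - (*s - a); /* exact rounding error of the addition */
}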

Other than that, my best advice is to look at what other smart people have done:

  • NETLIB: you mentioned it in the question, but this is the file you're interested in. It's a bit confusing, as it tries to do 80-bit long doubles as well.
  • OS X (from 10.7.5: Apple no longer makes their libm source available): look for ReduceFull
  • glibc
Tetravalent answered 26/5, 2015 at 17:6 Comment(5)
Regarding the theoretical part, I've been looking at the Muller books (two specifically: one is a handbook of floating-point arithmetic and the other is specifically focused on function approximation). Apparently the algorithm is very simple, but I need to understand how an actual implementation should be done. I also tried yesterday to understand the Netlib code, but it is a pain... so I guess it will take some time to reverse-engineer the code written by "smart people".Puccoon
Hi there, I'm sorry, but I'm still struggling with this painful algorithm. I've tried to take a look at glibc and Netlib and to follow the implementation using the original paper, just to understand the meaning of the several variables used, especially in the k_rem_pio2 function. You said you tried to implement it; did you fully understand the code you posted? Do you mind if I ask you some questions about that code?Puccoon
I never finished it, and don't have the code available. That said, I think the OS X code is the easiest, and Muller's 2005 book (Elementary Functions) seems to have the best explanation.Tetravalent
I did get around to finishing my implementation gist.github.com/simonbyrne/d640ac1c1db3e1774cf2d405049beef3. It's written in Julia, but it shouldn't be too hard to translate to C if you can take advantage of the checked arithmetic operations (available in gcc or clang)Tetravalent
Thank you! I'll have a look as soon as I can.Puccoon
