Efficiently computing (a - K) / (a + K) with improved accuracy

In various contexts, for example in argument reduction for mathematical functions, one needs to compute (a - K) / (a + K), where a is a positive variable argument and K is a constant. In many cases, K is a power of two, which is the use case relevant to my work. I am looking for efficient ways to compute this quotient more accurately than can be accomplished with the straightforward division. Hardware support for fused multiply-add (FMA) can be assumed, as this operation is provided by all major CPU and GPU architectures at this time, and is available in C/C++ via the functions fma() and fmaf().
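The key property exploited throughout is that if q is the correctly rounded quotient m / p, the residual m - q * p is exactly representable in the same format (barring overflow and underflow), so a single FMA recovers it without any rounding error; the same idea carries over to quotients computed via a reciprocal. A minimal sketch, with illustrative variable names:

 q = m / p;            /* correctly rounded quotient */
 e = fmaf (q, -p, m);  /* residual m - q * p, computed exactly by the FMA */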

For ease of exploration, I am experimenting with float arithmetic. Since I plan to port the approach to double arithmetic as well, no operations of higher precision than that of the argument and result may be used. My best solution so far is:

 /* Compute q = (a - K) / (a + K) with improved accuracy. Variant 1 */
 m = a - K;
 p = a + K;
 r = 1.0f / p;             /* approximate reciprocal of a + K */
 q = m * r;                /* approximate quotient */
 t = fmaf (q, -2.0f*K, m);
 e = fmaf (q, -m, t);      /* residual m - q * (m + 2K) = m - q * (a + K) */
 q = fmaf (r, e, q);       /* apply the correction */

For arguments a in the interval [K/2, 4.23*K], the code above computes the quotient almost correctly rounded for all inputs (maximum error is exceedingly close to 0.5 ulps), provided that K is a power of 2 and there is no overflow or underflow in intermediate results. For K not a power of two, this code is still more accurate than the naive algorithm based on division. In terms of performance, this code can be faster than the naive approach on platforms where the floating-point reciprocal can be computed faster than the floating-point division.
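For reference, my exhaustive test is structured roughly like the following sketch; improved_div() stands for the code above wrapped in a function, and ulp_error() is a placeholder for my framework's ulp measurement against a double-precision reference (assumes <math.h>):

 /* Sketch of the exhaustive error scan over [K/2, 4.23*K] */
 const float K = 2.0f;                /* example: a power of two */
 float a = 0.5f * K, maxerr = 0.0f;
 while (a <= 4.23f * K) {
     double ref = ((double)a - K) / ((double)a + K);  /* near-exact reference */
     float err = ulp_error (improved_div (a), ref);   /* hypothetical helper */
     if (err > maxerr) maxerr = err;
     a = nextafterf (a, INFINITY);    /* visit every float in the interval */
 }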

I make the following observation when K = 2^n: when the upper bound of the work interval increases to 8*K, 16*K, ..., the maximum error increases gradually and slowly approaches the maximum error of the naive computation from below. Unfortunately, the same does not appear to be true for the lower bound of the interval. If the lower bound drops to 0.25*K, the maximum error of the improved method above equals the maximum error of the naive method.

Is there a method to compute q = (a - K) / (a + K) that can achieve smaller maximum error (measured in ulp vs the mathematical result) compared to both the naive method and the above code sequence, over a wider interval, in particular for intervals whose lower bound is less than 0.5*K? Efficiency is important, but a few more operations than are used in the above code can likely be tolerated.


In one answer below, it was pointed out that accuracy could be enhanced by returning the quotient as an unevaluated sum of two operands, that is, as a head-tail pair q:qlo, similar to the well-known double-float and double-double formats. In my code above, this would mean changing the last line to qlo = r * e.

This approach is certainly useful, and I had already contemplated its use for an extended-precision logarithm for use in pow(). But it doesn't fundamentally help with the desired widening of the interval on which the enhanced computation provides more accurate quotients. In a particular case I am looking at, I would like to use K=2 (for single precision) or K=4 (for double precision) to keep the primary approximation interval narrow, and the interval for a is roughly [0,28]. The practical problem I am facing is that for arguments < 0.25*K the accuracy of the improved division is not substantially better than with the naive method.
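For concreteness, the head-tail version of my code would be identical to Variant 1 except for the last line:

 /* Compute q:qlo = (a - K) / (a + K) as a head-tail pair. Sketch. */
 m = a - K;
 p = a + K;
 r = 1.0f / p;
 q = m * r;
 t = fmaf (q, -2.0f*K, m);
 e = fmaf (q, -m, t);
 qlo = r * e;   /* tail: the correction that was previously folded into q */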

Heartthrob answered 16/2, 2016 at 4:51 Comment(9)
Have you attempted to model the average error curve for your algorithm and add it to the result? - Lisalisabet
I am not sure what you mean by "average error curve". I am interested in minimizing the maximum error, measured in ulps. I am determining the error by exhaustive testing over the test interval, which is why I am using single-precision arithmetic for my exploratory work. - Heartthrob
I wonder if it's worth looking at the relative errors of: (a / (a + k)) - (k / (a + k))? - Sydney
@Sydney Rewriting the expression in this way will cause the maximum ulp error to explode, due to subtractive cancellation when a is near K. - Heartthrob
If double precision is not prohibitively expensive, then it makes sense to run a fast approximation in higher precision and terminate early. When K is a power of 2 this is very feasible (because dividing by K is cheap). - Amnesia
Unfortunately, on some platforms, double operations are much more expensive (as much as 32 times as expensive as float operations). Since I also want to use the same algorithm for double, there are no cheap "quadruple" operations one can use there. Hence the requirement to use only "native" width operations (which also makes vectorization easier). - Heartthrob
Just don't forget that a has an expected error of 0.5 ulp to begin with, and K might, too. - Gilberte
@Heartthrob: Rewriting (a-K)/(a+K) to a/(a+K) - K/(a+K) will cause the error to explode - how is the problem of cancellation different before and after the rewrite? - Gilberte
After the rewrite, the subtraction is between two computed terms that are not free of error. This is not the case for a and K (in the context of this computation). By simple exhaustive test one can easily show that with K=2^n, the maximum ulp error of the naive computation is < 1.625 ulps, but after the proposed rewrite the maximum error is in the 100,000s of ulps. Does your own testing show different results? For proper testing, one needs to ensure no higher precision creeps into intermediate computation. I am using the /fp:strict setting of my compiler, and also mark variables volatile. - Heartthrob

If a is large compared to K, then (a-K)/(a+K) = 1 - 2K / (a + K) will give a good approximation. If a is small compared to K, then 2a / (a + K) - 1 will give a good approximation. If K/2 ≤ a ≤ 2K, then a-K is an exact operation (by Sterbenz's lemma), so doing the division will give a decent result.
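A direct transcription of this three-regime selection might look as follows (a sketch; the switchover points are taken verbatim from the text above):

 /* Sketch: choose the formula based on where a lies relative to K */
 p = a + K;
 if ((0.5f*K <= a) && (a <= 2.0f*K)) {
     q = (a - K) / p;        /* a - K is exact here (Sterbenz's lemma) */
 } else if (a > 2.0f*K) {
     q = 1.0f - 2.0f*K / p;  /* a large compared to K */
 } else {
     q = 2.0f*a / p - 1.0f;  /* a small compared to K */
 }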

Deflation answered 16/2, 2016 at 17:25 Comment(2)
If you could suggest switchover points between the three suggested code paths, I would be happy to run this through my test framework. While multi-branch code is not necessarily friendly to vectorization and therefore possibly inefficient, in this case that issue may be solvable by predication. - Heartthrob
Sorry, I overlooked that the switch-over points are already sufficiently specified. I translated the algorithm into C code as shown below, and find that the maximum ulp error on [0.5*K, 4*K) is just a tad under 2.5 ulps, which is larger than with the naive method: m = a - K; p = a + K; if ((0.5f*K <= a) && (a <= 2.0f*K)) { q = m / p; } else if (a < 0.5f*K) { q = 1.0f - 2.0f*K / p; } else { q = (2.0f * a) / p - 1.0f; } - Heartthrob

One possibility is to track the errors of m and p in m1 and p1 with the classical Dekker/Shewchuk two-sum:

m = a - k;
k0 = a - m;    /* the value of k effectively subtracted */
a0 = k0 + m;   /* the value of a effectively used */
k1 = k0 - k;   /* error contribution of k */
a1 = a - a0;   /* error contribution of a */
m1 = a1 + k1;  /* m + m1 == a - k exactly */

p = a + k;
k0 = p - a;    /* the value of k effectively added */
a0 = p - k0;   /* the value of a effectively used */
k1 = k - k0;   /* error contribution of k */
a1 = a - a0;   /* error contribution of a */
p1 = a1 + k1;  /* p + p1 == a + k exactly */

Then, correct the naive division:

q = m / p;
r0 = fmaf (p, -q, m);    /* residual of the head: m - q * p, exact */
r1 = fmaf (p1, -q, m1);  /* residual contribution of the tails */
r = r0 + r1;             /* total residual */
q1 = r / p;              /* correction term */
q = q + q1;

That'll cost you 2 divisions, but should be near half ulp if I didn't screw up.

But these divisions can be replaced by multiplications with the inverse of p without any problem, since the first incorrectly rounded division will be compensated by the remainder r, and the second incorrectly rounded division does not really matter (the last bits of the correction q1 won't change anything).
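A sketch of that reciprocal-based variant, with one reciprocal plus back-multiplies in place of the two divisions, per the reasoning above:

rcp = 1.0f / p;          /* a single reciprocal replaces both divisions */
q = m * rcp;
r0 = fmaf (p, -q, m);    /* residual of the head */
r1 = fmaf (p1, -q, m1);  /* residual contribution of the tails */
r = r0 + r1;
q1 = r * rcp;            /* correction via back-multiply */
q = q + q1;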

Vaucluse answered 17/2, 2016 at 0:21 Comment(3)
This seems to be basically the div2 approach suggested by Simon Byrne, using 18 operations including two divisions. This is fully coded, however. My experiments show that maximum error is very near 0.5 ulp on [0.5*K, 32*K), so this seems to be doing great when the upper bound of the interval is increased. However, decreasing the lower bound to 0.25*K increases the maximum ulp error to slightly less than 2 ulps, worse than the naive method's maximum error of ~1.625 ulps. Is that fixable? - Heartthrob
Ah, it looks like I screwed up the sign of the error m1... Let me check again. It should be better now that I edited my answer. - Vaucluse
With the help of FMA, a double-float division can be coded such that only a single reciprocal operation is needed, rather than two full divisions. I suspect a similar optimization is possible here. - Heartthrob

I don't really have an answer (proper floating-point error analyses are very tedious), but a few observations:

  • Fast reciprocal instructions (such as RCPSS) are not as accurate as division, so you may see a reduction in accuracy if using these.
  • m is computed exactly if a ∈ [0.5×Kb, 2^(1+n)×Kb), where Kb is the power of 2 below K (or K itself if K is a power of 2), and n is the number of trailing zeros in the significand of K (i.e., if K is a power of 2, then n=23).
  • This is similar to a simplified form of the div2 algorithm from Dekker (1971): to expand the range (particularly the lower bound), you'll probably have to incorporate more correction terms from this (i.e. store m as the sum of 2 floats, or use a double).
Bellicose answered 16/2, 2016 at 13:46 Comment(4)
I am familiar with the trade-offs with regard to fast reciprocals. Often the combination of a hardware instruction with the appropriate number of Newton-Raphson steps can get a reciprocal that is almost exactly rounded, i.e. maximum error exceedingly close to 0.5 ulps, making this feasible. On other platforms, using a proper division plus the relatively small overhead of a few FMAs is still quite acceptable, performance-wise. I am aware of Dekker's work, but have pretty much used only the addition and multiplication portions of it. I will take another look to see whether div2 is adaptable. - Heartthrob
You're right: fast reciprocal won't make a huge difference because of the correction term. - Bellicose
I took a look at double-float division, and it looks like it requires at least 13 operations. I can save two if I only need a float result. But I need at least 6 more operations to compute a+K and a-K, so this approach would require a minimum of 17 operations vs 7 with my current code. Seems like a fallback of last resort; the performance impact is hard to justify. - Heartthrob
I coded up the approach based on doing all intermediate computation in double-float arithmetic. Unfortunately I needed 11 operations to compute a+K and a-K as two double-float operands. The division of these then takes 11 operations, with only a single reciprocal required, for a total of 22 operations, 15 more than the code in the question, which uses 7 operations. For a quick test I chose the interval [K/128, 128*K) and that works just fine, with maximum error exceedingly close to 0.5 ulp. - Heartthrob

Since my goal is to merely widen the interval on which accurate results are achieved, rather than to find a solution that works for all possible values of a, making use of double-float arithmetic for all intermediate computation seems too costly.

Thinking some more about the problem, it is clear that the computation of the remainder of the division, e in the code from my question, is the crucial part of achieving a more accurate result. Mathematically, the remainder is (a-K) - q * (a+K). In my code, I simply used m to represent (a-K) and represented (a+K) as m + 2*K, as this delivers numerically superior results compared to the straightforward representation.

With relatively small additional computational cost, (a+K) can be represented as a double-float, that is, a head-tail pair p:plo, which leads to the following modified version of my original code:

/* Compute q = (a - K) / (a + K) with improved accuracy. Variant 2 */
m = a - K;
p = a + K;
r = 1.0f / p;
q = m * r;
mx = fmaxf (a, K);      /* operand of larger magnitude */
mn = fminf (a, K);      /* operand of smaller magnitude */
plo = (mx - p) + mn;    /* tail of a + K: p + plo == a + K exactly */
t = fmaf (q, -p, m);    /* residual w.r.t. the head of the divisor */
e = fmaf (q, -plo, t);  /* fold in the tail of the divisor */
q = fmaf (r, e, q);     /* apply the correction */

Testing shows that this delivers nearly correctly rounded results for a in [K/2, 2^24*K), allowing for a substantial increase to the upper bound of the interval on which accurate results are achieved.

Widening the interval at the lower end requires the more accurate representation of (a-K). We can compute this as a double-float head-tail pair m:mlo, which leads to the following code variant:

/* Compute q = (a - K) / (a + K) with improved accuracy. Variant 3 */
m = a - K;
p = a + K;
r = 1.0f / p;
q = m * r;
plo = (a < K) ? ((K - p) + a) : ((a - p) + K);  /* tail of a + K */
mlo = (a < K) ? (a - (K + m)) : ((a - m) - K);  /* tail of a - K */
t = fmaf (q, -p, m);    /* residual w.r.t. the head of the divisor */
e = fmaf (q, -plo, t);  /* fold in the tail of the divisor */
e = e + mlo;            /* fold in the tail of the dividend */
q = fmaf (r, e, q);     /* apply the correction */

Exhaustive testing shows that this delivers nearly correctly rounded results for a in the interval [K/2^24, K*2^24). Unfortunately, this comes at a cost of ten additional operations compared to the code in my question, which is a steep price to pay to get the maximum error down from around 1.625 ulps with the naive computation to near 0.5 ulp.

As in my original code from the question, one can express (a+K) in terms of (a-K), thus eliminating the computation of the tail of p, plo. This approach results in the following code:

/* Compute q = (a - K) / (a + K) with improved accuracy. Variant 4 */
m = a - K;
p = a + K;
r = 1.0f / p;
q = m * r;
mlo = (a < K) ? (a - (K + m)) : ((a - m) - K);  /* tail of a - K */
t = fmaf (q, -2.0f*K, m);
t = fmaf (q, -m, t);           /* m - q * (m + 2K) = m - q * (a + K) */
e = fmaf (q - 1.0f, -mlo, t);  /* add the tail contribution (1 - q) * mlo */
q = fmaf (r, e, q);            /* apply the correction */

This turns out to be advantageous if the main focus is decreasing the lower limit of the interval, which is my particular focus as explained in the question. Exhaustive testing of the single-precision case shows that when K=2^n, nearly correctly rounded results are produced for values of a in the interval [K/2^24, 4.23*K]. With a total of 14 or 15 operations (depending on whether an architecture supports full predication or just conditional moves), this requires seven to eight more operations than my original code.

Lastly, one might base the residual computation directly on the original variable a to avoid the error inherent in the computation of m and p. This leads to the following code that, for K = 2^n, computes nearly correctly rounded results for a in the interval [K/2^24, K/3):

/* Compute q = (a - K) / (a + K) with improved accuracy. Variant 5 */
m = a - K;
p = a + K;
r = 1.0f / p;
q = m * r;
t = fmaf (q + 1.0f, -K, a);  /* a - (q + 1) * K = m - q * K */
e = fmaf (q, -a, t);         /* m - q * (a + K) */
q = fmaf (r, e, q);          /* apply the correction */
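For easier experimentation, Variant 4 (the one most relevant to lowering the interval's lower bound) can be wrapped into a self-contained function like this (a sketch; the function name is arbitrary and K is fixed at compile time, as in my use case):

/* Sketch: Variant 4 as a compilable function; requires fmaf from <math.h> */
#include <math.h>

#define K 2.0f  /* compile-time constant; assumed to be a power of two */

float div_amk_apk (float a)
{
    float m = a - K;
    float p = a + K;
    float r = 1.0f / p;
    float q = m * r;
    float mlo = (a < K) ? (a - (K + m)) : ((a - m) - K); /* tail of a - K */
    float t = fmaf (q, -2.0f*K, m);
    t = fmaf (q, -m, t);
    float e = fmaf (q - 1.0f, -mlo, t);
    return fmaf (r, e, q);
}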
Heartthrob answered 17/2, 2016 at 17:53 Comment(0)

If you can relax the API to return another variable that models the error, then the solution becomes much simpler:

float foo(float a, float k, float *res)
{
    float ret = (a - k) / (a + k);
    *res = fmaf (-ret, a + k, a - k) / (a + k);  /* residual of the division, scaled back */
    return ret;
}

This solution only handles the truncation error of the division, but does not handle the loss of precision in a+k and a-k.

To handle those errors, I think I would need to use double precision, or bit hacks to use fixed point.
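For what it's worth, a minimal sketch of the double-precision route (only attractive where double operations are cheap, which the question notes is not always the case):

/* Sketch: compute the quotient in double, round once to float */
float foo_double (float a, float k)
{
    double ad = (double)a, kd = (double)k;
    return (float)((ad - kd) / (ad + kd));  /* single final rounding */
}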

The test code has been updated to artificially generate non-zero least significant bits in the input.

test code

https://ideone.com/bHxAg8

Amnesia answered 16/2, 2016 at 15:30 Comment(3)
I assume by "another variable that models the error" you mean basically returning the quotient as a head-tail pair (double-float, double-double)? I could easily do that (in my code above that would mean replacing the last line with qlo = r * e), but I don't see how it addresses the issue of rapidly increasing error as the lower interval bound drops below 0.5*K. Divisions are generally expensive on any platform; I would like to avoid having to do two of them. A reciprocal followed by two back-multiplies gives much better performance, so I used that. I will check out your code to explore the details. - Heartthrob
My test framework indicates by exhaustive testing on the interval [0.5*K, 4*K) that the above code computes the quotient (considered as an unevaluated sum ret:res) with a maximum error of just under 1 ulp, which is better than with the naive computation (around 1.62 ulps) but not as good as the code from my question (near 0.5 ulp). I used K = 2 to test, but any power of two should work equally well provided no underflow/overflow occurs. Please let me know if your test results differ materially from mine. - Heartthrob
@Heartthrob No, I agree with your test result. That's why I deleted this answer earlier: I don't think it solves the problem well. - Amnesia

The problem is the addition in (a + K). Any loss of precision in (a + K) is magnified by the division. The problem isn't the division itself.

If the exponents of a and K are the same (almost) no precision is lost, and if the absolute difference between the exponents is greater than the significand size then either (a + K) == a (if a has larger magnitude) or (a + K) == K (if K has larger magnitude).
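A minimal illustration of this absorption effect, assuming IEEE-754 binary32:

/* Sketch: the smaller operand vanishes when the exponents differ by more
   than the significand width (24 bits for float) */
float a = 0x1.0p-30f;  /* far below ulp(K)/2 */
float K = 1.0f;
/* (a + K) == K evaluates to true: the sum retains no trace of a */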

There is no way to prevent this. Increasing the significand size (e.g. using 80-bit "extended double" on 80x86) only widens the "accurate result range" slightly. To understand why, consider smallest + largest (where smallest is the smallest positive denormal that a 32-bit floating-point number can represent). In this case (for 32-bit floats) you'd need a significand size of about 260 bits for the result to avoid precision loss completely. Doing (e.g.) temp = 1/(a + K); result = a * temp - K * temp; won't help much either, because you've still got exactly the same (a + K) problem (but it would avoid a similar problem in (a - K)). Also, you can't do result = anything / p + anything_error / p_error, because division doesn't work like that.

There are only 3 alternatives I can think of to get close to 0.5 ulps for all possible positive values of a that can fit in 32-bit floating point. None are likely to be acceptable.

The first alternative involves pre-computing a lookup table (using "big real number" maths) for every value of a, which (with some tricks) ends up being about 2 GiB for 32-bit floating point (and completely insane for 64-bit floating point). Of course if the range of possible values of a is smaller than "any positive value that can fit in a 32-bit float" the size of the lookup table would be reduced.

The second alternative is to use something else ("big real number") for the calculation at run-time (and convert to/from 32-bit floating point).

The third alternative involves "something" (I don't know what it's called, but it's expensive). Set the rounding mode to "round to positive infinity" and calculate temp1 = (a + K); if(a < K) temp2 = (a - K);, then switch to "round to negative infinity" and calculate if(a >= K) temp2 = (a - K); lower_bound = temp2 / temp1;. Next, set a_lower = a and decrease a_lower by the smallest amount possible, repeat the lower_bound calculation, and keep doing that until you get a different value for lower_bound; then revert to the previous value of a_lower. After that, do essentially the same (but with opposite rounding modes, and incrementing instead of decrementing) to determine upper_bound and a_upper (starting with the original value of a). Finally, interpolate, like a_range = a_upper - a_lower; result = upper_bound * (a_upper - a) / a_range + lower_bound * (a - a_lower) / a_range;. Note that you will want to calculate an initial upper and lower bound and skip all of this if they're equal. Also be warned that this is all "in theory, completely untested" and I probably borked it somewhere.

Mainly what I'm saying is that (in my opinion) you should give up and accept that there's nothing that you can do to get close to 0.5 ulp. Sorry.. :)

Cinereous answered 17/2, 2016 at 4:11 Comment(0)
