In various contexts, for example in the argument reduction for mathematical functions, one needs to compute (a - K) / (a + K), where a is a positive variable argument and K is a constant. In many cases, K is a power of two, which is the use case relevant to my work. I am looking for efficient ways to compute this quotient more accurately than can be accomplished with the straightforward division. Hardware support for fused multiply-add (FMA) can be assumed, as this operation is provided by all major CPU and GPU architectures at this time, and is available in C/C++ via the functions fma() and fmaf().
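For reference, the straightforward computation I refer to below as the naive method is just the direct division; a minimal sketch, with a helper name chosen purely for illustration:

/* Naive computation of (a - K) / (a + K); the function name is illustrative only. */
static float naive_quotient (float a, float K)
{
    /* a - K and a + K may each incur a rounding error, then the division rounds again */
    return (a - K) / (a + K);
}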
For ease of exploration, I am experimenting with float arithmetic. Since I plan to port the approach to double arithmetic as well, no operations using higher than the native precision of both argument and result may be used. My best solution so far is:
/* Compute q = (a - K) / (a + K) with improved accuracy. Variant 1 */
m = a - K;                  // numerator
p = a + K;                  // denominator
r = 1.0f / p;               // approximate reciprocal of the denominator
q = m * r;                  // approximate quotient
t = fmaf (q, -2.0f*K, m);   // t = m - 2*K*q
e = fmaf (q, -m, t);        // e = t - q*m = m - q*(2*K + m), the division residual
q = fmaf (r, e, q);         // refine: q = q + r*e
For arguments a in the interval [K/2, 4.23*K], the code above computes the quotient almost correctly rounded for all inputs (maximum error is exceedingly close to 0.5 ulps), provided that K is a power of 2 and there is no overflow or underflow in intermediate results. For K not a power of two, this code is still more accurate than the naive algorithm based on division. In terms of performance, this code can be faster than the naive approach on platforms where the floating-point reciprocal can be computed faster than the floating-point division.
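The error figures above come from exhaustive testing over the interval. The following is a minimal sketch of how such a test might look; the helper names are my own, and the double-precision reference is not the exact mathematical result, but it carries 29 extra significand bits, which is ample for distinguishing errors near 0.5 ulp in float. Adjusting lo and hi reproduces the behavior at wider bounds described next.

#include <math.h>
#include <stdio.h>

/* Error of res in ulps, measured against a double-precision reference.
   (Sketch only; compile with strict FP semantics, e.g. -ffp-contract=off.) */
static double ulp_error (float res, double ref)
{
    float  mag = fabsf ((float)ref);                      /* magnitude that sets the ulp size */
    double ulp = (double)(nextafterf (mag, INFINITY) - mag);
    return fabs ((double)res - ref) / ulp;
}

int main (void)
{
    const float K  = 2.0f;                    /* power-of-two constant */
    const float lo = 0.5f * K, hi = 4.0f * K; /* interval under test   */
    double max_naive = 0.0, max_improved = 0.0;

    for (float a = lo; a <= hi; a = nextafterf (a, INFINITY)) {
        double ref = ((double)a - K) / ((double)a + K);
        float  qn  = (a - K) / (a + K);       /* naive                 */
        float  m = a - K, p = a + K;          /* Variant 1 from above  */
        float  r = 1.0f / p;
        float  q = m * r;
        float  t = fmaf (q, -2.0f*K, m);
        float  e = fmaf (q, -m, t);
        q = fmaf (r, e, q);
        double en = ulp_error (qn, ref), ei = ulp_error (q, ref);
        if (en > max_naive)    max_naive    = en;
        if (ei > max_improved) max_improved = ei;
    }
    printf ("max error: naive %.4f ulps, improved %.4f ulps\n", max_naive, max_improved);
    return 0;
}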
I make the following observation when K = 2^n: when the upper bound of the work interval increases to 8*K, 16*K, ..., the maximum error increases gradually and slowly approaches the maximum error of the naive computation from below. Unfortunately, the same does not appear to be true for the lower bound of the interval. If the lower bound drops to 0.25*K, the maximum error of the improved method above equals the maximum error of the naive method.
Is there a method to compute q = (a - K) / (a + K) that can achieve a smaller maximum error (measured in ulps against the mathematical result) than both the naive method and the code sequence above, over a wider interval, in particular for intervals whose lower bound is less than 0.5*K? Efficiency is important, but a few more operations than are used in the code above can likely be tolerated.
In one answer below, it was pointed out that I could enhance accuracy by returning the quotient as an unevaluated sum of two operands, that is, as a head-tail pair q:qlo, similar to the well-known double-float and double-double formats. In my code above, this would mean changing the last line to qlo = r * e.
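Concretely, the head-tail variant could be packaged like this; a sketch only, with the function name and interface being my own choice:

#include <math.h>

/* Variant 1 with the last step changed as described: the quotient is returned
   as an unevaluated head:tail pair (*q, *qlo) instead of folding the
   correction back into q. Name and signature are illustrative only. */
static void quotient_head_tail (float a, float K, float *q, float *qlo)
{
    float m  = a - K;
    float p  = a + K;
    float r  = 1.0f / p;
    float qh = m * r;                  /* head: approximate quotient        */
    float t  = fmaf (qh, -2.0f*K, m);
    float e  = fmaf (qh, -m, t);       /* residual m - qh*(2*K + m)         */
    *q   = qh;
    *qlo = r * e;                      /* tail: correction term, kept apart */
}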
This approach is certainly useful, and I had already contemplated its use for an extended-precision logarithm for use in pow(). But it doesn't fundamentally help with the desired widening of the interval on which the enhanced computation provides more accurate quotients. In a particular case I am looking at, I would like to use K=2 (for single precision) or K=4 (for double precision) to keep the primary approximation interval narrow, and the interval for a is roughly [0, 28]. The practical problem I am facing is that for arguments < 0.25*K the accuracy of the improved division is not substantially better than with the naive method.
Comments:

(a / (a + k)) - (k / (a + k))? – Sydney

a is near K. – Heartthrob

double operations are much more expensive (as much as 32 times as expensive as float operations). Since I also want to use the same algorithm for double, there are no cheap "quadruple" operations one can use there. Therefore the requirement for only using "native" width operations (which also makes vectorization easier). – Heartthrob

a has an expected error of 0.5 ulp to begin with, and K might, too. – Gilberte

Rewriting [(a-K)/(a+K) to a/(a+k) - k/(a+k)] will cause [error] to explode - how is the problem of cancellation different after and before the rewrite? – Gilberte

a and K (in the context of this computation). By simple exhaustive test one can easily show that with K=2**n, the maximum ulp error of the naive computation is < 1.625 ulps, but after the proposed rewrite the maximum error is in the 100,000s of ulps. Does your own testing show different results? For proper testing, one needs to ensure no higher precision creeps into the intermediate computation. I am using the /fp:strict setting of my compiler, and also mark variables volatile. – Heartthrob
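As an aside on the testing precaution mentioned in the last comment, one way to keep excess precision out of the measurement is to declare the intermediates volatile, which forces every result to be stored to (and therefore rounded as) a float; the sketch below is my own illustration, not the exact test code referred to above:

#include <math.h>

/* Variant 1 with volatile intermediates: each assignment stores the value as a
   float, so results cannot be carried in wider registers between operations.
   Illustration only; combine with a strict FP mode such as /fp:strict. */
static float variant1_strict (float a, float K)
{
    volatile float m, p, r, q, t, e;
    m = a - K;
    p = a + K;
    r = 1.0f / p;
    q = m * r;
    t = fmaf (q, -2.0f*K, m);
    e = fmaf (q, -m, t);
    q = fmaf (r, e, q);
    return q;
}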