The inverse hyperbolic function asinh() is closely related to the natural logarithm. I am trying to determine the most accurate way to compute asinh() from the C99 standard math function log1p(). For ease of experimentation, I am limiting myself to IEEE-754 single-precision computation right now, that is, I am looking at asinhf() and log1pf(). I intend to re-use the exact same algorithm for double-precision computation, i.e. asinh() and log1p(), later.
My primary goal is to minimize ulp error; the secondary goal is to minimize the number of incorrectly rounded results, under the constraint that the improved code would be at most minimally slower than the versions posted below. Any incremental improvement to accuracy, say 0.2 ulp, would be welcome. Adding a couple of FMAs (fused multiply-adds) would be fine; on the other hand, I am hoping someone could identify a solution which employs a fast rsqrtf() (reciprocal square root).
The resulting C99 code should lend itself to vectorization, possibly by some minor straightforward transformations. All intermediate computation must occur at the precision of the function argument and result, as any switch to higher precision may have a severe negative performance impact. The code must work correctly both with IEEE-754 denormal support and in FTZ (flush to zero) mode.
So far, I have identified the following two candidate implementations. Note that the code may easily be transformed into a branchless, vectorizable version with a single call to log1pf(), but I have not done so at this stage to avoid unnecessary obfuscation.
/* for a >= 0, asinh(a) = log (a + sqrt (a*a+1))
                        = log1p (a + (sqrt (a*a+1) - 1))
                        = log1p (a + sqrt1pm1 (a*a))
                        = log1p (a + (a*a / (1 + sqrt (a*a+1))))
                        = log1p (a + a * (a / (1 + sqrt (a*a+1))))
                        = log1p (fma (a / (1 + sqrt (a*a+1)), a, a))
                        = log1p (fma (1 / (1/a + sqrt (1/(a*a)+1)), a, a))
*/
float my_asinhf (float a)
{
float fa, t;
fa = fabsf (a);
#if !USE_RECIPROCAL
if (fa >= 0x1.0p64f) { // prevent overflow in intermediate computation
t = log1pf (fa) + 0x1.62e430p-1f; // log(2)
} else {
t = fmaf (fa / (1.0f + sqrtf (fmaf (fa, fa, 1.0f))), fa, fa);
t = log1pf (t);
}
#else // USE_RECIPROCAL
if (fa > 0x1.0p126f) { // prevent underflow in intermediate computation
t = log1pf (fa) + 0x1.62e430p-1f; // log(2)
} else {
t = 1.0f / fa;
t = fmaf (1.0f / (t + sqrtf (fmaf (t, t, 1.0f))), fa, fa);
t = log1pf (t);
}
#endif // USE_RECIPROCAL
return copysignf (t, a); // restore sign
}
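For reference, the branchless transformation alluded to above might look like the following sketch for the USE_RECIPROCAL = 0 variant. This is only meant to show the intended restructuring (the select operations should map to vector blends); it is untested and I have not verified that it preserves the error statistics quoted below. my_asinhf_branchless is a placeholder name.

#include <math.h>

/* untested sketch: branchless variant with a single log1pf() call */
float my_asinhf_branchless (float a)
{
    const float big = 0x1.0p64f;  // same overflow threshold as above
    float fa = fabsf (a);
    float x = fminf (fa, big);    // clamp so fma (x, x, 1.0f) cannot overflow
    float t = fmaf (x / (1.0f + sqrtf (fmaf (x, x, 1.0f))), x, x);
    float arg = (fa < big) ? t : fa;           // select argument for log1pf()
    float r = log1pf (arg);
    r = (fa < big) ? r : (r + 0x1.62e430p-1f); // add log(2) on the large path
    return copysignf (r, a);                   // restore sign
}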
With a particular log1pf() implementation that is accurate to < 0.6 ulp, I am observing the following error statistics when testing exhaustively across all 2^32 possible IEEE-754 single-precision inputs. With USE_RECIPROCAL = 0, the maximum error is 1.49486 ulp, and there are 353,587,822 incorrectly rounded results. With USE_RECIPROCAL = 1, the maximum error is 1.50805 ulp, and there are only 77,569,390 incorrectly rounded results.
In terms of performance, the variant USE_RECIPROCAL = 0 will be faster if reciprocals and full divisions take roughly the same amount of time, but the variant USE_RECIPROCAL = 1 could be faster if very fast reciprocal support is available.
Answers can assume that all basic arithmetic, including FMA (fused multiply-add), is correctly rounded according to IEEE-754 round-to-nearest-or-even mode. In addition, faster, nearly correctly rounded versions of reciprocal and rsqrtf() may be available, where "nearly correctly rounded" means the maximum ulp error will be limited to something like 0.53 ulp and the overwhelming majority of results, say > 95%, are correctly rounded. Basic arithmetic with directed roundings may be available at no additional cost to performance.
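For concreteness, counts of incorrectly rounded results like those above can be gathered with an exhaustive scan over all 2^32 single-precision bit patterns, along the lines of the sketch below. Note that the sketch uses double-precision asinh() rounded to float as a stand-in reference; because of double rounding this is only a proxy and is not guaranteed to deliver the correctly rounded single-precision result for every input, so exact counts require a genuinely correctly rounded reference.

#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <math.h>

float my_asinhf (float a); /* as defined above */

int main (void)
{
    uint64_t wrong = 0;
    uint32_t i = 0;
    do {
        float x;
        memcpy (&x, &i, sizeof x);  // reinterpret bit pattern as float
        if (!isnan (x)) {
            float res = my_asinhf (x);
            float ref = (float) asinh ((double) x);  // proxy reference only
            if (res != ref) wrong++;
        }
    } while (++i != 0);  // wraps to 0 after covering all 2^32 patterns
    printf ("incorrectly rounded results: %llu\n", (unsigned long long) wrong);
    return 0;
}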
Drop float and move to double asap, or even long double if your compiler supports 80-bit real number representation. – Hangout

double is more native than float, which is now only suitable for some very restricted systems like embedded. – Hangout

float operations can have significantly higher throughput than double operations, and on many platforms including x86 with AVX there is no convenient way to use any type wider than double. – Kristine

Then use double for better accuracy. – Hangout

My target platform has double support, but it may be up to 32x slower than float, so not practical given the need for high performance. In addition, I am using the float version as an experimental platform for the double version, and there is no hardware support for any precision beyond double. – Kristine

Do you want double or do you want float? Do you want speed or do you want accuracy? You said, in the question, that you will be moving to double. My first comment stands. – Hangout

The question is about the float case, but I am intending to re-use the same algorithm for double later, as I mentioned at the start of the question. – Kristine

my_asinhf() using #if !USE_RECIPROCAL also has a worst-case error of about 1 ULP at a == 7.517228127e-01 and an average error of 1/8 ULP. I do not think a more accurate answer is feasible with the approaches used. Perhaps a minimax polynomial? – Flowerpot

Could accuracy improve if sqrt or rsqrt (a contracting operation) is used late, rather than early, in the computation? Can FMA be used more effectively than it currently is to improve accuracy? Is there a way to implement sqrt1pm1 that would help accuracy (I have no practical experience with that function)? – Kristine

Answers may assume the availability of a fast reciprocal and rsqrt. Getting to a faithfully-rounded implementation (< 1 ulp) would be great. My current thinking is that shooting for any bound tighter than that would definitely mean negatively impacting performance, which I want to avoid. – Kristine

One way to improve the accuracy of the #if !USE_RECIPROCAL variant is to compute the input to log1pf() as a double-float arg_hi, arg_lo, then compute asinh() as fmaf (arg_lo, 1.0f/(1.0f+arg_hi), log1pf (arg_hi)). When using a correctly rounded log1pf(), this reduces the error in asinhf() to 1 ulp. This is not practical from a performance perspective because it requires either higher precision or expensive double-float computation for the argument transformation. – Kristine

Why reduce computing asinhf to computing log1pf? – Danieldaniela

That is one well-known way to compute asinh(), and it happens to be the one I am currently exploring in detail. The attractiveness of such a general approach is that one can build implementations for various math functions from just a few building blocks, with inverse hyperbolic functions in particular being mapped back to log1p(). Since the difficult arguments for asinh() are all small, one could use a minimax polynomial for those, but that would be yet another code branch (not advantageous if the goal is a vectorizable implementation), and worthy of a separate question. – Kristine
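For illustration, the double-float argument transformation described in the comments above might be sketched as follows for the non-reciprocal variant. This is a reading of that comment, not the actual code behind the 1-ulp figure: my_asinhf_df is a hypothetical name, the overflow guard for large arguments is omitted for brevity, and as noted the approach is not practical performance-wise.

#include <math.h>

/* untested sketch: compute the log1pf() argument t = fa + q*fa as a
   double-float arg_hi + arg_lo, then correct the final result using
   log1p(x + e) ~= log1p(x) + e/(1+x) */
float my_asinhf_df (float a)
{
    float fa = fabsf (a);
    float q = fa / (1.0f + sqrtf (fmaf (fa, fa, 1.0f)));
    /* twoProdFMA: q*fa = p_hi + p_lo exactly */
    float p_hi = q * fa;
    float p_lo = fmaf (q, fa, -p_hi);
    /* Fast2Sum: fa + p_hi = s_hi + s_lo exactly; valid because q < 1
       guarantees p_hi <= fa */
    float s_hi = fa + p_hi;
    float s_lo = p_hi - (s_hi - fa);
    float arg_hi = s_hi;
    float arg_lo = s_lo + p_lo; // fold product's low part into sum's low part
    float r = fmaf (arg_lo, 1.0f / (1.0f + arg_hi), log1pf (arg_hi));
    return copysignf (r, a);    // restore sign
}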