I do *not* want correct rounding for function exp

The GCC implementation of the C mathematical library on Debian systems apparently has an IEEE 754-2008-compliant implementation of the function exp, implying that rounding shall always be correct:

(from Wikipedia) The IEEE floating point standard guarantees that add, subtract, multiply, divide, fused multiply–add, square root, and floating point remainder will give the correctly rounded result of the infinite precision operation. No such guarantee was given in the 1985 standard for more complex functions and they are typically only accurate to within the last bit at best. However, the 2008 standard guarantees that conforming implementations will give correctly rounded results which respect the active rounding mode; implementation of the functions, however, is optional.

It turns out that I am encountering a case where this feature is actually a hindrance: the exact result of the exp function is often almost exactly at the middle between two consecutive double values (1), and the computation of exp then has to carry out a great deal of extra work to decide the rounding, losing up to a factor of 400 (!) in speed. This was actually the explanation to my (ill-asked :-S) Question #43530011.

(1) More precisely, this happens when the argument of exp turns out to be of the form (2k + 1) × 2^(-53) with k a rather small integer (like 242 for instance). Since exp(a) ≈ 1 + a for such tiny arguments, the exact result is then just above 1 + (2k + 1) × 2^(-53), which sits almost exactly halfway between the two consecutive doubles 1 + k × 2^(-52) and 1 + (k + 1) × 2^(-52), so the library needs far more precision than usual to decide the rounding. In particular, the computations involved by pow (1. + x, 0.5) tend to call exp with such an argument when x is of the order of magnitude of 2^(-44).

Since implementations of correct rounding can be so time-consuming in certain circumstances, I guess that the developers will also have devised a way to get a slightly less precise result (say, only up to 0.6 ULP or something like this) in a time which is (roughly) bounded for every value of the argument in a given range… (2)

… But how to do this??

(2) What I mean is that I just do not want some exceptional values of the argument, like (2k + 1) × 2^(-53), to be much more time-consuming than most values of the same order of magnitude; but of course I do not mind if some exceptional values of the argument go much faster, or if large arguments (in absolute value) need a larger computation time.

Here is a minimal program showing the phenomenon:

#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#include <time.h>

int main (void)
 {
  int i;
  double a, c;
  c = 0;
  clock_t start = clock ();
  for (i = 0; i < 1e6; ++i) // Doing a large number of times the same type of computation with different values, to smoothen random fluctuations.
   {
    a = (double) (1 + 2 * (rand () % 0x400)) / 0x20000000000000; // "a" has only a few significant digits, and its last non-zero digit is at (fixed-point) position 53.
    c += exp (a); // Just to be sure that the compiler will actually perform the computation of exp (a).
   }
  clock_t stop = clock ();
  printf ("%e\n", c); // Just to be sure that the compiler will actually perform the computation.
  printf ("Clock time spent: %d\n", stop - start);
  return 0;
 }

Now after gcc -std=c99 program53.c -lm -o program53:

$ ./program53
1.000000e+06
Clock time spent: 13470008
$ ./program53 
1.000000e+06
Clock time spent: 13292721
$ ./program53 
1.000000e+06
Clock time spent: 13201616

On the other hand, with program52 and program54 (obtained by replacing 0x20000000000000 with 0x10000000000000 and 0x40000000000000 respectively):

$ ./program52
1.000000e+06
Clock time spent: 83594
$ ./program52
1.000000e+06
Clock time spent: 69095
$ ./program52
1.000000e+06
Clock time spent: 54694
$ ./program54
1.000000e+06
Clock time spent: 86151
$ ./program54
1.000000e+06
Clock time spent: 74209
$ ./program54
1.000000e+06
Clock time spent: 78612

Beware, the phenomenon is implementation-dependent! Apparently, among the common implementations, only those shipped with Debian systems (including Ubuntu) show it.

P.-S.: I hope that my question is not a duplicate: I searched for a similar question thoroughly without success, but maybe I did not use the relevant keywords… :-/

Smothers answered 3/6, 2017 at 16:52 Comment(27)
I would write my own exp_approx, using an existing implementation of exp as a starting point.Milda
Why are you doing pow(1.0 + x, 0.5) rather than the obvious (and fast) sqrt(1.0 + x)?Carr
@EOF: In my actual program 0.5 is actually a parameter input by the user, which may take any value between 0 and 1. By the way, I realized that there was something strange going on when I found that the program speed crashed down completely when I input a parameter of 0.5 rather than 0.49 or 0.51… ;-)Falls
@Nancy-N Could you post a minimal reproducible example? If the problem is specific to a power of 0.5 exactly, you could special-case it with sqrt().Carr
Please edit question instead of adding useful info in comment.Loveinamist
@Loveinamist Oops sorry :-SFalls
When you're taking the exp() of numbers very close to 0, floating-point numbers are problematic, since you get a number close to 1 while all the precision is in the difference to one, so you lose most significant digits. It is more precise (and significantly faster in your testcase) to compute exp(x) - 1 through the C math library function expm1(x). If you really need the exp(), it's still much faster to do expm1(x) + 1. A similar concern exists for computing log(1 + x), for which there is the function log1p().Carr
@EOF Sounds like a good answer to a good question. Please make a nice Q/A pair.Loveinamist
@Loveinamist I'm pretty frustrated with this question. It feels like it could be an interesting question, but the lack of reason behind the testcase (it has next to nothing to do with the original description in the question) makes it difficult to write a comprehensive answer.Carr
@EOF I did not know expm1; indeed it solves the problem in this case, as the value of exp (x) - 1 will not be at the (almost) middle between two consecutive double numbers (though exp (x) is…). So thank you for this suggestion! :-) However I would still be interested by an implementation of exp not looking for correct rounding: first because using exp makes the code easier to understand, but above all because I think that it would be good anyway to be certain that no peculiar value of your argument can crash the speed of your function ;-)Falls
@EOF I do not quite understand what you mean by “the lack of reason behind the testcase” and why you feel it difficult to write a comprehensive answer… Could you be more specific? Personally, the way I see it, there are two distinct points in my question: first, a problem of code speed for one peculiar program, to which your expm1 suggestion answers well; and next, motivated by the first point but having its own interest, looking for an implementation of exp slightly less precise but with decent speed for all the value of its argument.Falls
@Nancy-N Where you see a precise exp() wasting your time with meaningless precision (after all, double has plenty of bits anyway, right?), I see a library function desperately trying to protect the last handful of bits of precision from the ravages of your poorly chosen computation. You can of course stab it in the back and use a less accurate approximation, but don't be surprised if you end up with a final result that is barely in the right order of magnitude.Carr
@Nancy-N See, log(1 + x) is approximately x for abs(x) much smaller than one. Now, if x is pow(2, -44), exp(log(1 + x)) is 1 + x, but of the significant bits of x you have at best 53 - 44 == 9 bits of precision left. That's only 512 different possible results for your inputs in that range. How many more bits can you spare for a faster approximation?Carr
@EOF I see your point; however (which you could not guess) it is not relevant in my actual context. In the program where I encountered the issue, what I do is convoluting some function with another function defined in terms of exp, which leads to making the sum of a handful of terms using exp with quite different arguments. Long story short, in this context I do not mind having a crappy computation of expm1 (x), provided my computation of exp (x) is fair.Falls
… And of course, however clever it may be to use expm1, my main question still remains: have the developers of the mathematical library devised a way to disregard exact rounding; and if yes, what is this way…?Falls
@Nancy-N: And when I asked for a testcase, I'd have liked to see the actual usecase, or something a bit simplified. Something where the part that I "could not have guessed" is represented. It's the "complete" part of minimal reproducible example.Carr
@Nancy-N: To answer the specific question you insist on: Floating-point is hard, and often times counterintuitive. Not every programmer has read what they should have. When libraries used to allow some slightly inaccurate rounding, people complained about the precision of the library function when their inaccurate computations inevitably went wrong and produced nonsense. In response, the library writers made their libraries exactly rounded, so now people can't shift the blame to them.Carr
exp is not provided by GCC, but by your C standard library i.e. libc.so and libm.so. However, read floating-point-gui.deHookah
Does the compilation flag -ffast-math help?Zapateado
@Zapateado No it does not help :-/ I guess this is because the request for correct rounding comes from the mathematical library, not from the compiler itself.Falls
@EOF Thank you for this explanation about the reasons behind the choice of the developers of the mathematical library :-) —or rather of the devisers of the IEEE 754-2008 standard, in this context. Maybe you could combine this comment with the latest comment of NominalAnimal under your answer (the comment saying “Take a look at ACML and Intel MKL…”) into a standard answer, which I would then validate? ;-)Falls
@Nancy-N: I made an edit. Take a look.Carr
@EOF Concerning the actual usecase, honestly there would no point in showing it: - First, because (an excerpt of) the actual program would not be enlightening at all: I had written it to perform numerical investigation on some particular random object which I needed to understand better for some purely mathematical research… (I am a mathematician ;-) ). - Second, because my problem could actually be bypassed by some classical code optimization: as the line of code in which I used the exponential would only call ~100 different arguments, I just pre-computed all the results in a table! ;-)Falls
@Nancy-N: Ah, memoization is something I didn't think of (at least somewhat because it's a bit situational. There are easily enough double values in the range around pow(2, -44) to make memoization inapplicable in general). Either way, it feels like there might be a lot of fun optimizations that could have been discussed here if more of the context was given. I like discussing this kind of thing ;-).Carr
@EOF (continued) … So, it was mainly by curiosity that I eagerly wanted to understand where the problem came from (which I managed to do mainly by myself: it was this requirement for exact rounding). Once I found the cause of the phenomenon, I wondered whether there was a way to be sure that I would never encounter such a problem again, by re#defineing the exp function as something with only 1 ULP precision, but never very slow…Falls
@EOF And just for fun, here is the original line of code which caused me that trouble: noise1 = (1 - pow (1 - exp (x), 2 * H)) / (2 * H); where x was a variable signed number of order of magnitude 16, and H was a “Hurst parameter”, remaining the same during one program run, but likely to take any value between 0 and 1 (excluded). I was running my program for different values of H, and was struck to observe that the case H = 0.25 was running some 10 times slower than all the other values…Falls
@Nancy-N I believe (-expm1(2 * H * log1p(-exp(x)))) / (2 * H) will give very fast (and much more accurate) results for the line you posted ;-)Carr

To answer the general question on why the library functions are required to give correctly rounded results:

Floating-point is hard, and often times counterintuitive. Not every programmer has read what they should have. When libraries used to allow some slightly inaccurate rounding, people complained about the precision of the library function when their inaccurate computations inevitably went wrong and produced nonsense. In response, the library writers made their libraries exactly rounded, so now people cannot shift the blame to them.

In many cases, specific knowledge about floating point algorithms can produce considerable improvements to accuracy and/or performance, like in the testcase:

Taking the exp() of numbers very close to 0 in floating-point numbers is problematic, since the result is a number close to 1 while all the precision is in the difference to one, so most significant digits are lost. It is more precise (and significantly faster in this testcase) to compute exp(x) - 1 through the C math library function expm1(x). If the exp() itself is really needed, it is still much faster to do expm1(x) + 1.

A similar concern exists for computing log(1 + x), for which there is the function log1p(x).
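
A tiny, self-contained illustration of the log1p() point (this snippet is separate from the testcase; the value 1e-19 is just an arbitrary number far below DBL_EPSILON):

#include <stdio.h>
#include <math.h>

int main (void)
{
  double x = 1e-19;                  // far smaller than DBL_EPSILON (~2.2e-16)
  printf ("%.17g\n", log (1.0 + x)); // prints 0: the sum 1.0 + x has already rounded to 1.0
  printf ("%.17g\n", log1p (x));     // prints ~1e-19: the small part survives
  return 0;
}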

A quick fix that speeds up the provided testcase:

#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#include <time.h>

int main (void)
{
  int i;
  double a, c;
  c = 0;
  clock_t start = clock ();
  for (i = 0; i < 1e6; ++i) // Doing a large number of times the same type of computation with different values, to smoothen random fluctuations.
    {
      a = (double) (1 + 2 * (rand () % 0x400)) / 0x20000000000000; // "a" has only a few significant digits, and its last non-zero digit is at (fixed-point) position 53.
      c += expm1 (a) + 1; // replace exp() with expm1() + 1
    }
  clock_t stop = clock ();
  printf ("%e\n", c); // Just to be sure that the compiler will actually perform the computation.
  printf ("Clock time spent: %d\n", stop - start);
  return 0;
}

For this case, the timings on my machine are thus:

Original code

1.000000e+06

Clock time spent: 21543338

Modified code

1.000000e+06

Clock time spent: 55076

Programmers with advanced knowledge about the accompanying trade-offs may sometimes consider using approximate results where the precision is not critical.

An experienced programmer may be able to write an approximate implementation of a slow function using methods like Newton-Raphson iteration or Taylor/Maclaurin polynomials, to use the deliberately inexactly rounded specialty functions from libraries like Intel's MKL or AMD's ACML, to relax the floating-point standard compliance of the compiler, to reduce precision to IEEE 754 binary32 (float), or to combine several of these.
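
As a rough sketch of the hand-rolled option (this is not the algorithm glibc uses, and exp_approx is just an illustrative name): reduce the argument as x = k*ln(2) + r with |r| <= ln(2)/2, approximate exp(r) with a short polynomial, and scale by 2^k. With only the five polynomial terms below, the relative error is on the order of 1e-6, so a serious replacement would use a longer (minimax) polynomial and a finer reduction; the point is only that the running time is the same for every argument, which is what the question asks for:

#include <math.h>

/* Bounded-time approximation of exp(): argument reduction plus a short
   Taylor polynomial.  Roughly 6 significant digits as written; no handling
   of overflow, underflow, NaN or infinities. */
static double exp_approx (double x)
{
  double k = floor (x * 1.4426950408889634 + 0.5); /* nearest integer to x/ln(2) */
  double r = x - k * 0.6931471805599453;           /* reduced argument, |r| <= ~0.347 */
  double p = 1.0 + r * (1.0 + r * (0.5 + r * (1.0/6 + r * (1.0/24 + r/120))));
  return scalbn (p, (int) k);                      /* p * 2^k */
}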

Note that a better description of the problem would enable a better answer.

Carr answered 3/6, 2017 at 19:1 Comment(10)
Excellent suggestion, EOF. I examined a few billion uniformly pseudo-random -0.125 < x < 0.125, and 0 < y < 1. Calculating (1+x)**y using pow(x+1, y) takes median 2.6 times as long as 1.0 + expm1(y * log1p(x)), with at most one unit in the least significant place difference between the two. Interestingly enough, 87.200% of the results (of the two functions) match, and only 12.800% differ.Rochdale
Ditto, "excellent suggestion". I've been programming in C for a long time, and was never aware of those functions, which I could have used once or twice (though not all that often). Good to Know! Thanks.Incandesce
But I'm a little confused... won't expm1(small_number)+1.0 still lose the left-hand term, just like I'd suppose, e.g., (1.e-99) + 1.0 would presumably lose it? I'd think expm1() would mainly be useful when you don't need the +1 term to begin with (e.g., when exp()-1 is what you want).Incandesce
@JohnForkosh: I tried explaining that in the first couple of sentences. If you compute exp(x) for abs(x) much smaller than one, you get a number close to one where all the relevant bits are the difference to one. You can try continuing your computation with exp(x), but your results will have terrible precision. It's much better to rewrite your expressions to use the mathematical value of exp(x)-1, and compute it using expm1(x).Carr
Thanks, @EOF , and I agree with your comment. But note the line in your code c += expm1(a)+1; So that will first evaluate expm1(a) as, let's say, 0.000...0001 = 1.e-whatever (maybe expressed in some "excess" notation by typical fpu's). But then you immediately add 1.0 to that, which would seem to lose the "trailing" .000...0001, whenever it's sufficiently small, in exactly the way discussed. So by the time you get to your c+= part, the expm1() advantage is lost. Please see next comment...Incandesce
...continued. What I'd think you'd have to do is as follows. Inside the loop say only the c+=expm1(a) part. Then after the loop, after all those little numbers have been accumulated, you can maybe say c+=1.e6 (the number of passes through the loop), so that the total of all the little numbers isn't lost. But if that total's still too small, maybe printf() the integer part (number of passes) followed by the small c total.Incandesce
@JohnForkosh You are right. As I wrote, this answer is only a quick and dirty fix for the performance problem, rather than the (badly needed) fix for the abysmal precision of the OP's code. Unfortunately, the testcase doesn't really offer much to work with.Carr
@NominalAnimal I find all of that quite interesting; however, I keep being interested in my original question: have the developers of the mathematical library devised a way to disregard exact rounding; and if yes, what is this way…?Falls
@Nancy-N: There are a slew of compiler options like GCC's -freciprocal-math that do trade precision for speed, but the <math.h> library functions definitely strive for precision and not speed. Intel MKL does have vmdExp() that takes a mode parameter ("high accuracy", "low accuracy", "enhanced performance accuracy"). AMD AMCL has fastexp() (1 ULP precision, 75 clocks for most arguments). So, my answer to that is "Well, no; but there are other libraries, and you can even write your own code, to provide less-precise more performant versions. Take a look at ACML and Intel MKL."Rochdale
@NominalAnimal Thank you, this is exactly the kind of answer I was looking for! :-D On my suggestion, EOF has combined your comment with one of his to make the answer that I have validated ;-)Falls

Regarding your comment to @EOF's answer, the "write your own" remark from @NominalAnimal seems simple enough here, even trivial, as follows.

Your original code above seems to have a max possible argument for exp() of a=(1+2*0x400)/0x2000...≈2.3e-13 (that should really be 2*0x3FF, and I'm counting 13 zeroes after your 0x2000..., which makes it 2x16^13). So that 2.3e-13 max argument is very, very small.

And then the trivial Taylor expansion is exp(a)=1+a+(a^2)/2+(a^3)/6+..., which already gives you all of double's precision for such small arguments. Now, you'll have to discard the 1 part, as explained above, and then that just reduces to expm1(a)=a*(1.+a*(1.+a/3.)/2.). And that should go pretty darn quick! Just make sure a stays small. If it gets a little bigger, just add the next term, a^4/24 (you see how to do that? see the sketch just below).
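
For the record, here's what that extension looks like, with the a^4/24 term folded into the same Horner nesting (the macro name taylorm1_4 is just illustrative, not something used in the code below):

#define taylorm1_4(a) ((a)*(1.+(a)*(1.+(a)*(1.+(a)/4.)/3.)/2.)) /* = a + a^2/2 + a^3/6 + a^4/24 */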

>>EDIT<<

I modified the OP's test program as follows to test a little more stuff (discussion follows code)

/* https://stackoverflow.com/questions/44346371/
   i-do-not-want-correct-rounding-for-function-exp/44397261 */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>
#define BASE 16               /*denominator will be (multiplier)xBASE^EXPON*/
#define EXPON 13
#define taylorm1(a) (a*(1.+a*(1.+a/3.)/2.)) /*expm1() approx for small args*/

int main (int argc, char *argv[]) {
  int N          = (argc>1?atoi(argv[1]):1e6),
      multiplier = (argc>2?atoi(argv[2]):2),
      isexp      = (argc>3?atoi(argv[3]):1); /* flags to turn on/off exp() */
  int isexpm1    = 1;                        /* and expm1() for timing tests*/
  int i, n=0;
  double denom = ((double)multiplier)*pow((double)BASE,(double)EXPON);
  double a, c=0.0, cm1=0.0, tm1=0.0;
  clock_t start = clock();
  n=0;  c=cm1=tm1=0.0;
  /* --- to smooth random fluctuations, do the same type of computation
         a large number of (N) times with different values --- */
  for (i=0; i<N; i++) {
    n++;
    a = (double)(1 + 2*(rand()%0x400)) / denom; /* "a" has only a few
                                 significant digits, and its last non-zero
                                 digit is at (fixed-point) position 53. */
    if ( isexp ) c += exp(a); /* turn this off to time expm1() alone */
    if ( isexpm1 ) {          /* you can turn this off to time exp() alone, */
      cm1 += expm1(a);        /* but difference is negligible */
      tm1 += taylorm1(a); }
    } /* --- end-of-for(i) --- */
  int nticks = (int)(clock()-start);
  printf ("N=%d, denom=%dx%d^%d, Clock time: %d (%.2f secs)\n",
         n, multiplier,BASE,EXPON,
         nticks, ((double)nticks)/((double)CLOCKS_PER_SEC));
  printf ("\t c=%.20e,\n\t c-n=%e, cm1=%e, tm1=%e\n",
           c,c-(double)n,cm1,tm1);
  return 0;
  } /* --- end-of-function main() --- */

Compile and run it as test to reproduce OP's 0x2000... scenario, or run it with (up to three) optional args test  #trials  multiplier  timeexp where #trials defaults to the OP's 1000000, and multiplier defaults to 2 for the OP's 2x16^13 (change it to 4, etc, for her other tests). For the last arg, timeexp, enter a 0 to do only the expm1() (and my unnecessary Taylor-like) calculation. The point of that is to show that the bad-timing-cases displayed by the OP disappear with expm1(), which takes "no time at all" regardless of multiplier.

So default runs, test and test 1000000 4, produce (okay, I called the program rounding)...

bash-4.3$ ./rounding 
N=1000000, denom=2x16^13, Clock time: 11155070 (11.16 secs)
         c=1.00000000000000023283e+06,
         c-n=2.328306e-10, cm1=1.136017e-07, tm1=1.136017e-07
bash-4.3$ ./rounding 1000000 4
N=1000000, denom=4x16^13, Clock time: 200211 (0.20 secs)
         c=1.00000000000000011642e+06,
         c-n=1.164153e-10, cm1=5.680083e-08, tm1=5.680083e-08

So the first thing you'll note is that the OP's c-n using exp() differs substantially from both cm1==tm1 using expm1() and my Taylor approx. If you reduce N they come into agreement, as follows...

N=10, denom=2x16^13, Clock time: 941 (0.00 secs)
         c=1.00000000000007140954e+01,
         c-n=7.140954e-13, cm1=7.127632e-13, tm1=7.127632e-13
bash-4.3$ ./rounding 100
N=100, denom=2x16^13, Clock time: 5506 (0.01 secs)
         c=1.00000000000010103918e+02,
         c-n=1.010392e-11, cm1=1.008393e-11, tm1=1.008393e-11
bash-4.3$ ./rounding 1000
N=1000, denom=2x16^13, Clock time: 44196 (0.04 secs)
         c=1.00000000000011345946e+03,
         c-n=1.134595e-10, cm1=1.140730e-10, tm1=1.140730e-10
bash-4.3$ ./rounding 10000
N=10000, denom=2x16^13, Clock time: 227215 (0.23 secs)
         c=1.00000000000002328306e+04,
         c-n=2.328306e-10, cm1=1.131288e-09, tm1=1.131288e-09
bash-4.3$ ./rounding 100000
N=100000, denom=2x16^13, Clock time: 1206348 (1.21 secs)
         c=1.00000000000000232831e+05,
         c-n=2.328306e-10, cm1=1.133611e-08, tm1=1.133611e-08

And as far as timing of exp() versus expm1() is concerned, see for yourself...

bash-4.3$ ./rounding 1000000 2  
N=1000000, denom=2x16^13, Clock time: 11168388 (11.17 secs)
         c=1.00000000000000023283e+06,
         c-n=2.328306e-10, cm1=1.136017e-07, tm1=1.136017e-07
bash-4.3$ ./rounding 1000000 2 0
N=1000000, denom=2x16^13, Clock time: 24064 (0.02 secs)
         c=0.00000000000000000000e+00,
         c-n=-1.000000e+06, cm1=1.136017e-07, tm1=1.136017e-07

Question: you'll note that once the exp() calculation reaches N=10000 trials, its sum remains constant regardless of larger N. Not sure why that would be happening.

>>SECOND EDIT<<

Okay, @EOF , "you made me look" with your "hierarchical accumulation" comment. And that indeed works to bring the exp() sum closer (much closer) to the (presumably correct) expm1() sum. The modified code's immediately below followed by a discussion. But one discussion note here: recall multiplier from above. That's gone, and in its same place is expon so that denominator is now 2^expon where the default is 53, matching OP's default (and I believe better matching how she was thinking about it). Okay, and here's the code...

/* https://stackoverflow.com/questions/44346371/
   i-do-not-want-correct-rounding-for-function-exp/44397261 */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>
#define BASE 2                /*denominator=2^EXPON, 2^53=2x16^13 default */
#define EXPON 53
#define taylorm1(a) (a*(1.+a*(1.+a/3.)/2.)) /*expm1() approx for small args*/

int main (int argc, char *argv[]) {
  int N          = (argc>1?atoi(argv[1]):1e6),
      expon      = (argc>2?atoi(argv[2]):EXPON),
      isexp      = (argc>3?atoi(argv[3]):1), /* flags to turn on/off exp() */
      ncparts    = (argc>4?atoi(argv[4]):1), /* #partial sums for c */
      binsize    = (argc>5?atoi(argv[5]):10);/* #doubles to sum in each bin */
  int isexpm1    = 1;                        /* and expm1() for timing tests*/
  int i, n=0;
  double denom = pow((double)BASE,(double)expon);
  double a, c=0.0, cm1=0.0, tm1=0.0;
  double csums[10], cbins[10][65537]; /* c partial sums and hierarchy */
  int nbins[10], ibin=0;      /* start at lowest level */
  clock_t start = clock();
  n=0;  c=cm1=tm1=0.0;
  if ( ncparts > 65536 ) ncparts=65536;  /* array size check */
  if ( ncparts > 1 ) for(i=0;i<ncparts;i++) cbins[0][i]=0.0; /*init bin#0*/
  /* --- to smooth random fluctuations, do the same type of computation
         a large number of (N) times with different values --- */
  for (i=0; i<N; i++) {
    n++;
    a = (double)(1 + 2*(rand()%0x400)) / denom; /* "a" has only a few
                                 significant digits, and its last non-zero
                                 digit is at (fixed-point) position 53. */
    if ( isexp ) {            /* turn this off to time expm1() alone */
      double expa = exp(a);   /* exp(a) */
      c += expa;              /* just accumulate in a single "bin" */
      if ( ncparts > 1 ) cbins[0][n%ncparts] += expa; } /* accum in ncparts */
    if ( isexpm1 ) {          /* you can turn this off to time exp() alone, */
      cm1 += expm1(a);        /* but difference is negligible */
      tm1 += taylorm1(a); }
    } /* --- end-of-for(i) --- */
  int nticks = (int)(clock()-start);
  if ( ncparts > 1 ) {        /* need to sum the partial-sum bins */
    nbins[ibin=0] = ncparts;  /* lowest-level has everything */
    while ( nbins[ibin] > binsize ) { /* need another hierarchy level */
      if ( ibin >= 9 ) break; /* no more bins */
      ibin++;                 /* next available hierarchy bin level */
      nbins[ibin] = (nbins[ibin-1]+(binsize-1))/binsize; /*#bins this level*/
      for(i=0;i<nbins[ibin];i++) cbins[ibin][i]=0.0; /* init bins */
      for(i=0;i<nbins[ibin-1];i++) {
        cbins[ibin][(i+1)%nbins[ibin]] += cbins[ibin-1][i]; /*accum in nbins*/
        csums[ibin-1] += cbins[ibin-1][i]; } /* accumulate in "one bin" */
      } /* --- end-of-while(nprevbins>binsize) --- */
    for(i=0;i<nbins[ibin];i++) csums[ibin] += cbins[ibin][i]; /*highest level*/
    } /* --- end-of-if(ncparts>1) --- */
  printf ("N=%d, denom=%d^%d, Clock time: %d (%.2f secs)\n", n, BASE,expon,
         nticks, ((double)nticks)/((double)CLOCKS_PER_SEC));
  printf ("\t c=%.20e,\n\t c-n=%e, cm1=%e, tm1=%e\n",
           c,c-(double)n,cm1,tm1);
  if ( ncparts > 1 ) { printf("\t binsize=%d...\n",binsize);
    for (i=0;i<=ibin;i++) /* display hierarchy */
      printf("\t level#%d: #bins=%5d, c-n=%e\n",
      i,nbins[i],csums[i]-(double)n); }
  return 0;
  } /* --- end-of-function main() --- */

Okay, and now you can notice two additional command-line args following the old timeexp. They are ncparts for the initial number of bins into which the entire #trials will be distributed. So at the lowest level of the hierarchy, each bin should (modulo bugs:) have the sum of #trials/ncparts doubles. The argument after that is binsize, which will be the number of doubles summed in each bin at every successive level, until the last level has no more #bins than binsize. So here's an example dividing 1000000 trials into 50000 bins, meaning 20 doubles/bin at the lowest level, and 5 doubles/bin thereafter...

bash-4.3$ ./rounding 1000000 53 1 50000 5 
N=1000000, denom=2^53, Clock time: 11129803 (11.13 secs)
         c=1.00000000000000465661e+06,
         c-n=4.656613e-09, cm1=1.136017e-07, tm1=1.136017e-07
         binsize=5...
         level#0: #bins=50000, c-n=4.656613e-09
         level#1: #bins=10002, c-n=1.734588e-08
         level#2: #bins= 2002, c-n=7.974450e-08
         level#3: #bins=  402, c-n=1.059379e-07
         level#4: #bins=   82, c-n=1.133885e-07
         level#5: #bins=   18, c-n=1.136214e-07
         level#6: #bins=    5, c-n=1.138542e-07

Note how the c-n for exp() converges pretty nicely towards the expm1() value. But note how it's best at level#5, and isn't converging uniformly at all. And note if you break the #trials into only 5000 initial bins, you get just as good a result,

bash-4.3$ ./rounding 1000000 53 1 5000 5
N=1000000, denom=2^53, Clock time: 11165924 (11.17 secs)
         c=1.00000000000003527384e+06,
         c-n=3.527384e-08, cm1=1.136017e-07, tm1=1.136017e-07
         binsize=5...
         level#0: #bins= 5000, c-n=3.527384e-08
         level#1: #bins= 1002, c-n=1.164153e-07
         level#2: #bins=  202, c-n=1.158332e-07
         level#3: #bins=   42, c-n=1.136214e-07
         level#4: #bins=   10, c-n=1.137378e-07
         level#5: #bins=    4, c-n=1.136214e-07

In fact, playing with ncparts and binsize doesn't seem to show much sensitivity, and it's not always "more is better" (i.e., less for binsize) either. So I'm not sure exactly what's going on. Could be a bug (or two), or could be yet another question for @EOF ...???

>>EDIT -- example showing pair addition "binary tree" hierarchy<<

Example below added as per @EOF 's comment (Note: re-copy preceding code. I had to edit nbins[ibin] calculation for each next level to nbins[ibin]=(nbins[ibin-1]+(binsize-1))/binsize; from nbins[ibin]=(nbins[ibin-1]+2*binsize)/binsize; which was "too conservative" to create ...16,8,4,2 sequence)

bash-4.3$ ./rounding 1024 53 1 512 2
N=1024, denom=2^53, Clock time: 36750 (0.04 secs)
         c=1.02400000000011573320e+03,
         c-n=1.157332e-10, cm1=1.164226e-10, tm1=1.164226e-10
         binsize=2...
         level#0: #bins=  512, c-n=1.159606e-10
         level#1: #bins=  256, c-n=1.166427e-10
         level#2: #bins=  128, c-n=1.166427e-10
         level#3: #bins=   64, c-n=1.161879e-10
         level#4: #bins=   32, c-n=1.166427e-10
         level#5: #bins=   16, c-n=1.166427e-10
         level#6: #bins=    8, c-n=1.166427e-10
         level#7: #bins=    4, c-n=1.166427e-10
         level#8: #bins=    2, c-n=1.164153e-10

>>EDIT -- to show @EOF's elegant solution in comment below<<

"Pair addition" can be elegantly accomplished recursively, as per @EOF's comment below, which I'm reproducing here. (Note case 0/1 at end-of-recursion to handle n even/odd.)

  /* Quoting from EOF's comment...
   What I (EOF) proposed is effectively a binary tree of additions:
   a+b+c+d+e+f+g+h as ((a+b)+(c+d))+((e+f)+(g+h)).
   Like this: Add adjacent pairs of elements, this produces
   a new sequence of n/2 elements.
   Recurse until only one element is left.
   (Note that this will require n/2 elements of storage,
   rather than a fixed number of bins like your implementation) */
  double trecu(double *vals, double sum, int n) {
      int midn = n/2;
      switch (n) {
        case  0: break;
        case  1: sum += *vals; break;
        default: sum = trecu(vals+midn, trecu(vals,sum,midn), n-midn); break; }
      return(sum);
      } 
Incandesce answered 6/6, 2017 at 18:26 Comment(17)
You might want to name Horner's rule if you propose it. Also, I'm pretty sure expm1() already uses something like this, so implementing it manually is probably not going to be of much use, especially since the OP appears to want a more generally applicable function than expm1(). Taylor or Maclaurin polynomials are a possibility in general, but you'd have to be careful with the radius of convergence.Carr
Agreed. Often it is looking at the underlying formula, rather than any single operation, that yields best rewards. For example, in molecular dynamics, classical potential models tend to look pretty much horrific, so they tend to be approximated using many easy cubic splines instead. Another example I often see is first computing some value of type r = hard(x,y,...) / alsohard(x,y,...); and then proceed to test r >= 0.0 && r <= 1.0. At minimum, you can trade the divison for a conditional sign check (of alsohard(x,y,...)).Rochdale
@NominalAnimal Okay, right, I was (incorrectly, it seems) focusing on your particular comment to Nancy-N above, "strive for precision and not speed". And I mistakenly concluded she was just trying to get a quick-and-accurate-enough approximation for her small args. Re-reading her original post, I see that doesn't seem to be her primary intent. (So I'd "delete" this answer, but wouldn't want to lose both your and EOF's interesting followup comments.)Incandesce
@EOF Thanks for the "Horner's rule" reference. It rang a (somewhat distant) bell when I read it, so I've probably come across it in school or on-some-job afterwards, but I wasn't aware that's what I was doing until I googled it. Anyway, this particular case is so trivial that it barely (if at all) merits "rule". And, yeah, as mentioned to NominalAnimal above, I don't seem to have addressed her principal concern. P.S. You sure expm1() uses something like that? I haven't tested it, but assume it works for larger args, where you couldn't just truncate the series after a few terms.Incandesce
(double)multiplier)*pow((double)BASE,(double)EXPON) all the casts here are redundant. multiplier*pow(BASE, EXPON) is much more readableNigrosine
The answer to the question at the end of your answer is a direct consequence of floating-point arithmetic: You accumulate over a sequence of 1.0 + x, where abs(x) is much smaller than 1. Now, of you multiplied instead, you'd keep most of the significant bits of x in (1.0 + x)*n - n =approx= n*x. But since you accumulate linearly (nextpartialsum = previouspartialsum + (1+xi), at some point previouspartialsum is so much larger than 1+xi that you lose all of the bits of xi. Eventually, you'd lose the significant bits of 1 too btw., but that takes a lot more summing.Carr
Summing over floating-point sequences is non-trivial. You can instead sum hierarchically (sum over pairs, then pairs of pairs, ...), do Kahan summation (a short sketch appears after this comment thread) or use a few other schemes.Carr
@EOF Thanks for the explanation, none of which (including Kahan summation) I was previously aware of, but maybe should have been. I haven't come across much 1+epsilon-type stuff in this particular kind of situation, but en.wikipedia.org/wiki/Kahan_summation_algorithm is something I might have used on several occasions if I'd known about it. (P.S. looks like you were also right about expm1(), judging from its negligible timing.)Incandesce
@LưuVĩnhPhúc Don't worry, doesn't have to be readable -- nobody's ever going to be reading it:). But you might wanna be careful about that pow(BASE,EXPON) part. Passing int arguments to a function expecting doubles might get you into trouble. Not sure if maybe the compiler knows all about its mlib functions, and will clean up your mess for you, but am pretty sure it's not a good habit to get into.Incandesce
@JohnForkosh there's no problem whatsoever because int is always promoted to double in that caseNigrosine
@EOF I've added a second edit demonstrating code to exercise your "sum hierarchically" comment. Seems to work pretty well, as noted in the edit. As also noted there, I didn't exhaustively debug it to QA standards, i.e., as per gpl, no "warranty of merchantability":)Incandesce
@JohnForkosh: Your code is now completely unreadable. Also, you're not doing what I proposed. You are effectively computing n partial sums and adding them in the end. This is better than making a single linear sum, but not much. What I proposed is effectively a binary tree of additions: a+b+c+d+e+f+g+h as ((a+b)+(c+d))+((e+f)+(g+h)). Like this: Add adjacent pairs of elements, this produces a new sequence of n/2 elements. Recurse until only one element is left. Note that this will require n/2 elements of storage, rather than a fixed number of bins like your implementation.Carr
But if you're careful about scheduling your additions, you might get away with O(log(n)) extra storage.Carr
@EOF Sorry about "unreadable". But it can do what you proposed, and I added an edit illustrating that. Problem is that #trials has to be very small to accommodate available static buffers. Of course I could do what you suggested, or devise a better scheme along my lines (e.g., really only need two buffer levels that can be "swapped"). But the whole thing's just a throw-away test/demo, which I don't want to turn into a project. Just wanted to get a more concrete handle on the effect(s) of your very interesting suggestions.Incandesce
@JohnForkosh Why not simply static double trecu(double const *vals, double accu, size_t n) { size_t half = n/2; switch (n) { case 0: return accu; case 1: return accu + *vals; default: return trecu(vals + half, trecu(vals, accu, half), n - half); } }, to be called initially with accu = 0? Of course you can replace the double *vals accesses with function calls, but it's a manual change due to the lack of ranges/iterators in C.Carr
@EOF Yeah, that's terrific!!! I added yet another edit to cut-and-paste your method. I guess I should've picked up on that from your earlier comment, comprising its "detailed design".Incandesce
@EOF Uh-oh!!!:) I built your trecu() into the test program, and it does the sum, but it's not doing the "binary tree" type of summation. Just a "plain sum". I wrote a whole separate answer immediately below this one with a shorter (and hopefully less "unreadable":) test program to explicitly exercise trecu() and illustrate its behavior. And that indeed reproduces what I saw with this test program based on the OP's code. Please take a look when you have a chance, and let me ("us" if anybody else is still following:) know what you think. Thanks.Incandesce
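
For reference, here is a minimal sketch of the Kahan (compensated) summation mentioned in the comments above; it is not used in any of the answers' code, and kahan_sum is just an illustrative name:

#include <stddef.h>

/* Kahan summation: carry a small correction term so that the low-order bits
   of each 1+epsilon summand are not lost against the large running total. */
double kahan_sum (const double *vals, size_t n)
{
  double sum = 0.0, comp = 0.0;   /* comp holds the running rounding error */
  for (size_t i = 0; i < n; i++)
    {
      double y = vals[i] - comp;  /* apply the correction from the previous step */
      double t = sum + y;         /* big + small: low-order bits of y get dropped */
      comp = (t - sum) - y;       /* recover (algebraically) what was just dropped */
      sum = t;
    }
  return sum;
}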

This is an "answer"/followup to EOF's preceding comments re his trecu() algorithm and code for his "binary tree summation" suggestion. "Prerequisites" before reading this are reading that discussion. It would be nice to collect all that in one organized place, but I haven't done that yet...

...What I did do was build EOF's trecu() into the test program from the preceding answer that I'd written by modifying the OP's original test program. But then I found that trecu() generated exactly (and I mean exactly) the same answer as the "plain sum" c using exp(), not the sum cm1 using expm1() that we'd expected from a more accurate binary tree summation.

But that test program's a bit (maybe two bits:) "convoluted" (or, as EOF said, "unreadable"), so I wrote a separate smaller test program, given below (with example runs and discussion below that), to separately test/exercise trecu(). Moreover, I also wrote function bintreesum() into the code below, which abstracts/encapsulates the iterative code for binary tree summation that I'd embedded into the preceding test program. In that preceding case, my iterative code indeed came close to the cm1 answer, which is why I'd expected EOF's recursive trecu() to do the same. Long-and-short of it is that, below, same thing happens -- bintreesum() remains close to correct answer, while trecu() gets further away, exactly reproducing the "plain sum".

What we're summing below is just sum(i), i=1...n, which is just the well-known n(n+1)/2. But that's not quite right -- to reproduce OP's problem, the summand is not i alone but rather 1 + i*10^e, where the (negative) exponent e can be given on the command-line. So for, say, n=5, you don't get 15 but rather 5.000...00015, or for n=6 you get 6.000...00021, etc. And to avoid a long, long format, I printf() sum-n to remove that integer part. Okay??? So here's the code...

/* Quoting from EOF's comment...
   What I (EOF) proposed is effectively a binary tree of additions:
   a+b+c+d+e+f+g+h as ((a+b)+(c+d))+((e+f)+(g+h)).
   Like this: Add adjacent pairs of elements, this produces
   a new sequence of n/2 elements.
   Recurse until only one element is left. */
#include <stdio.h>
#include <stdlib.h>

double trecu(double *vals, double sum, int n) {
  int midn = n/2;
  switch (n) {
    case  0: break;
    case  1: sum += *vals; break;
    default: sum = trecu(vals+midn, trecu(vals,sum,midn), n-midn); break; }
  return(sum);
  } /* --- end-of-function trecu() --- */

double bintreesum(double *vals, int n, int binsize) {
  double binsum = 0.0;
  int nbin0 = (n+(binsize-1))/binsize,
      nbin1 = (nbin0+(binsize-1))/binsize,
      nbins[2] = { nbin0, nbin1 };
  double *vbins[2] = {
            (double *)malloc(nbin0*sizeof(double)),
            (double *)malloc(nbin1*sizeof(double)) },
         *vbin0=vbins[0], *vbin1=vbins[1];
  int ibin=0, i;
  for ( i=0; i<nbin0; i++ ) vbin0[i] = 0.0;
  for ( i=0; i<n; i++ ) vbin0[i%nbin0] += vals[i];
  while ( nbins[ibin] > 1 ) {
    int jbin = 1-ibin;        /* other bin, 0<-->1 */
    nbins[jbin] = (nbins[ibin]+(binsize-1))/binsize;
    for ( i=0; i<nbins[jbin]; i++ ) vbins[jbin][i] = 0.0;
    for ( i=0; i<nbins[ibin]; i++ )
      vbins[jbin][i%nbins[jbin]] += vbins[ibin][i];
    ibin = jbin;              /* swap bins for next pass */
    } /* --- end-of-while(nbins[ibin]>0) --- */
  binsum = vbins[ibin][0];
  free((void *)vbins[0]);  free((void *)vbins[1]);
  return ( binsum );
  } /* --- end-of-function bintreesum() --- */

#if defined(TESTTRECU)
#include <math.h>
#define MAXN (2000000)
int main(int argc, char *argv[]) {
  int N       = (argc>1? atoi(argv[1]) : 1000000 ),
      e       = (argc>2? atoi(argv[2]) : -10 ),
      binsize = (argc>3? atoi(argv[3]) : 2 );
  double tens = pow(10.0,(double)e);
  double *vals = (double *)malloc(sizeof(double)*MAXN),
         sum = 0.0;
  double trecu(), bintreesum();
  int i;
  if ( N > MAXN ) N=MAXN;
  for ( i=0; i<N; i++ ) vals[i] = 1.0 + tens*(double)(i+1);
  for ( i=0; i<N; i++ ) sum += vals[i];
  printf(" N=%d, Sum_i=1^N {1.0 + i*%.1e} - N  =  %.8e,\n"
         "\t plain_sum-N  = %.8e,\n"
         "\t trecu-N      = %.8e,\n"
         "\t bintreesum-N = %.8e \n",
         N, tens, tens*((double)N)*((double)(N+1))/2.0,
          sum-(double)N,
         trecu(vals,0.0,N)-(double)N,
         bintreesum(vals,N,binsize)-(double)N );
  } /* --- end-of-function main() --- */
#endif

So if you save that as trecu.c, then compile it as cc -DTESTTRECU trecu.c -lm -o trecu, and then run it with zero to three optional command-line args as trecu #trials e binsize. Defaults are #trials=1000000 (like OP's program), e=-10, and binsize=2 (for my bintreesum() function to do a binary-tree sum rather than larger-size bins).

And here are some test results illustrating the problem described above,

bash-4.3$ ./trecu              
 N=1000000, Sum_i=1^N {1.0 + i*1.0e-10} - N  =  5.00000500e+01,
         plain_sum-N  = 5.00000500e+01,
         trecu-N      = 5.00000500e+01,
         bintreesum-N = 5.00000500e+01 
bash-4.3$ ./trecu 1000000 -15
 N=1000000, Sum_i=1^N {1.0 + i*1.0e-15} - N  =  5.00000500e-04,
         plain_sum-N  = 5.01087168e-04,
         trecu-N      = 5.01087168e-04,
         bintreesum-N = 5.00000548e-04 
bash-4.3$ 
bash-4.3$ ./trecu 1000000 -16
 N=1000000, Sum_i=1^N {1.0 + i*1.0e-16} - N  =  5.00000500e-05,
         plain_sum-N  = 6.67552231e-05,
         trecu-N      = 6.67552231e-05,
         bintreesum-N = 5.00001479e-05 
bash-4.3$ 
bash-4.3$ ./trecu 1000000 -17
 N=1000000, Sum_i=1^N {1.0 + i*1.0e-17} - N  =  5.00000500e-06,
         plain_sum-N  = 0.00000000e+00,
         trecu-N      = 0.00000000e+00,
         bintreesum-N = 4.99992166e-06 

So you can see that for the default run, e=-10, everybody's doing everything right. That is, the top line that says "Sum" just does the n(n+1)/2 thing, so presumably displays the right answer. And everybody below that agrees for the default e=-10 test case. But for the e=-15 and e=-16 cases below that, trecu() exactly agrees with the plain_sum, while bintreesum stays pretty close to the right answer. And finally, for e=-17, plain_sum and trecu() have "disappeared", while bintreesum()'s still hanging in there pretty well.

So trecu()'s correctly doing the sum all right, but its recursion's apparently not doing that "binary tree" type of thing that my more straightforward iterative bintreesum()'s apparently doing correctly. And that indeed demonstrates that EOF's suggestion for "binary tree summation" realizes quite an improvement over the plain_sum for these 1+epsilon kind of cases. So we'd really like to see his trecu() recursion work!!! When I originally looked at it, I thought it did work. But that double-recursion (is there a special name for that?) in his default: case is apparently more confusing (at least to me:) than I thought. Like I said, it is doing the sum, but not the "binary tree" thing.

Okay, so who'd like to take on the challenge and explain what's going on in that trecu() recursion? And, maybe more importantly, fix it so it does what's intended. Thanks.

Incandesce answered 8/6, 2017 at 6:34 Comment(2)
Yes, I was trying to be too clever. I mistakenly thought that the function had to be tail-recursive to ensure O(log(n)) stack space to avoid stack overflow. Actually, the even simpler function double foo(double const *vals, size_t n) { if (n < 2) return n ? *vals : 0; size_t half = n/2; return foo(vals, half) + foo(vals + half, n - half); } does the pairwise sum correctly, while using reasonable stack space (I believe it is O(log(n)) ). Anyway, we're hijacking this question, so I'll stop here.Carr
@EOF Thanks, that indeed works as advertised. And, as you say, it's easy to read, too.Incandesce
