In which order should floats be added to get the most precise result?

This was a question I was asked at a recent interview, and I want to know the answer (I don't actually remember the theory of numerical analysis, so please help me :)

If we have some function, which accumulates floating-point numbers:

std::accumulate(v.begin(), v.end(), 0.0);

v is a std::vector<float>, for example.

  • Would it be better to sort these numbers before accumulating them?

  • Which order would give the most precise answer?

I suspect that sorting the numbers in ascending order would actually reduce the numerical error, but unfortunately I can't prove it myself.

P.S. I do realize this probably has nothing to do with real world programming, just being curious.

Introgression answered 14/7, 2011 at 19:44 Comment(12)
This actually has everything to do with real-world programming. However, many applications don't really CARE about the absolute best accuracy of the calculation as long as it's 'pretty close'. Engineering applications? Extremely important. Medical applications? Extremely important. Large-scale statistics? Somewhat less accuracy is acceptable.Latency
I think this qualifies as programming related so yes, it’s welcome here.Unfold
Please don't answer unless you actually know and can point to a page that explains your reasoning in detail. There is already so much crap about floating point numbers flying around; we don't want to add to it. If you think you know, STOP, because if you only think you know then you are probably wrong.Switchback
@Zéychin "Engineering applications? Extremely important. Medical applications? Extremely important."??? I think you would be surprised if you knew the truth :)Welldressed
@VJo Alright, to be fair, not all Engineering and Medical applications need as much accuracy as possible. Many do. If a coordinate of a computer-assisted medical tool is off by a few hundredths of a radian, for example, there could be some severe problems.Latency
@Zeychin Absolute error is irrelevant. What is important is relative error. If a few hundredths of a radian is 0.001%, then who cares?Welldressed
@Martin: guess we better ask Steve and Andrew to delete their answers then... no supporting pages cited ;-PStagecoach
This question can actually be extended to any language; this is more of a CPU issue than of a language issue.Sacristy
@Tony: appeal to an authority is one way to do it, "things every programmer needs to know about floating point" or what-have-you. I'm not trying to solve the general problem, "how best to add up floating point numbers", this question is quite specifically looking at the effects of sorting. I'm not sure that an authoritative reference would concern itself with that, since sorting is inadequate to solve the general problem, so it's a bit of a curiosity once you have a killer case to "disprove" it. But if anyone has any good links it certainly wouldn't do any harm to produce them :-)Sideling
@Zéychin no application needs "as much accuracy as possible" because it is possible to use arbitrary precision algorithms running on supercomputer clusters to calculate results accurate to thousands of bits. Every application needs a certain level of accuracy and no more. There is no point trying to work out doses of drugs to more accuracy than can be manufactured, for example.Tarragona
Perhaps I should have been more verbose: As much accuracy as possible given storage constraints/application requirements. If one has a 64-bit floating point number and the order of operations causes the answers to be significantly flawed, then a rearranged algorithm should be considered, or what is the point of using the extended precision if it's being wasted?Latency
I really recommend this reading: "What Every Computer Scientist Should Know About Floating-Point Arithmetic" perso.ens-lyon.fr/jean-michel.muller/goldberg.pdfAutobus

Your instinct is basically right, sorting in ascending order (of magnitude) usually improves things somewhat. Consider the case where we're adding single-precision (32 bit) floats, and there are 1 billion values equal to 1 / (1 billion), and one value equal to 1. If the 1 comes first, then the sum will come to 1, since 1 + (1 / 1 billion) is 1 due to loss of precision. Each addition has no effect at all on the total.

If the small values come first, they will at least sum to something, although even then I have 2^30 of them, whereas after 2^25 or so I'm back in the situation where each one individually isn't affecting the total any more. So I'm still going to need more tricks.

That's an extreme case, but in general adding two values of similar magnitude is more accurate than adding two values of very different magnitudes, since you "discard" fewer bits of precision in the smaller value that way. By sorting the numbers, you group values of similar magnitude together, and by adding them in ascending order you give the small values a "chance" of cumulatively reaching the magnitude of the bigger numbers.
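
Applied to the question's std::accumulate call, the heuristic amounts to sorting by absolute value first. A minimal sketch (the 0.0f init keeps the accumulator in single precision, which is what the extreme example above assumes; the question's 0.0 would actually accumulate in double):

#include <algorithm>
#include <cmath>
#include <numeric>
#include <vector>

// minimal sketch: sort by ascending magnitude, then accumulate
float sum_ascending_magnitude(std::vector<float> v)    // by value: we sort a copy
{
    std::sort(v.begin(), v.end(),
              [](float a, float b) { return std::fabs(a) < std::fabs(b); });
    return std::accumulate(v.begin(), v.end(), 0.0f);  // 0.0f: single-precision accumulator
}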

Still, if negative numbers are involved it's easy to "outwit" this approach. Consider three values to sum, {1, -1, 1 billionth}. The arithmetically correct sum is 1 billionth, but if my first addition involves the tiny value then my final sum will be 0. Of the 6 possible orders, only 2 are "correct" - {1, -1, 1 billionth} and {-1, 1, 1 billionth}. All 6 orders give results that are accurate at the scale of the largest-magnitude value in the input (0.0000001% out), but for 4 of them the result is inaccurate at the scale of the true solution (100% out). The particular problem you're solving will tell you whether the former is good enough or not.

In fact, you can play a lot more tricks than just adding them in sorted order. If you have lots of very small values, a middling number of middling values, and a small number of large values, then it might be most accurate to first add up all the small ones, then separately total the middling ones, add those two totals together, and then add the large ones. It's not at all trivial to find the most accurate combination of floating-point additions, but to cope with really bad cases you can keep a whole array of running totals at different magnitudes, add each new value to the total that best matches its magnitude, and when a running total starts to get too big for its magnitude, add it into the next total up and start a new one. Taken to its logical extreme, this process is equivalent to performing the sum in an arbitrary-precision type (so you'd do that). But given the simplistic choice of adding in ascending or descending order of magnitude, ascending is the better bet.

It does have some relation to real-world programming, since there are some cases where your calculation can go very badly wrong if you accidentally chop off a "heavy" tail consisting of a large number of values each of which is too small to individually affect the sum, or if you throw away too much precision from a lot of small values that individually only affect the last few bits of the sum. In cases where the tail is negligible anyway, you probably don't care: for example, if you're only adding together a small number of values in the first place and you're only using a few significant figures of the sum.

Sideling answered 14/7, 2011 at 19:49 Comment(20)
+1 for explanation. This is somewhat counter-intuitive since addition is usually numerically stable (unlike subtraction and division).Unfold
@Konrad, it may be numerically stable, but it isn't precise given different magnitudes of operands :)Geek
What if the array contains also -1?Godson
@6502: they're sorted in order of magnitude, so the -1 comes at the end. If the true value of the total is of magnitude 1, then that's fine. If you're adding together three values: 1 / billion, 1 and -1, then, you'd get 0, at which point you have to answer the interesting practical question - do you need an answer that's accurate at the scale of the true sum, or do you only need an answer that's accurate at the scale of the largest values? For some practical applications, the latter is good enough, but when it isn't you need a more sophisticated approach. Quantum physics uses renormalization.Sideling
If you're going to stick with this simple scheme, I would always add the two numbers with the lowest magnitude and reinsert the sum in the set. (Well, probably a merge sort would work best here. You could use the portion of the array containing the previously summed numbers as a working area for the partial sums.)Zionism
Very informative post. I thought I knew about floating point, but this never occurred to me, although it makes perfect sense. Must add this to my article on floats. <g>Garrotte
@Rudy: note though that I got my example wrong on the first attempt, despite having the luxury of choosing the numbers myself...Sideling
Nice answer. Would you mind explaining the relationship between the magnitude of the number and the rounding error? I seem to recall that there are many more floating point numbers crammed in between 0 and 1, but there are fewer numbers as you go higher towards +MAX_FLOAT, and therefore the rounding error would be larger for those larger numbers.Bainite
I tried out the example above. My conclusions are too long for a comment, so I have added an answer...Brushwork
@Kevin Panko: The simple version is that a single-precision float has 24 binary digits, the greatest of which is the largest set bit in the number. So if you add together two numbers that differ in magnitude by more than 2^24, you suffer total loss of the smaller value, and if they differ in magnitude by a smaller degree then you lose a corresponding number of bits of accuracy of the smaller number.Sideling
@Steve: I didn't really look at the code. I found the idea of adding many small numbers to slowly get into the realm of the larger ones a good advice. Of course that only works if there are many small ones, but still, solid general advice.Garrotte
Just thought of something: instead of ordering in order of magnitude, simply sort the numbers. Then split the numbers at 0: add the positive numbers from 0 on, add the negative numbers also from 0 on (i.e. backward), and only then add the two results you obtained. This avoids subtracting two almost equal numbers (or adding a positive and a negative value of almost same magnitude) and obtaining a very small value as intermediate result. IOW, the way you sort the numbers in order of magnitude makes the chance two numbers of equal magnitude but different sign are added rather big.Garrotte
@Steve With just positive numbers, it's really easy, since you'll only need the original sorted list and a second queue (which you may be allowed to carve out of the original list), but I'm sure it can be done with numbers of mixed signs too.Zionism
@Rudy actually adding numbers of mixed signs is harder than I thought, just imagine your numbers were MAX_ODD_FLOAT + 1, 1.25, and -0.875, then you can get three different answers depending on the order in which you add them...Zionism
@Rudy: that falls somewhere on the sliding scale from "rubbish" to "use a bigfloat library and set the precision to 2^11 + 52". My reduction of 6502's case, {1 billionth, 1, -1} still kills it, since the positive sum is 1, the negative sum is -1, hence the total is 0. It's all a trade-off of working space and time vs accuracy, and what kinds of nasty inputs you have to deal with.Sideling
@Steve: sure, but the tiny number is killed anyway, when there are nearly as many big numbers as there are tiny ones. My gut feeling tells me that you should only introduce sign at the last moment. I'd have to think about it a little longer to produce proof, though. Hey, I'm a dentist, not a mathematician. <g>Garrotte
@Rudy: I don't know if you can fairly say that it's "killed anyway": if by some combination of cleverness and luck we add the values in order {1, -1, 1 billionth} then we're fine. For that case, of the 6 possible orders 2 are good and 4 are bad. So the question becomes, which orders are we "allowed" to consider? The question points to ascending or descending, which I've brazenly changed to ascending/descending magnitude, you've suggested a scheme for two partial sums, Daniel Pryden another scheme with two running values -- it's all heuristics until we just throw super-precision at it.Sideling
@Steve: I guess you're right. Fact is that sorting does make a difference and your explanation made that very clear.Garrotte
@Jerry: yep, that's what Neil said, although he didn't specify a priority queue. It doesn't maximize accuracy, it's killed by {1 billionth, 1, -1}, a case constructed specifically to have the property that if your first addition involves the tiny value, you lose.Sideling
@Steve: sorry, I didn't look back quite far enough, and missed that.Hypercriticism

There is also an algorithm designed for this kind of accumulation operation, called Kahan Summation, that you should probably be aware of.

According to Wikipedia,

The Kahan summation algorithm (also known as compensated summation) significantly reduces the numerical error in the total obtained by adding a sequence of finite precision floating point numbers, compared to the obvious approach. This is done by keeping a separate running compensation (a variable to accumulate small errors).

In pseudocode, the algorithm is:

function kahanSum(input)
    var sum = input[1]
    var c = 0.0             // A running compensation for lost low-order bits.
    for i = 2 to input.length
        y = input[i] - c    // So far, so good: c is zero.
        t = sum + y         // Alas, sum is big, y small, so low-order digits of y are lost.
        c = (t - sum) - y   // (t - sum) recovers the high-order part of y; subtracting y recovers -(low part of y).
        sum = t             // Algebraically, c should always be zero. Beware eagerly optimising compilers!
    next i                  // Next time around, the lost low part will be added to y in a fresh attempt.
    return sum
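
A direct C++ translation of that pseudocode might look like the sketch below (it starts the sum at zero rather than input[1], which is equivalent; note that flags such as -ffast-math can let the compiler optimise the compensation away):

#include <vector>

double kahan_sum(const std::vector<double>& input)
{
    double sum = 0.0;
    double c   = 0.0;                 // running compensation for lost low-order bits
    for (double x : input)
    {
        double y = x - c;             // subtract the error recovered last time
        double t = sum + y;           // low-order digits of y may be lost here...
        c = (t - sum) - y;            // ...but this recovers (minus) what was lost
        sum = t;
    }
    return sum;
}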
Dimidiate answered 14/7, 2011 at 20:21 Comment(5)
+1 lovely addition to this thread. Any compiler that "eagerly optimizes" those statements should be banned.Gavrila
It's a simple method to almost double the precision, by using two summation variables sum and c of differing magnitude. It can be trivially extended to N variables.Skirt
@ChrisA. well you can explicitly control this on all compilers that count (e.g. via -ffast-math on GCC).Unfold
@Konrad Rudolph thanks for pointing out that this is a possible optimization with -ffast-math. What I learned from this discussion and this link, is that if you care about numerical accuracy you should probably avoid using -ffast-math but that in many applications where you may be CPU-bound but don't care about precise numerical computations, (game programming for instance), -ffast-math is reasonable to use. Thus, I'd like to amend my strongly worded "banned" comment.Gavrila
Using double precision variables for sum, c, t, y will help. You also need to add sum -= c before the return sum.Coray

I tried out the extreme example in the answer supplied by Steve Jessop.

#include <iostream>
#include <iomanip>
#include <cmath>

int main()
{
    long billion = 1000000000;
    double big = 1.0;
    double small = 1e-9;
    double expected = 2.0;

    double sum = big;
    for (long i = 0; i < billion; ++i)
        sum += small;
    std::cout << std::scientific << std::setprecision(1) << big << " + " << billion << " * " << small << " = " <<
        std::fixed << std::setprecision(15) << sum <<
        "    (difference = " << std::fabs(expected - sum) << ")" << std::endl;

    sum = 0;
    for (long i = 0; i < billion; ++i)
        sum += small;
    sum += big;
    std::cout  << std::scientific << std::setprecision(1) << billion << " * " << small << " + " << big << " = " <<
        std::fixed << std::setprecision(15) << sum <<
        "    (difference = " << std::fabs(expected - sum) << ")" << std::endl;

    return 0;
}

I got the following result:

1.0e+00 + 1000000000 * 1.0e-09 = 2.000000082740371    (difference = 0.000000082740371)
1000000000 * 1.0e-09 + 1.0e+00 = 1.999999992539933    (difference = 0.000000007460067)

The error in the first line is more than ten times bigger than in the second.

If I change the doubles to floats in the code above, I get:

1.0e+00 + 1000000000 * 1.0e-09 = 1.000000000000000    (difference = 1.000000000000000)
1000000000 * 1.0e-09 + 1.0e+00 = 1.031250000000000    (difference = 0.968750000000000)

Neither answer is even close to 2.0 (but the second is slightly closer).

Using the Kahan summation (with doubles) as described by Daniel Pryden:

#include <iostream>
#include <iomanip>
#include <cmath>

int main()
{
    long billion = 1000000000;
    double big = 1.0;
    double small = 1e-9;
    double expected = 2.0;

    double sum = big;
    double c = 0.0;
    for (long i = 0; i < billion; ++i) {
        double y = small - c;
        double t = sum + y;
        c = (t - sum) - y;
        sum = t;
    }

    std::cout << "Kahan sum  = " << std::fixed << std::setprecision(15) << sum <<
        "    (difference = " << std::fabs(expected - sum) << ")" << std::endl;

    return 0;
}

I get exactly 2.0:

Kahan sum  = 2.000000000000000    (difference = 0.000000000000000)

And even if I change the doubles to floats in the code above, I get:

Kahan sum  = 2.000000000000000    (difference = 0.000000000000000)

It would seem that Kahan is the way to go!

Brushwork answered 14/7, 2011 at 20:41 Comment(3)
My "big" value is equal to 1, not 1e9. Your second answer, added in increasing size order, is mathematically correct (1 billion, plus a billion billionths, is 1 billion and 1), although more by luck than any general soundness of the method :-) Note that double doesn't suffer bad loss of precision in adding together a billion billionths, since it has 52 significant bits, whereas IEEE float only has 24 and would.Sideling
@Steve, my error, apologies. I have updated the example code to what you intended.Brushwork
Kahan still has limited precision, but to construct a killer case you need both the main sum and the error accumulator c to contain values much bigger than the next summand. This means the summand is much, much smaller than the main sum, so there are going to have to be an awful lot of them to add up to much. Especially with double arithmetic.Sideling

There is a class of algorithms that solve this exact problem, without the need to sort or otherwise re-order the data.

In other words, the summation can be done in one pass over the data. This also makes such algorithms applicable in situations where the dataset is not known in advance, e.g. if the data arrives in real time and the running sum needs to be maintained.

Here is the abstract of a recent paper:

We present a novel, online algorithm for exact summation of a stream of floating-point numbers. By “online” we mean that the algorithm needs to see only one input at a time, and can take an arbitrary length input stream of such inputs while requiring only constant memory. By “exact” we mean that the sum of the internal array of our algorithm is exactly equal to the sum of all the inputs, and the returned result is the correctly-rounded sum. The proof of correctness is valid for all inputs (including nonnormalized numbers but modulo intermediate overflow), and is independent of the number of summands or the condition number of the sum. The algorithm asymptotically needs only 5 FLOPs per summand, and due to instruction-level parallelism runs only about 2--3 times slower than the obvious, fast-but-dumb “ordinary recursive summation” loop when the number of summands is greater than 10,000. Thus, to our knowledge, it is the fastest, most accurate, and most memory efficient among known algorithms. Indeed, it is difficult to see how a faster algorithm or one requiring significantly fewer FLOPs could exist without hardware improvements. An application for a large number of summands is provided.

Source: Algorithm 908: Online Exact Summation of Floating-Point Streams.

Diphtheria answered 15/7, 2011 at 10:21 Comment(1)
@Inverse: There are still brick-and-mortar libraries around. Alternatively, buying the PDF online costs $5-$15 (depending on whether you're an ACM member). Lastly, DeepDyve seem to be offering to lend the paper for 24 hours for $2.99 (if you're new to DeepDyve, you might even be able to get it for free as part of their free trial): deepdyve.com/lp/acm/…Diphtheria

Building on Steve's answer of first sorting the numbers in ascending order, I'd introduce two more ideas:

  1. Decide on the difference in exponents between two numbers above which you would lose too much precision.

  2. Then add the numbers up in order until the exponent of the accumulator is too large for the next number, then put the accumulator onto a temporary queue and start the accumulator with the next number. Continue until you exhaust the original list.

You repeat the process with the temporary queue (having sorted it) and with a possibly larger difference in exponent.
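
In code, the scheme might look roughly like the sketch below (the helper name and the max_exponent_gap threshold are illustrative, and std::frexp is used to read the binary exponent):

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

double staged_sum(std::vector<double> values, int max_exponent_gap = 20)
{
    while (values.size() > 1)
    {
        std::sort(values.begin(), values.end(),
                  [](double a, double b) { return std::fabs(a) < std::fabs(b); });
        std::vector<double> parked;                 // the temporary queue of partial sums
        double acc = values[0];
        for (std::size_t i = 1; i < values.size(); ++i)
        {
            int acc_exp = 0, next_exp = 0;
            std::frexp(acc, &acc_exp);
            std::frexp(values[i], &next_exp);
            if (acc_exp - next_exp > max_exponent_gap)
            {
                parked.push_back(acc);              // accumulator too large for the next number
                acc = values[i];                    // restart with the next number
            }
            else
            {
                acc += values[i];
            }
        }
        parked.push_back(acc);
        values = std::move(parked);                 // repeat over the partial sums
        // (the text above suggests possibly widening the exponent gap on later passes)
    }
    return values.empty() ? 0.0 : values.front();
}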

I think this will be quite slow if you have to calculate exponents all the time.

I had a quick go with a program and the result was 1.99903

Frump answered 14/7, 2011 at 22:12 Comment(0)

I think you can do better than sorting the numbers before you accumulate them, because during the process of accumulation, the accumulator gets bigger and bigger. If you have a large number of similar values, you will start to lose precision quickly. Here is what I would suggest instead:

while the list has multiple elements
    remove the two smallest elements from the list
    add them and put the result back in
the single element in the list is the result

Of course this algorithm will be most efficient with a priority queue instead of a list. C++ code:

template <typename Queue>
void reduce(Queue& queue)
{
    typedef typename Queue::value_type vt;
    while (queue.size() > 1)
    {
        vt x = queue.top();
        queue.pop();
        vt y = queue.top();
        queue.pop();
        queue.push(x + y);
    }
}

driver:

#include <iterator>
#include <queue>

template <typename Iterator>
typename std::iterator_traits<Iterator>::value_type
reduce(Iterator begin, Iterator end)
{
    typedef typename std::iterator_traits<Iterator>::value_type vt;
    std::priority_queue<vt> positive_queue;
    positive_queue.push(0);
    std::priority_queue<vt> negative_queue;
    negative_queue.push(0);
    for (; begin != end; ++begin)
    {
        vt x = *begin;
        if (x < 0)
        {
            negative_queue.push(x);
        }
        else
        {
            positive_queue.push(-x);
        }
    }
    reduce(positive_queue);
    reduce(negative_queue);
    return negative_queue.top() - positive_queue.top();
}

The numbers in the queue are negative because top yields the largest number, but we want the smallest. I could have provided more template arguments to the queue, but this approach seems simpler.

Blaspheme answered 15/7, 2011 at 7:28 Comment(0)

This doesn't quite answer your question, but a clever thing to do is to run the sum twice, once with rounding mode "round up" and once with "round down". Compare the two answers, and you know /how/ inaccurate your results are, and whether you therefore need to use a cleverer summing strategy. Unfortunately, most languages don't make changing the floating point rounding mode as easy as it should be, because people don't know that it's actually useful in everyday calculations.
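
In C++ this can be sketched with <cfenv>, though support is implementation-dependent and the compiler must not be allowed to reorder or constant-fold the floating-point work (the function name here is illustrative):

#include <cfenv>
#include <numeric>
#include <utility>
#include <vector>

#pragma STDC FENV_ACCESS ON    // many compilers ignore this pragma; check your implementation

// run the sum once rounding down and once rounding up; the true sum lies in between
std::pair<double, double> sum_bounds(const std::vector<double>& v)
{
    const int old_mode = std::fegetround();

    std::fesetround(FE_DOWNWARD);
    const double lower = std::accumulate(v.begin(), v.end(), 0.0);

    std::fesetround(FE_UPWARD);
    const double upper = std::accumulate(v.begin(), v.end(), 0.0);

    std::fesetround(old_mode);     // restore the caller's rounding mode
    return {lower, upper};
}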

Take a look at Interval arithmetic where you do all maths like this, keeping highest and lowest values as you go. It leads to some interesting results and optimisations.

Tarragona answered 16/7, 2011 at 0:3 Comment(0)

The simplest sort that improves accuracy is to sort by ascending absolute value. That lets the smallest-magnitude values have a chance to accumulate or cancel before interacting with larger-magnitude values that would trigger a loss of precision.

That said, you can do better by tracking multiple non-overlapping partial sums. Here is a paper describing the technique and presenting a proof-of-accuracy: www-2.cs.cmu.edu/afs/cs/project/quake/public/papers/robust-arithmetic.ps

That algorithm and other approaches to exact floating point summation are implemented in simple Python at: http://code.activestate.com/recipes/393090/ At least two of those can be trivially converted to C++.
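
The non-overlapping-partials recipe from that page ("msum") translates to C++ roughly as below; this is only a sketch, and it relies on strict IEEE 754 round-to-nearest arithmetic, so it must not be compiled with -ffast-math:

#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

double msum(const std::vector<double>& xs)
{
    std::vector<double> partials;               // non-overlapping partial sums, increasing magnitude
    for (double x : xs)
    {
        std::size_t used = 0;
        for (std::size_t j = 0; j < partials.size(); ++j)
        {
            double y = partials[j];
            if (std::fabs(x) < std::fabs(y)) std::swap(x, y);
            double hi = x + y;
            double lo = y - (hi - x);           // exact rounding error of that addition
            if (lo != 0.0) partials[used++] = lo;
            x = hi;
        }
        partials.resize(used);
        partials.push_back(x);                  // largest-magnitude partial goes last
    }
    double total = 0.0;
    for (double p : partials) total += p;       // add the smallest partials first
    return total;
}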

Monobasic answered 25/5, 2013 at 23:44 Comment(0)

For IEEE 754 single or double precision or known format numbers, another alternative is to use an array of numbers (passed by caller, or in a class for C++) indexed by the exponent. When adding numbers into the array, only numbers with the same exponent are added (until an empty slot is found and the number stored). When a sum is called for, the array is summed from smallest to largest to minimize truncation. Single precision example:

/* clear array */
void clearsum(float asum[256])
{
size_t i;
    for(i = 0; i < 256; i++)
        asum[i] = 0.f;
}

/* add a number into array */
void addtosum(float f, float asum[256])
{
size_t i;
    while(1){
        /* i = exponent of f */
        i = ((size_t)((*(unsigned int *)&f)>>23))&0xff;
        if(i == 0xff){          /* max exponent, could be overflow */
            asum[i] += f;
            return;
        }
        if(asum[i] == 0.f){     /* if empty slot store f */
            asum[i] = f;
            return;
        }
        f += asum[i];           /* else add slot to f, clear slot */
        asum[i] = 0.f;          /* and continue until empty slot */
    }
}

/* return sum from array */
float returnsum(float asum[256])
{
float sum = 0.f;
size_t i;
    for(i = 0; i < 256; i++)
        sum += asum[i];
    return sum;
}

Double precision example:

/* clear array */
void clearsum(double asum[2048])
{
size_t i;
    for(i = 0; i < 2048; i++)
        asum[i] = 0.;
}

/* add a number into array */
void addtosum(double d, double asum[2048])
{
size_t i;
    while(1){
        /* i = exponent of d */
        i = ((size_t)((*(unsigned long long *)&d)>>52))&0x7ff;
        if(i == 0x7ff){         /* max exponent, could be overflow */
            asum[i] += d;
            return;
        }
        if(asum[i] == 0.){      /* if empty slot store d */
            asum[i] = d;
            return;
        }
        d += asum[i];           /* else add slot to d, clear slot */
        asum[i] = 0.;           /* and continue until empty slot */
    }
}

/* return sum from array */
double returnsum(double asum[2048])
{
double sum = 0.;
size_t i;
    for(i = 0; i < 2048; i++)
        sum += asum[i];
    return sum;
}
Subglacial answered 7/9, 2015 at 19:23 Comment(2)
This sounds somewhat like the method of Malcolm 1971 or, more so, its variant that uses the exponent by Demmel and Hida ("Algorithm 3"). There's another algorithm out there that does a carry-based loop like yours, but I can't find it at the moment.Hassock
@Hassock - the concept is similar to bottom up merge sort for linked list, which also uses a small array, where array[i] points to list with 2^i nodes. I don't know how far back this goes. In my case, it was self discovery back in the 1970's.Subglacial

Your floats should be added in double precision. That will give you more additional precision than any other technique can. For a bit more precision and significantly more speed, you can create, say, four sums, and add them up at the end.

If you are adding double precision numbers, use long double for the sum - however, this will only have a positive effect in implementations where long double actually has more precision than double (typically x86, PowerPC depending on compiler settings).
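
A sketch of that idea, accumulating float inputs into double with four interleaved partial sums (the function name is illustrative; splitting the accumulator mainly helps the compiler pipeline or vectorise the loop):

#include <cstddef>
#include <vector>

double sum_in_double_4way(const std::vector<float>& v)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    std::size_t i = 0;
    for (; i + 4 <= v.size(); i += 4)   // four independent accumulators
    {
        s0 += v[i];
        s1 += v[i + 1];
        s2 += v[i + 2];
        s3 += v[i + 3];
    }
    for (; i < v.size(); ++i)           // leftover elements
        s0 += v[i];
    return (s0 + s1) + (s2 + s3);       // combine the partial sums at the end
}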

Gabriella answered 14/8, 2014 at 8:2 Comment(2)
“That will give you more additional precision than any other technique can” Do you realize that your answer comes more than one year after an earlier late answer that described how to use exact summation?Acetylide
The "long double" type is horrible and you shouldn't use it.Schulz

Regarding sorting, it seems to me that if you expect cancellation then the numbers should be added in descending order of magnitude, not ascending. For instance:

((-1 + 1) + 1e-20) will give 1e-20

but

((1e-20 + 1) - 1) will give 0

In the first equation the two large numbers cancel out, whereas in the second the 1e-20 term gets lost when added to 1, since there is not enough precision to retain it.

Also, pairwise summation is pretty decent for summing lots of numbers.
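
A minimal sketch of pairwise (cascade) summation, whose worst-case rounding error grows roughly like O(log n) rather than O(n):

#include <cstddef>

double pairwise_sum(const double* data, std::size_t n)
{
    if (n <= 8)                         // small base case: a plain loop is fine
    {
        double s = 0.0;
        for (std::size_t i = 0; i < n; ++i)
            s += data[i];
        return s;
    }
    const std::size_t half = n / 2;     // sum each half recursively, then combine
    return pairwise_sum(data, half) + pairwise_sum(data + half, n - half);
}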

Guanajuato answered 29/6, 2017 at 20:43 Comment(0)
