What's the numerically best way to calculate the average

What's the best way to calculate the average? With this question I want to know which algorithm for calculating the average is best in a numerical sense: it should have the smallest rounding error, should not be sensitive to overflow or underflow, and so on.

Thank you.


Additional information: incremental approaches are preferred, since the values may not fit into RAM (several parallel calculations on files larger than 4 GB).

Abacist asked 26/9, 2011 at 8:25 Comment(3)
Whoever voted to close as not constructive got it wrong big time. This is an excellent and appropriate question.Sack
Note that the different algorithms presented aren't mutually exclusive. It's perfectly feasible to read 1 MB chunks, sort them, sum them, and then use Kahan summation over all the partial sums.Mcchesney
thank you for all of your comments. They helped me understand my problem. I'll accept the paper as the answer, as it provides an analysis of different ways to handle the sum.Abacist

You can have a look at http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.43.3535 (Nick Higham, "The accuracy of floating point summation", SIAM Journal on Scientific Computing, 1993).

If I remember it correctly, compensated summation (Kahan summation) is good if all numbers are positive, at least as good as sorting them and adding them in ascending order (unless there are very, very many numbers). The story is much more complicated if some numbers are positive and some are negative, so that you get cancellation. In that case, there is an argument for adding them in descending order.

Eversole answered 26/9, 2011 at 10:33 Comment(6)
There's always the cheap trick of summing positive and negative numbers separately. In this case, the speed of the algorithm isn't that critical as long as it's O(N); the disk I/O will dominate almost any number of FP ops.Mcchesney
@Mcchesney Why would you want to sum them separately? If you want to minimize round-off error, then the intermediate results should be as small as possible (in absolute value). Summing them separately has the opposite effect.Eversole
As you note yourself, compensated summation is good if all numbers have the same sign.Mcchesney
Compensated summation is still good with positive and negative terms. Its error is cond(S_n)·u, where cond(S_n) is the condition number of the summation and u is the unit roundoff. Regular summation has error ~n·cond(S_n)·u. If the sum is ill-conditioned, none of the suggestions help.Allopath
On second thought, that idea about separate summation of positive and negative parts still has issues. You get two accurate results, but can still get catastrophic cancellation in the final results when those sub-sums are approximately equal.Mcchesney
And on further thought, if you separate the positive and negative terms, you get two compensation terms from Kahan summation. You shouldn't apply the compensation terms before adding the positive and negative sums together, but afterwards.Mcchesney

If you want an O(N) algorithm, look at Kahan summation.
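
A minimal sketch of compensated (Kahan) summation in C, assuming the data are available as an array of doubles; the function name kahan_mean is illustrative, not from any library:

#include <stddef.h>

/* Compensated (Kahan) summation: a running correction term recovers the
   low-order bits that are lost each time a small value is added to a
   much larger running sum. */
double kahan_mean(const double *data, size_t n)
{
    double sum = 0.0;   /* running sum */
    double c   = 0.0;   /* running compensation (lost low-order bits) */

    for (size_t i = 0; i < n; i++)
    {
        double y = data[i] - c;  /* apply the correction from the previous step */
        double t = sum + y;      /* low-order bits of y may be lost here */
        c = (t - sum) - y;       /* recover what was lost; becomes the next correction */
        sum = t;
    }

    return (n > 0) ? sum / (double) n : 0.0;
}

Note that an optimizer allowed to reassociate floating-point operations (e.g. -ffast-math) can eliminate the compensation entirely, so such flags should be avoided for this code.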

Mcchesney answered 26/9, 2011 at 8:29 Comment(4)
Clearly O(N) is going to be faster than the sorting approach. Do you know whether or not Kahan is more or less accurate than sorting before summing?Sack
@David Heffernan: from the Wikipedia entry: "Although Kahan's algorithm achieves O(1) error growth for summing n numbers, only slightly worse O(log n) growth can be achieved by pairwise summation: one recursively divides the set of numbers into two halves, sums each half, and then adds the two sums." I doubt the sorting method can do O(1). (A sketch of pairwise summation follows these comments.)Schramke
@yi_H I don't quite follow the logic there. Pairwise summation as described there doesn't seem to involve sorting. No matter, sorting is clearly very expensive compared to either Kahan or pairwise summation.Sack
@David: depends on the exact summing algorithm you're using after sorting. See citeseerx.ist.psu.edu/viewdoc/… for more details. 100% accuracy is possible (i.e. the only error is due to the finite amount of storage for the result)Mcchesney
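
To illustrate the pairwise summation mentioned in the comments above, here is a rough recursive sketch in C; the function name and the trivial base case are my own choices (real implementations switch to a plain loop below some block size):

#include <stddef.h>

/* Pairwise (cascade) summation: split the data in half, sum each half
   recursively, then add the two partial sums. Error grows roughly O(log n)
   instead of O(n) for plain left-to-right summation. */
double pairwise_sum(const double *data, size_t n)
{
    if (n == 0)
        return 0.0;
    if (n == 1)
        return data[0];

    size_t half = n / 2;
    return pairwise_sum(data, half) + pairwise_sum(data + half, n - half);
}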

Sort the numbers in ascending order of magnitude. Sum them, low magnitude first. Divide by the count.
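
A rough sketch of this approach in C, assuming the data fit in memory; sorting by absolute value with qsort is standard, and the function names are illustrative:

#include <stdlib.h>
#include <math.h>

/* Compare by absolute value so that small-magnitude terms are added first. */
static int cmp_abs(const void *a, const void *b)
{
    double x = fabs(*(const double *) a);
    double y = fabs(*(const double *) b);
    return (x > y) - (x < y);
}

double sorted_mean(double *data, size_t n)
{
    qsort(data, n, sizeof *data, cmp_abs);

    double sum = 0.0;
    for (size_t i = 0; i < n; i++)  /* lowest magnitude first */
        sum += data[i];

    return (n > 0) ? sum / (double) n : 0.0;
}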

Sack answered 26/9, 2011 at 8:27 Comment(11)
How does the sorting and adding numbers in ascending order make a difference here?Gurkha
why the sorting? isn't this prone to overflows (the sum might overflow)?Abacist
@Jan It reduces roundoff. Think about it.Sack
@Tobias Is overflow really a problem for you?Sack
@David: yes, it can be. It is possible that the calculation is used for several million samples.Abacist
@Tobias So what? It's the size of the samples that matters, not how many there are. You'd need the individual values to be of order of magnitude 1e300 to be worried about overflow.Sack
@Jan S Roundoff occurs when you add (or subtract) two values of greatly differing magnitude.Sack
@Jan: simple example: one large number (e.g. 10^20) and lots of small numbers (e.g. 1). If you put the 10^20 as the first number, adding the 1s will do nothing, since they always get rounded off. This may change the sum significantly.Abacist
Overflow is a valid concern given the edit of the question. When summing a billion doubles, your sum is of course a billion times larger than the average. That's a factor of 1e9, which would be an issue for single-precision floats (maximum around 1e38, so only ~1e29 of headroom) but generally isn't for double-precision floats (maximum around 1e308).Mcchesney
@Mcchesney Yes, I was assuming double precision. You make a good point about single.Sack
Overflow is even more relevant for half-precision floats (used more and more for GPUs with tensor cores).Marylou

I always use the following pseudocode:

float mean=0.0; // could use double
int n=0;  // could use long

for each x in data:
    ++n;
    mean+=(x-mean)/n;

I don't have a formal proof of its stability, but you can see that we won't have problems with numerical overflow, assuming that the data values are well behaved. It's referred to in Knuth's The Art of Computer Programming.

Ineducable answered 3/6, 2014 at 19:44 Comment(4)
Note that for float, this becomes less and less accurate the closer n gets to 2^23. For example, a sequence of 10 million values of 1.0 followed by 10 million values of 2.0 would average to 1.269. This is because (x-mean)/n gets closer to zero, and mean does not change when it is added.Dorcasdorcea
@Dorcasdorcea if you’re worried about that, you can always shuffle them first.Ineducable
+1 as this is almost as simple as a naive implementation without the accumulation buildup. Which Knuth book, volume 2?Antinucleon
This blogpost provides reasoning for this method. diego.assencio.com/?index=c34d06f4f4de2375658ed41f70177d59Subdebutante

Just to add one possible answer for further discussion:

Incrementally calculate the average for each step:

AVG_n = AVG_(n-1) * (n-1)/n + VALUE_n / n

or pairwise combination

AVG_(n_a + n_b) = (n_a * AVG_a + n_b * AVG_b) / (n_a + n_b)

(I hope the formulas are clear enough)
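
A small sketch of both formulas in C, assuming each chunk or file is reduced to a (count, average) pair that can later be merged; the type and function names are illustrative:

#include <stddef.h>

typedef struct { size_t n; double avg; } running_avg;

/* Incremental form: AVG_n = AVG_(n-1) * (n-1)/n + VALUE_n / n,
   rewritten as avg += (value - avg) / n, which is algebraically the same
   but avoids forming the large product. */
void ravg_add(running_avg *r, double value)
{
    r->n += 1;
    r->avg += (value - r->avg) / (double) r->n;
}

/* Pairwise combination: AVG_(n_a + n_b) = (n_a*AVG_a + n_b*AVG_b) / (n_a + n_b),
   written so that the result stays between the two partial averages. */
running_avg ravg_merge(running_avg a, running_avg b)
{
    running_avg out;
    out.n = a.n + b.n;
    out.avg = (out.n > 0)
            ? a.avg + (b.avg - a.avg) * ((double) b.n / (double) out.n)
            : 0.0;
    return out;
}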

Abacist answered 26/9, 2011 at 14:6 Comment(3)
Admittedly, I have not worked it out myself, but the repeated divisions of the incremental form seem like they would generate more accuracy loss. Part of the issue is that 1/n introduces errors in the least significant bits, so n/n != 1, at least when it is performed as a three-step operation (divide-store-multiply). This is minimized if the division is only performed once, but you'd be doing it over GB of data.Zuniga
With this formula you don't run the risk of an overflow of the datatype.Abacist
Awesome formulas. Here's my implementation of the first one in C#: double average(IList<int> numbers, int avgUpTo1BasedIndex) { return (avgUpTo1BasedIndex == 1) ? numbers[0] : average(numbers, avgUpTo1BasedIndex - 1) * (avgUpTo1BasedIndex - 1) / avgUpTo1BasedIndex + (double)numbers[avgUpTo1BasedIndex - 1] / avgUpTo1BasedIndex; }Baelbeer

A very late post, but since I don't have enough reputation to comment: the incremental method above (mean += (x - mean)/n) is the one used (as of December 2020) by the GNU Scientific Library (GSL).

Here is the code, extracted from mean_source.c:

double FUNCTION (gsl_stats, mean) (const BASE data[], const size_t stride, const size_t size)
{
  /* Compute the arithmetic mean of a dataset using the recurrence relation
     mean_(n) = mean(n-1) + (data[n] - mean(n-1)) / (n+1)  */

  long double mean = 0;
  size_t i;

  for (i = 0; i < size; i++)
    {
      mean += (data[i * stride] - mean) / (i + 1);
    }

  return mean;
}

GSL uses the same algorithm to calculate the variance, which is, after all, just a mean of squared differences from a given number.
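
For illustration only (this is not the GSL source), the same recurrence applied to squared deviations from an already-computed value m gives a mean of squared differences, i.e. the variance when m is the mean:

#include <stddef.h>

/* Mean of squared differences from a given value m, using the same
   incremental recurrence as above (no Bessel correction applied). */
double mean_sq_diff(const double data[], size_t stride, size_t size, double m)
{
    long double acc = 0;
    size_t i;

    for (i = 0; i < size; i++)
    {
        long double d = data[i * stride] - m;
        acc += (d * d - acc) / (i + 1);
    }

    return (double) acc;
}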

Gainor answered 20/12, 2020 at 17:28 Comment(0)
