What's the numerically best way to calculate the average

What's the best way to calculate the average? With this question I want to know which algorithm for calculating the average is best in a numerical sense: it should have the smallest rounding error, should not be sensitive to overflow or underflow, and so on.

Thank you.


Additional information: incremental approaches are preferred, since the values may not fit into RAM (several parallel calculations on files larger than 4 GB).

Abacist asked 26/9, 2011 at 8:25 Comment(3)
Whoever voted to close as not constructive got it wrong big time. This is an excellent and appropriate question.Sack
Note that the different algorithms presented aren't mutually exclusive. It's perfectly feasible to read 1 MB chunks, sort them, sum them, and then use Kahan summation over all the partial sums.Mcchesney
thank you for all of your comments. They helped me understand my problem. I'll accept the paper as the answer, as it provides an analysis of different ways to handle the sum.Abacist

You can have a look at http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.43.3535 (Nick Higham, "The accuracy of floating point summation", SIAM Journal on Scientific Computing, 1993).

If I remember it correctly, compensated summation (Kahan summation) is good if all numbers are positive, at least as good as sorting them and adding them in ascending order (unless there are very, very many numbers). The story is much more complicated if some numbers are positive and some are negative, so that you get cancellation. In that case, there is an argument for adding them in descending order.

Eversole answered 26/9, 2011 at 10:33 Comment(6)
There's always the cheap trick of summing positive and negative numbers separately. In this case, the speed of the algorithm isn't that critical as long as it's O(N); the disk I/O will dominate almost any number of FP ops.Mcchesney
@Mcchesney Why would you want to sum them separately? If you want to minimize round-off error, then the intermediate results should be as small as possible (in absolute value). Summing them separately has the opposite effect.Eversole
As you note yourself, compensated summation is good if all numbers have the same sign.Mcchesney
Compensated summation is still good with positive and negative terms. Its error is cond(S_n)·u, where cond(S_n) is the condition number of the summation and u is the unit roundoff. Regular summation has error ~n·cond(S_n)·u. If the sum is ill-conditioned, none of the suggestions help.Allopath
On second thought, that idea about separate summation of positive and negative parts still has issues. You get two accurate results, but can still get catastrophic cancellation in the final results when those sub-sums are approximately equal.Mcchesney
And on further thought, if you separate the positive and negative terms, you get two compensation terms from Kahan summation. You shouldn't apply the compensation terms before adding the positive and negative sums together, but afterwards.Mcchesney

If you want an O(N) algorithm, look at Kahan summation.
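
A minimal sketch of compensated (Kahan) summation in C, assuming the data are available as an array of doubles; the function name kahan_mean is illustrative, not from any library:

#include <stddef.h>

/* Compensated (Kahan) summation: a running correction term recovers the
   low-order bits that are lost each time a small value is added to a
   much larger running sum. */
double kahan_mean(const double *data, size_t n)
{
    double sum = 0.0;   /* running sum */
    double c   = 0.0;   /* running compensation (lost low-order bits) */

    for (size_t i = 0; i < n; i++)
    {
        double y = data[i] - c;  /* apply the correction from the previous step */
        double t = sum + y;      /* low-order bits of y may be lost here */
        c = (t - sum) - y;       /* recover what was lost; becomes the next correction */
        sum = t;
    }

    return (n > 0) ? sum / (double) n : 0.0;
}

Note that an optimizer allowed to reassociate floating-point operations (e.g. -ffast-math) can eliminate the compensation entirely, so such flags should be avoided for this code.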

Mcchesney answered 26/9, 2011 at 8:29 Comment(4)
Clearly O(N) is going to be faster than the sorting approach. Do you know whether or not Kahan is more or less accurate than sorting before summing?Sack
@David Heffernan: from the Wikipedia entry: "Although Kahan's algorithm achieves O(1) error growth for summing n numbers, only slightly worse O(log n) growth can be achieved by pairwise summation: one recursively divides the set of numbers into two halves, sums each half, and then adds the two sums." I doubt the sorting method can do O(1). (A sketch of pairwise summation follows these comments.)Schramke
@yi_H I don't quite follow the logic there. Pairwise summation as described there doesn't seem to involve sorting. No matter, sorting is clearly very expensive compared to either Kahan or pairwise summation.Sack
@David: depends on the exact summing algorithm you're using after sorting. See citeseerx.ist.psu.edu/viewdoc/… for more details. 100% accuracy is possible (i.e. the only error is due to the finite amount of storage for the result)Mcchesney
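
To illustrate the pairwise summation mentioned in the comments above, here is a rough recursive sketch in C; the function name and the trivial base case are my own choices (real implementations switch to a plain loop below some block size):

#include <stddef.h>

/* Pairwise (cascade) summation: split the data in half, sum each half
   recursively, then add the two partial sums. Error grows roughly O(log n)
   instead of O(n) for plain left-to-right summation. */
double pairwise_sum(const double *data, size_t n)
{
    if (n == 0)
        return 0.0;
    if (n == 1)
        return data[0];

    size_t half = n / 2;
    return pairwise_sum(data, half) + pairwise_sum(data + half, n - half);
}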

Sort the numbers in ascending order of magnitude. Sum them, low magnitude first. Divide by the count.
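
A rough sketch of this approach in C, assuming the data fit in memory; sorting by absolute value with qsort is standard, and the function names are illustrative:

#include <stdlib.h>
#include <math.h>

/* Compare by absolute value so that small-magnitude terms are added first. */
static int cmp_abs(const void *a, const void *b)
{
    double x = fabs(*(const double *) a);
    double y = fabs(*(const double *) b);
    return (x > y) - (x < y);
}

double sorted_mean(double *data, size_t n)
{
    qsort(data, n, sizeof *data, cmp_abs);

    double sum = 0.0;
    for (size_t i = 0; i < n; i++)  /* lowest magnitude first */
        sum += data[i];

    return (n > 0) ? sum / (double) n : 0.0;
}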

Sack answered 26/9, 2011 at 8:27 Comment(11)
How does the sorting and adding numbers in ascending order make a difference here?Gurkha
why the sorting? isn't this prone to overflows (the sum might overflow)?Abacist
@Jan It reduces roundoff. Think about it.Sack
@Tobias Is overflow really a problem for you?Sack
@David: yes, it can be. It is possible that the calculation is used for several million samples.Abacist
@Tobias So what? It's the size of the samples that matters, not how many there are. You'd need the individual values to be of order of magnitude 1e300 to be worried about overflow.Sack
@Jan S Roundoff occurs when you add (or subtract) two values of greatly differing magnitude.Sack
@Jan: simple example: one large number (e.g. 10^20) and lots of small numbers (e.g. 1). If you put the 10^20 as the first number, adding the 1s will do nothing, since they always get rounded off. This may change the sum significantly.Abacist
Overflow is a valid concern given the edit of the question. When summing a billion doubles, your sum is of course a billion times larger than the average. That's a factor of 1e9, which would be an issue for single-precision floats (maximum around 1e38, so only ~1e29 of headroom) but generally isn't for double-precision floats (maximum around 1e308).Mcchesney
@Mcchesney Yes, I was assuming double precision. You make a good point about single.Sack
Overflow is even more relevant for half-precision floats (used more and more for GPUs with tensor cores).Marylou

I always use the following pseudocode:

float mean=0.0; // could use double
int n=0;  // could use long

for each x in data:
    ++n;
    mean+=(x-mean)/n;

I don't have a formal proof of its stability, but you can see that we won't have problems with numerical overflow, assuming that the data values are well behaved. It's referred to in Knuth's The Art of Computer Programming.

Ineducable answered 3/6, 2014 at 19:44 Comment(4)
Note that for float, this becomes less and less accurate the closer n gets to 2^23. For example, a sequence of 10 million values of 1.0 followed by 10 million values of 2.0 would average to 1.269. This is because (x-mean)/n gets closer to zero, and mean does not change when it is added.Dorcasdorcea
@Dorcasdorcea if you’re worried about that, you can always shuffle them first.Ineducable
+1 as this is almost as simple as a naive implementation without the accumulation buildup. Which Knuth book, volume 2?Antinucleon
This blogpost provides reasoning for this method. diego.assencio.com/?index=c34d06f4f4de2375658ed41f70177d59Subdebutante

Just to add one possible answer for further discussion:

Incrementally calculate the average for each step:

AVG_n = AVG_(n-1) * (n-1)/n + VALUE_n / n

or pairwise combination

AVG_(n_a + n_b) = (n_a * AVG_a + n_b * AVG_b) / (n_a + n_b)

(I hope the formulas are clear enough)
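
A small sketch of both formulas in C, assuming each chunk or file is reduced to a (count, average) pair that can later be merged; the type and function names are illustrative:

#include <stddef.h>

typedef struct { size_t n; double avg; } running_avg;

/* Incremental form: AVG_n = AVG_(n-1) * (n-1)/n + VALUE_n / n,
   rewritten as avg += (value - avg) / n, which is algebraically the same
   but avoids forming the large product. */
void ravg_add(running_avg *r, double value)
{
    r->n += 1;
    r->avg += (value - r->avg) / (double) r->n;
}

/* Pairwise combination: AVG_(n_a + n_b) = (n_a*AVG_a + n_b*AVG_b) / (n_a + n_b),
   written so that the result stays between the two partial averages. */
running_avg ravg_merge(running_avg a, running_avg b)
{
    running_avg out;
    out.n = a.n + b.n;
    out.avg = (out.n > 0)
            ? a.avg + (b.avg - a.avg) * ((double) b.n / (double) out.n)
            : 0.0;
    return out;
}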

Abacist answered 26/9, 2011 at 14:6 Comment(3)
Admittedly, I have not worked it out myself, but the repeated divisions of the incremental form seem like they would generate more accuracy loss. Part of the issue is that 1/n introduces errors in the least significant bits, so n/n != 1, at least when it is performed as a three-step operation (divide-store-multiply). This is minimized if the division is only performed once, but you'd be doing it over GB of data.Zuniga
With this formula you don't run the risk of an overflow of the datatype.Abacist
Awesome formulas. Here's my implementation of the first one in C#: double average(IList<int> numbers, int avgUpTo1BasedIndex) { return (avgUpTo1BasedIndex == 1) ? numbers[0] : average(numbers, avgUpTo1BasedIndex - 1) * (avgUpTo1BasedIndex - 1) / avgUpTo1BasedIndex + (double)numbers[avgUpTo1BasedIndex - 1] / avgUpTo1BasedIndex; }Baelbeer

A very late post, but since I don't have enough reputation to comment: the incremental method above (mean += (x - mean)/n) is the one used (as of December 2020) by the GNU Scientific Library (GSL).

Here is the code, extracted from mean_source.c:

double FUNCTION (gsl_stats, mean) (const BASE data[], const size_t stride, const size_t size)
{
  /* Compute the arithmetic mean of a dataset using the recurrence relation
     mean_(n) = mean(n-1) + (data[n] - mean(n-1)) / (n+1)  */

  long double mean = 0;
  size_t i;

  for (i = 0; i < size; i++)
    {
      mean += (data[i * stride] - mean) / (i + 1);
    }

  return mean;
}

GSL uses the same algorithm to calculate the variance, which is, after all, just a mean of squared differences from a given number.
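
For illustration only (this is not the GSL source), the same recurrence applied to squared deviations from an already-computed value m gives a mean of squared differences, i.e. the variance when m is the mean:

#include <stddef.h>

/* Mean of squared differences from a given value m, using the same
   incremental recurrence as above (no Bessel correction applied). */
double mean_sq_diff(const double data[], size_t stride, size_t size, double m)
{
    long double acc = 0;
    size_t i;

    for (i = 0; i < size; i++)
    {
        long double d = data[i * stride] - m;
        acc += (d * d - acc) / (i + 1);
    }

    return (double) acc;
}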

Gainor answered 20/12, 2020 at 17:28 Comment(0)
