mean from pandas and numpy differ
Asked Answered
D

2

33

I have a MEMS IMU on which I've been collecting data and I'm using pandas to get some statistical data from it. There are 6 32-bit floats collected each cycle. Data rates are fixed for a given collection run. The data rates vary between 100Hz and 1000Hz and the collection times run as long as 72 hours. The data is saved in a flat binary file. I read the data this way:

import numpy as np
import pandas as pd
dataType=np.dtype([('a','<f4'),('b','<f4'),('c','<f4'),('d','<f4'),('e','<f4'),('e','<f4')])
df=pd.DataFrame(np.fromfile('FILENAME',dataType))
df['c'].mean()
-9.880581855773926
x=df['c'].values
x.mean()
-9.8332081

-9.833 is the correct result. I can create a similar result that someone should be able to repeat this way:

import numpy as np
import pandas as pd
x=np.random.normal(-9.8,.05,size=900000)
df=pd.DataFrame(x,dtype='float32',columns=['x'])
df['x'].mean()
-9.859579086303711
x.mean()
-9.8000648778888628

I've repeated this on linux and windows, on AMD and Intel processors, in Python 2.7 and 3.5. I'm stumped. What am I doing wrong? And get this:

x=np.random.normal(-9.,.005,size=900000)
df=pd.DataFrame(x,dtype='float32',columns=['x'])
df['x'].mean()
-8.999998092651367
x.mean()
-9.0000075889406528

I could accept this difference. It's at the limit of the precision of 32 bit floats.

NEVERMIND. I wrote this on Friday and the solution hit me this morning. It is a floating point precision problem exacerbated by the large amount of data. I needed to convert the data into 64 bit float on the creation of the dataframe this way:

df=pd.DataFrame(np.fromfile('FILENAME',dataType),dtype='float64')

I'll leave the post should anyone else run into a similar issue.

Divert answered 29/10, 2018 at 9:19 Comment(4)
I can't reproduce your first check, I get float32-sized errors. Note that your x contains doubles but your df contains floats. That will always give you a difference, but not as large as the original one. Any chance that you have missing values that messes with how the mean is computed?Duce
Part of the issue is that Pandas is using a poor algorithm to compute the mean; eventually, as the sum accumulates, a value close to -9.8 is being repeatedly added to something bigger than 2**23, and limited float32 resolution means that the actual sum changes by exactly -10.0 for most of the random samples. Use of pairwise summation or Kahan summation instead of a simple accumulating sum would have improved the result greatly here. But yes, computing the mean in double precision is the obvious quick fix.Scholarship
@MarkDickinson, Why, then, does the problem not exhibit itself with df['x'].sum() / len(df.index), which gives the correct result even with float32?Snifter
@jpp: Good question. I think you'd have to ask the Pandas authors. NumPy does use pairwise summation for its sum operations in some (but not all) circumstances; it's possible that for whatever reason this particular use of df['x'].sum() ends up in one of those NumPy cases.Scholarship
J
24

Short version:

The reason it's different is because pandas uses bottleneck (if it's installed) when calling the mean operation, as opposed to just relying on numpy. bottleneck is presumably used since it appears to be faster than numpy (at least on my machine), but at the cost of precision. They happen to match for the 64 bit version, but differ in 32 bit land (which is the interesting part).

Long version:

It's extremely difficult to tell what's going on just by inspecting the source code of these modules (they're quite complex, even for simple computations like mean, turns out numerical computing is hard). Best to use the debugger to avoid brain-compiling and those types of mistakes. The debugger won't make a mistake in logic, it'll tell you exactly what's going on.

Here's some of my stack trace (values differ slightly since no seed for RNG):

Can reproduce (Windows):

>>> import numpy as np; import pandas as pd
>>> x=np.random.normal(-9.,.005,size=900000)
>>> df=pd.DataFrame(x,dtype='float32',columns=['x'])
>>> df['x'].mean()
-9.0
>>> x.mean()
-9.0000037501099754
>>> x.astype(np.float32).mean()
-9.0000029

Nothing extraordinary going on with numpy's version. It's the pandas version that's a little wacky.

Let's have a look inside df['x'].mean():

>>> def test_it_2():
...   import pdb; pdb.set_trace()
...   df['x'].mean()
>>> test_it_2()
... # Some stepping/poking around that isn't important
(Pdb) l
2307
2308            if we have an ndarray as a value, then simply perform the operation,
2309            otherwise delegate to the object
2310
2311            """
2312 ->         delegate = self._values
2313            if isinstance(delegate, np.ndarray):
2314                # Validate that 'axis' is consistent with Series's single axis.
2315                self._get_axis_number(axis)
2316                if numeric_only:
2317                    raise NotImplementedError('Series.{0} does not implement '
(Pdb) delegate.dtype
dtype('float32')
(Pdb) l
2315                self._get_axis_number(axis)
2316                if numeric_only:
2317                    raise NotImplementedError('Series.{0} does not implement '
2318                                              'numeric_only.'.format(name))
2319                with np.errstate(all='ignore'):
2320 ->                 return op(delegate, skipna=skipna, **kwds)
2321
2322            return delegate._reduce(op=op, name=name, axis=axis, skipna=skipna,
2323                                    numeric_only=numeric_only,
2324                                    filter_type=filter_type, **kwds)

So we found the trouble spot, but now things get kind of weird:

(Pdb) op
<function nanmean at 0x000002CD8ACD4488>
(Pdb) op(delegate)
-9.0
(Pdb) delegate_64 = delegate.astype(np.float64)
(Pdb) op(delegate_64)
-9.000003749978807
(Pdb) delegate.mean()
-9.0000029
(Pdb) delegate_64.mean()
-9.0000037499788075
(Pdb) np.nanmean(delegate, dtype=np.float64)
-9.0000037499788075
(Pdb) np.nanmean(delegate, dtype=np.float32)
-9.0000029

Note that delegate.mean() and np.nanmean output -9.0000029 with type float32, not -9.0 as pandas nanmean does. With a bit of poking around, you can find the source to pandas nanmean in pandas.core.nanops. Interestingly, it actually appears like it should be matching numpy at first. Let's have a look at pandas nanmean:

(Pdb) import inspect
(Pdb) src = inspect.getsource(op).split("\n")
(Pdb) for line in src: print(line)
@disallow('M8')
@bottleneck_switch()
def nanmean(values, axis=None, skipna=True):
    values, mask, dtype, dtype_max = _get_values(values, skipna, 0)

    dtype_sum = dtype_max
    dtype_count = np.float64
    if is_integer_dtype(dtype) or is_timedelta64_dtype(dtype):
        dtype_sum = np.float64
    elif is_float_dtype(dtype):
        dtype_sum = dtype
        dtype_count = dtype
    count = _get_counts(mask, axis, dtype=dtype_count)
    the_sum = _ensure_numeric(values.sum(axis, dtype=dtype_sum))

    if axis is not None and getattr(the_sum, 'ndim', False):
        the_mean = the_sum / count
        ct_mask = count == 0
        if ct_mask.any():
            the_mean[ct_mask] = np.nan
    else:
        the_mean = the_sum / count if count > 0 else np.nan

    return _wrap_results(the_mean, dtype)

Here's a (short) version of the bottleneck_switch decorator:

import bottleneck as bn
...
class bottleneck_switch(object):

    def __init__(self, **kwargs):
        self.kwargs = kwargs

    def __call__(self, alt):
        bn_name = alt.__name__

        try:
            bn_func = getattr(bn, bn_name)
        except (AttributeError, NameError):  # pragma: no cover
            bn_func = None
    ...

                if (_USE_BOTTLENECK and skipna and
                        _bn_ok_dtype(values.dtype, bn_name)):
                    result = bn_func(values, axis=axis, **kwds)

This is called with alt as the pandas nanmean function, so bn_name is 'nanmean', and this is the attr that's grabbed from the bottleneck module:

(Pdb) l
 93                             result = np.empty(result_shape)
 94                             result.fill(0)
 95                             return result
 96
 97                     if (_USE_BOTTLENECK and skipna and
 98  ->                         _bn_ok_dtype(values.dtype, bn_name)):
 99                         result = bn_func(values, axis=axis, **kwds)
100
101                         # prefer to treat inf/-inf as NA, but must compute the fun
102                         # twice :(
103                         if _has_infs(result):
(Pdb) n
> d:\anaconda3\lib\site-packages\pandas\core\nanops.py(99)f()
-> result = bn_func(values, axis=axis, **kwds)
(Pdb) alt
<function nanmean at 0x000001D2C8C04378>
(Pdb) alt.__name__
'nanmean'
(Pdb) bn_func
<built-in function nanmean>
(Pdb) bn_name
'nanmean'
(Pdb) bn_func(values, axis=axis, **kwds)
-9.0

Pretend that bottleneck_switch() decorator doesn't exist for a second. We can actually see that calling that manually stepping through this function (without bottleneck) will get you the same result as numpy:

(Pdb) from pandas.core.nanops import _get_counts
(Pdb) from pandas.core.nanops import _get_values
(Pdb) from pandas.core.nanops import _ensure_numeric
(Pdb) values, mask, dtype, dtype_max = _get_values(delegate, skipna=skipna)
(Pdb) count = _get_counts(mask, axis=None, dtype=dtype)
(Pdb) count
900000.0
(Pdb) values.sum(axis=None, dtype=dtype) / count
-9.0000029

That never gets called, though, if you have bottleneck installed. Instead, the bottleneck_switch() decorator instead blasts over the nanmean function with bottleneck's version. This is where the discrepancy lies (interestingly it matches on the float64 case, though):

(Pdb) import bottleneck as bn
(Pdb) bn.nanmean(delegate)
-9.0
(Pdb) bn.nanmean(delegate.astype(np.float64))
-9.000003749978807

bottleneck is used solely for speed, as far as I can tell. I'm assuming they're taking some type of shortcut with their nanmean function, but I didn't look into it much (see @ead's answer for details on this topic). You can see that it's typically a bit faster than numpy by their benchmarks: https://github.com/kwgoodman/bottleneck. Clearly the price to pay for this speed is precision.

Is bottleneck actually faster?

Sure looks like it (at least on my machine).

In [1]: import numpy as np; import pandas as pd

In [2]: x=np.random.normal(-9.8,.05,size=900000)

In [3]: y_32 = x.astype(np.float32)

In [13]: %timeit np.nanmean(y_32)
100 loops, best of 3: 5.72 ms per loop

In [14]: %timeit bn.nanmean(y_32)
1000 loops, best of 3: 854 µs per loop

It might be nice for pandas to introduce a flag here (one for speed, the other for better precision, default is for speed since that's the current impl). Some users care much more about the accuracy of the computation than the speed at which it happens.

HTH.

Jahdiel answered 4/11, 2018 at 19:45 Comment(8)
You say "numpy beats it into float64 to improve precision", but the code you show doesn't seem to support this. In numpy.core._methods._mean, the sum (call to umr_sum) ends up being performed with dtype=None.Scholarship
Ah, if you're looking at x.mean(), then x has dtype np.float64 in the first place. That would explain why you're seeing float64 results inside the mean.Scholarship
And if you want convincing that NumPy doesn't do an automatic conversion from float32 to float64 before performing summation, try doing np.ones((10**8, 2), dtype=np.float32).mean(axis=0). It's the use of pairwise summation that's actually making the difference to accuracy in NumPy's case. (As to what Pandas is doing: I have no idea.)Scholarship
@MarkDickinson yes, my mistake. I'll edit later (on phone atm and heading for work soon)Jahdiel
@MarkDickinson as for pandas, though, what happens is that they patch over nanmean in favor of bottlenecks version of said function. I think I'll remove the part about numpy since what they do is more or less straightforward and just make it more clear as to how pandas is behavingJahdiel
@MarkDickinson Okay, updated. I just took at the part about numpy: it's pretty straightforward. It's pandas that is the goofball.Jahdiel
Great answer + explanation. I'm going to give this some air time so it gets more views. I hope it reaches Pandas developers. Seems like an unintended consequence which can have strange and significant impacts beyond float32 vs float64 precision, e.g. OP's extreme example.Snifter
Well, NumPy's behaviour is fairly goofy, too. The fact that np.ones((10**8, 1), dtype=np.float32).mean(axis=0) and np.ones((2, 10**8), dtype=np.float32).mean(axis=1) are accurate but np.ones((10**8, 2), dtype=np.float32).mean(axis=0) is not, is goofy. Explainable, certainly, but still goofy.Scholarship
M
17

@Matt Messersmith answer is a great investigation, but I would like to add an in my opinion important point: both results (numpy's and pandas') are wrong. However, numpy has higher probability to be less wrong than panda.

There is no fundamental difference between using float32 and float64, however, for float32, problems can be observed for smaller data sets than for float64.

It is not really defined, how the mean should be calculated - the given mathematical definition is only unambiguous for infinitely precise numbers, but not for the floating point operations our PCs are using.

So what is the "right" formula?

    mean = (x0+..xn)/n 
  or 
    mean = [(x0+x1)+(x2+x3)+..]/n
  or
    mean = 1.0/n*(x0+..xn)
  and so on...

Obviously, when calculated on modern hardware they all will give different results - one would ideally peek a formula which makes the smallest error compared to a theoretical right value (which is calculated with infinite precision).

Numpy uses slightly alternated pairwise summation, i.e. (((x1+x2)+(x3+x4))+(...)), which, even if not perfect, is known to be quite good. On the other hand, bottleneck uses the naive summation x1+x2+x3+...:

REDUCE_ALL(nanmean, DTYPE0)
{
    ...
    WHILE {
        FOR {
            ai = AI(DTYPE0);
            if (ai == ai) {
                asum += ai;   <---- HERE WE GO
                count += 1;
            }
        }
        NEXT
    }
    ...
}

and we can easily see what goes on: After some steps, bottleneck sums one large (sum of all previous elements, proportional to -9.8*number_of_steps) and one small element (about -9.8) which leads to quite an rounding error of about big_number*eps, with eps being around 1e-7 for float32. This means that after 10^6 summations we could have a relative error of about 10% (eps*10^6, this is an upper bound).

For float64 and eps being about 1e-16 the relative error would be only about 1e-10 after 10^6 summations. It might seem to be precise to us, but measured against the possible precision it is still a fiasco!

Numpy on the other hand (at least for the series at hand) will add two elements which are almost equal. In this case the upper bound for the resulting relative error is eps*log_2(n), which is

  • maximal 2e-6 for float32 and 10^6 elements
  • maximal 2e-15 for float64 and 10^6 elements.

From the above, among others, there are following noteworthy implications:

  • if the mean of the distribution is 0, then pandas and numpy are almost equally precise - the magnitude of summed numbers is about 0.0 and there is no big difference between summands which would lead to a large rounding error for naive summation.
  • if one knows a good estimate for the mean, it might be more robust to calculate the sum of x'i=xi-mean_estimate, because x'i will have mean of 0.0.
  • something like x=(.333*np.ones(1000000)).astype(np.float32) is enough to trigger the strange behavior of pandas' version - no need for randomness, and we know what the result should be, don't we? It is important, that 0.333 cannot be presented precisely with floating point.

NB: The above holds for 1-dimensional numpy-arrays. Situation is more complicated for summing along an axis for multi-dimensional numpy-arrays, as numpy sometimes switches to naive-summation. For a more detailed investigation see this SO-post, which also explains @Mark Dickinson observation, i.e.:

np.ones((2, 10**8), dtype=np.float32).mean(axis=1) are accurate but np.ones((10**8, 2), dtype=np.float32).mean(axis=0) aren't

Mutinous answered 7/11, 2018 at 13:42 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.