Strange or inaccurate result with rolling sum (floating point precision)

I have a series I get from an outside source (x). It's all positive, and is mostly zero.

x.describe()
count    23275.000000
mean         0.015597
std          0.411720
min          0.000000
25%          0.000000
50%          0.000000
75%          0.000000
max         26.000000
dtype: float64

However, running rolling_sum on it produces values smaller than zero. Why does this happen? Is there any way to avoid or bypass it?

rolling_sum(x, window=100).iloc[-1]
-1.4743761767e-13
(rolling_sum(x, window=100)<0).sum()
16291

What's even weirder is that these two calculations (which to the best of my knowledge should yield the same value) do not:

rolling_sum(x, window=100).iloc[-1]
-1.4743761767e-13
rolling_sum(x.iloc[-100:], window=100).iloc[-1]
0.0

(This happens with both pandas 0.14.1 and 0.15.2.)

Blamable answered 9/3, 2015 at 21:31 Comment(7)
Possibly related (apparently fixed, however) bug – Telesis
Does x.sort(ascending=False) before rolling_sum fix the issue? – Telesis
@Telesis Thanks, but sadly, no. – Blamable
Does pandas.rolling_sum work? (Instead of pandas.stats.moments.rolling_sum) – Telesis
@Telesis It produces the same result. – Blamable
First idea: is it possible there are some very small negative numbers in there that get rounded by describe()? Maybe sort and then print the first value just to be sure, e.g. print("%20.15f" % x[0]) – Alternate
Second idea: these differences are very small; it could just be a precision issue. It's easy to do something two slightly different ways that are essentially the same and then get differences like 1.0e-13. Unless you actually care about that level of precision, I'm not sure you really have a problem. E.g. In [304]: .1 + .1 + .1 - .3 gives Out[304]: 5.551115123125783e-17 – Alternate
Answer (score 10)

I think I can guess what is happening:

In [481]: df=pd.DataFrame( { 'x':[0,0,.1,.2,0,0] } )

In [482]: df2 = pd.rolling_sum(df,window=2)

In [483]: df2
Out[483]: 
              x
0           NaN
1  0.000000e+00
2  1.000000e-01
3  3.000000e-01
4  2.000000e-01
5  2.775558e-17

It looks OK, except for the last one, right? In fact, the rounding has obscured the fact that some of the other entries are not as clean as they appear at first glance; the default display format disguises this unless a value is very close to zero.

In [493]: for i in range(6):
     ...:     print('%22.19f' % df2.loc[i, 'x'])
                   nan
 0.0000000000000000000
 0.1000000000000000056
 0.3000000000000000444
 0.2000000000000000389
 0.0000000000000000278

What's happening here is that rolling_sum does not actually compute a fresh sum for each window. Rather, it maintains a running sum, adding the newest number and subtracting the oldest one as the window slides. In this trivial example with window=2 that buys nothing, but if the window is much larger it can speed the computation up considerably, so it makes sense to do it that way.

However, that means that some unexpected results can happen. You're expecting the last rolling sum to be the result of 0 + 0, but it's not; it's actually something like this:

In [492]: (.0+.0)+(.1-.0)+(.2-.0)+(.0-.1)+(.0-.2)
Out[492]: 2.7755575615628914e-17
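
To see this mechanism outside of pandas, here is a minimal pure-Python sketch of the same add-the-newest/subtract-the-oldest strategy (pandas implements this in compiled code; this is only an illustration of the idea, not its actual source):

def rolling_sum_incremental(values, window):
    # Maintain a running total: add the value entering the window and
    # subtract the value leaving it. Each float subtraction can leave
    # a tiny residue behind instead of restoring an exact zero.
    out = []
    total = 0.0
    for i, v in enumerate(values):
        total += v
        if i >= window:
            total -= values[i - window]
        out.append(total if i >= window - 1 else float('nan'))
    return out

rolling_sum_incremental([0, 0, .1, .2, 0, 0], window=2)[-1]
# 2.7755575615628914e-17, the same residue as above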

Bottom line: Your results are basically fine. It just happens that the way you did it (with these data) revealed the precision issues inherent in floating-point arithmetic. This happens a lot, but the default display will generally hide what's happening at the 13th decimal place.

Edit to add: Based on Blamable's comment, small negative numbers are in fact causing a problem. I think the best thing to do in this case is to use numpy's around function and replace the second step above with:

 df2 = np.around(pd.rolling_sum(df,window=2),decimals=5)

That will force all small numbers (positive or negative) to zero. I think that's a pretty safe general solution. If all your data have integer values, you could recast them as integers, but that's not a very general solution, obviously.
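
A note for readers on newer pandas: rolling_sum was deprecated around pandas 0.18 and later removed, so on a current version the equivalent of the line above (assuming the same DataFrame df) would be:

 df2 = df.rolling(window=2).sum().round(5)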

Alternate answered 10/3, 2015 at 5:15 Comment(0)
Answer (score 6)

This issue also exists with the pd.DataFrame.rolling() method, and it also occurs if you include a large positive integer in a list of relatively smaller values with high precision.

import pandas as pd
x = pd.DataFrame([0, 1, 2, 2.23425304, 3.2342352934, 4.32423857239])
x.rolling(window=2).mean()
          0
0       NaN
1  0.500000
2  1.500000
3  2.117127
4  2.734244
5  3.779237

Replacing the third element (the 2) with 1E15...

x = pd.DataFrame([0, 1, 1E15, 2.23425304, 3.2342352934, 4.32423857239])
x.rolling(window=2).mean()
              0
0           NaN
1  5.000000e-01
2  5.000000e+14
3  5.000000e+14
4  2.750000e+00
5  3.794993e+00

It's more pronounced with the rolling standard deviation...

x = pd.DataFrame([0, 1, 2, 2.23425304, 3.2342352934, 4.32423857239])
x.rolling(window=2).std()
          0
0       NaN
1  0.707107
2  0.707107
3  0.165642
4  0.707094
5  0.770749

x = pd.DataFrame([0, 1, 1E15, 2.23425304, 3.2342352934, 4.32423857239])
x.rolling(window=2).std()
              0
0           NaN
1  7.071068e-01
2  7.071068e+14
3  7.071068e+14
4  1.186328e+07
5  1.186328e+07

The only solution seems to be to sacrifice the performance benefits for accuracy, i.e. compute each window's mean directly:

def rolling_window_slow(window, df):
    # Compute each window's mean from scratch, avoiding the
    # incremental-update drift (at the cost of speed).
    df_mean = []
    for i in range(len(df) - window + 1):  # +1 so the last full window is included
        df_mean.append(df.iloc[i:i + window, :].mean())
    return df_mean
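
A middle ground that stays within the pandas API, assuming the per-window overhead is acceptable, is to force a fresh computation for every window with rolling().apply, which (on reasonably recent pandas) evaluates the function on each window independently instead of updating a running aggregate:

import numpy as np
import pandas as pd

x = pd.DataFrame([0, 1, 1E15, 2.23425304, 3.2342352934, 4.32423857239])
# raw=True passes each window as a plain NumPy array (faster than a Series).
# Note: np.std defaults to ddof=0, while pandas' .std() uses ddof=1.
x.rolling(window=2).apply(np.mean, raw=True)
x.rolling(window=2).apply(lambda w: w.std(ddof=1), raw=True)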
Accusation answered 28/5, 2019 at 17:33 Comment(2)
This was such a frustrating bug to hunt down. And you're right, it impacted the .std() results way more than the .mean() results (the .mean() results weren't impacted at all in my case, so I was extra confused about why the .std() results were behaving so unexpectedly). – Adiaphorism
HOLY S**T! This is the most frustrating bug I have ever had the pain of finding. I was getting totally different std results running locally and in CI than in production pods. Took me at least 20 hours to trace this down, but it could have gone longer without finding this answer. THANK YOU! – Hardesty
