Let me preface this by saying that reproducing the problem requires a large dataset, and that is part of the problem: I can't predict when the peculiarity is going to pop up. The data is too large (~13k rows, 2 cols) to paste into the question, so I have added a pastebin link at the end of the post.
I have been facing a peculiar problem for the past few days with pandas.core.window.rolling.Rolling.corr. I have a dataset on which I am trying to calculate rolling correlations. This is the problem:
While calculating rolling (window_size=100) correlations between two columns (a and b), some indices (one such index is 12981) give near-0 values (of order 1e-10), where they should ideally return nan or inf, because all values in one column are constant. However, if I calculate the standalone correlation for that index (i.e. the last 100 rows of data up to and including the said index), or perform the rolling calculation on a smaller number of rows (e.g. 300 or 1000 as opposed to 13k), I get the correct result (i.e. nan or inf).
Expectation:
>>> df = pd.read_csv('sample_corr_data.csv') # link at the end, ## columns = ['a', 'b']
>>> df.a.tail(100).value_counts()   # repeated-looking keys are distinct floats that display identically at 6 decimals
 0.000000    86
-0.000029     3
 0.000029     3
-0.000029     2
 0.000029     2
-0.000029     2
 0.000029     2
Name: a, dtype: int64
>>> df.b.tail(100).value_counts() # all 100 values are same
6.0 100
Name: b, dtype: int64
>>> df.a.tail(100).corr(df.b.tail(100))
nan # expected, because column 'b' has same value throughout
# Made sure of this using:
# 1. np.corrcoef, because pandas uses this internally to calculate pearson moments
>>> np.corrcoef(df.a.tail(100), df.b.tail(100))[0, 1]
nan
# 2. using custom function
>>> def pearson(a, b):
...     # textbook computational formula for the Pearson coefficient
...     n = a.size
...     num = n*np.nansum(a*b) - np.nansum(a)*np.nansum(b)
...     den = (n*np.nansum(a**2) - np.nansum(a)**2)*(n*np.nansum(b**2) - np.nansum(b)**2)
...     return num/np.sqrt(den) if den * np.isfinite(den*num) else np.nan
>>> pearson(df.a.tail(100), df.b.tail(100))
nan
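(For reference, the custom function above implements the textbook computational formula for Pearson's r:

r = \frac{n\sum a_i b_i - \sum a_i \sum b_i}{\sqrt{\left(n\sum a_i^2 - \left(\sum a_i\right)^2\right)\left(n\sum b_i^2 - \left(\sum b_i\right)^2\right)}}

With a constant column, b_i = c for every i, the second factor under the square root is exactly n \cdot nc^2 - (nc)^2 = 0, and the numerator is exactly 0 as well, so nan from 0/0 is the mathematically expected result.)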
Now, the reality:
>>> df.a.rolling(100).corr(df.b).tail(3)
12979 7.761921e-07
12980 5.460717e-07
12981 2.755881e-10 # This should have been NaN/inf !!
## Furthermore!!
>>> debug = df.tail(300)
>>> debug.a.rolling(100).corr(debug.b).tail(3)
12979 7.761921e-07
12980 5.460717e-07
12981 -inf # Got -inf, fine
dtype: float64
>>> debug = df.tail(3000)
>>> debug.a.rolling(100).corr(debug.b).tail(3)
12979 7.761921e-07
12980 5.460717e-07
12981 inf # Got +inf, still acceptable
dtype: float64
This continues up to a tail of 9369 rows:
>>> debug = df.tail(9369)
>>> debug.a.rolling(100).corr(debug.b).tail(3)
12979 7.761921e-07
12980 5.460717e-07
12981 inf
dtype: float64
# then
>>> debug = df.tail(9370)
>>> debug.a.rolling(100).corr(debug.b).tail(3)
12979 7.761921e-07
12980 5.460717e-07
12981 4.719615e-10 # SPOOKY ACTION AT A DISTANCE!!!
dtype: float64
>>> debug = df.tail(10000)
>>> debug.a.rolling(100).corr(debug.b).tail(3)
12979 7.761921e-07
12980 5.460717e-07
12981 1.198994e-10 # SPOOKY ACTION AT A DISTANCE!!!
dtype: float64
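For what it's worth, the 9369/9370 boundary above can be located automatically with a small bisection helper of my own (assuming, as the outputs above suggest, that the flip from inf to a tiny finite value is monotone in the tail length):
>>> def corr_at_12981(tail_len):
...     # rolling corr evaluated at the last index, for a given tail length
...     d = df.tail(tail_len)
...     return d.a.rolling(100).corr(d.b).iloc[-1]
>>> lo, hi = 300, len(df)   # inf regime at 300, tiny-finite regime at full length
>>> while lo + 1 < hi:
...     mid = (lo + hi) // 2
...     v = corr_at_12981(mid)
...     if np.isfinite(v) and abs(v) < 1e-6:   # already in the wrong, tiny-finite regime
...         hi = mid
...     else:
...         lo = mid
>>> lo, hi   # should land on the 9369/9370 boundary found by hand above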
Current workaround:
>>> df.a.rolling(100).apply(lambda x: x.corr(df.b.reindex(x.index))).tail(3) # PREDICTABLY, VERY SLOW!
12979 7.761921e-07
12980 5.460717e-07
12981 NaN
Name: a, dtype: float64
# again this checks out using other methods,
>>> df.a.rolling(100).apply(lambda x: np.corrcoef(x, df.b.reindex(x.index))[0, 1]).tail(3)
12979 7.761921e-07
12980 5.460717e-07
12981 NaN
Name: a, dtype: float64
>>> df.a.rolling(100).apply(lambda x: pearson(x, df.b.reindex(x.index))).tail(3)
12979 7.761921e-07
12980 5.460717e-07
12981 NaN
Name: a, dtype: float64
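A cheaper guard I am considering (my own idea, not a verified fix): keep the fast rolling corr, but mask any window whose rolling std is effectively zero, since Pearson correlation is undefined there. Both the tolerance and the assumption that the corrupted rolling std comes out tiny rather than large are guesses:
>>> eps = 1e-8   # "effectively constant" tolerance; a judgment call
>>> raw = df.a.rolling(100).corr(df.b)
>>> degenerate = (df.a.rolling(100).std() < eps) | (df.b.rolling(100).std() < eps)
>>> raw.mask(degenerate).tail(3)   # should restore NaN at 12981 if the assumption holds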
As far as I understand, the result of series.rolling(n).corr(other_series) should match the output of the following:
>>> def rolling_corr(series, other_series, n=100):
...     # brute force: one standalone corr per window position
...     return pd.Series(
...         [np.nan]*(n-1) + [series[i-n:i].corr(other_series[i-n:i])
...                           for i in range(n, series.size+1)]
...     )
>>> rolling_corr(df.a, df.b).tail(3)
12979 7.761921e-07
12980 5.460717e-07
12981 NaN
At first I thought this was a floating-point arithmetic issue (because initially, in some cases, I could fix it by rounding column 'a' to 5 decimal places, or by casting to float32), but in that case the problem would be present irrespective of the number of samples used. So there must be some issue with rolling itself, or at least rolling gives rise to floating-point issues depending on the size of the data. I checked the source code of rolling.corr, but could not find anything that would explain such inconsistencies. And now I am worried about how much of my past code is plagued by this issue.
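My current speculation about the mechanism (a guess about the general shape of streaming window kernels, not a reading of pandas' actual Cython code): if the implementation maintains running sums by adding each entering value and subtracting each leaving one, rounding error contributed by earlier, larger-magnitude rows never fully cancels, so the computed variance of a perfectly constant window can come out tiny-but-nonzero instead of exactly zero. A toy demonstration of that effect:
>>> def streaming_tail_sum(values, window):
...     # moving-window sum maintained incrementally, the way a
...     # streaming rolling kernel might do it (toy model)
...     s = 0.0
...     for i, v in enumerate(values):
...         s += v                        # value entering the window
...         if i >= window:
...             s -= values[i - window]   # value leaving the window
...     return s                          # mathematically equals sum(values[-window:])
>>> rng = np.random.default_rng(0)
>>> data = np.concatenate([rng.normal(scale=1e6, size=10_000), np.zeros(100)])
>>> streaming_tail_sum(data, 100)   # exact answer is 0.0, but a tiny rounding residue survives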
What is the reason behind this? And how do I fix it? If this is happening because pandas prefers speed over accuracy (as suggested here), does that mean I can never reliably use pandas.rolling operations on a large sample? How do I know the size beyond which this inconsistency will appear?
sample_corr_data.csv: https://pastebin.com/jXXHSv3r
Tested in
- Windows 10, python 3.9.1, pandas 1.2.2, (IPython 7.20)
- Windows 10, python 3.8.2, pandas 1.0.5, (IPython 7.19)
- Ubuntu 20.04, python 3.7.7, pandas 1.0.5, (GCC 7.3.0, standard REPL)
- CentOS Linux 7 (Core), Python 2.7.5, pandas 0.23.4, (IPython 5.8.0)
Note: Different OSes return different values at the said index, but all are finite and near 0.
Comments:
- [...] corr, because it involves std in the calculation. Something similar in numpy: #63773229, github.com/numpy/numpy/issues/9631 – Hailstone
- [...] nan, all right... So maybe this has to do with the window or the "flex_binary_moment" (which I don't know anything about) – Fontes
- [...] flex_binary_moment just applies the given function on a and b if both a and b are Series. And for corr the function is _get_corr(a, b). It uses cov, var and std. Either the issue is with these functions, which I don't think is the case, because they individually return correct results, as you have probably seen. – Inheritor
- [...] rolling.corr ever again. And the alternatives are really slow. So I need to be sure, before using custom functions, that there is no other option. – Inheritor
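Regarding the _get_corr(a, b) comment above: the same cov/std composition can be reproduced from the public rolling API, which might help narrow down whether the problem lives in the individual kernels or in their combination (a diagnostic sketch, not pandas' internal code path):
>>> cov_ab = df.a.rolling(100).cov(df.b)
>>> corr_manual = cov_ab / (df.a.rolling(100).std() * df.b.rolling(100).std())
>>> corr_manual.tail(3)   # compare with df.a.rolling(100).corr(df.b).tail(3)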