How to tackle inconsistent results while using pandas rolling correlation?

Let me preface this by saying that reproducing the problem requires a large dataset, and that is itself part of the problem: I can't predict when the peculiarity is going to pop up. The data is too large (~13k rows, 2 cols) to paste into the question, so I have added a pastebin link at the end of the post.


For the past few days I have been facing a peculiar problem with pandas.core.window.rolling.Rolling.corr. I have a dataset on which I am trying to calculate rolling correlations. This is the problem:

While calculating rolling (window_size=100) correlations between two columns (a and b), some indices (one such index is 12981) give values near 0 (of order 1e-10), where the result should ideally be nan or inf (because all values in one column are constant within that window). However, if I calculate the standalone correlation for that index (i.e. the last 100 rows of data up to and including it), or run the rolling calculation on fewer rows (e.g. 300 or 1000 instead of ~13k), I get the correct result (nan or inf).

Expectation:

>>> import numpy as np
>>> import pandas as pd
>>> df = pd.read_csv('sample_corr_data.csv')    # link at the end; columns = ['a', 'b']
>>> df.a.tail(100).value_counts()

 0.000000    86
-0.000029     3
 0.000029     3
-0.000029     2
 0.000029     2
-0.000029     2
 0.000029     2
Name: a, dtype: int64

>>> df.b.tail(100).value_counts()     # all 100 values are same
 
6.0    100
Name: b, dtype: int64

>>> df.a.tail(100).corr(df.b.tail(100))
nan                                      # expected, because column 'b' has same value throughout

# Made sure of this using,
# 1. np.corrcoef, because pandas uses this internally to calculate pearson moments
>>> np.corrcoef(df.a.tail(100), df.b.tail(100))[0, 1]
nan

# 2. using custom function
>>> def pearson(a, b):
        # plain-formula Pearson correlation; returns nan when the denominator is 0 or non-finite
        n = a.size
        num = n*np.nansum(a*b) - np.nansum(a)*np.nansum(b)
        den = (n*np.nansum(a**2) - np.nansum(a)**2)*(n*np.nansum(b**2) - np.nansum(b)**2)
        return num/np.sqrt(den) if den * np.isfinite(den*num) else np.nan

>>> pearson(df.a.tail(100), df.b.tail(100))
nan

Now, the reality:

>>> df.a.rolling(100).corr(df.b).tail(3)
 
12979    7.761921e-07
12980    5.460717e-07
12981    2.755881e-10                    # This should have been NaN/inf !!

## Furthermore!!

>>> debug = df.tail(300)
>>> debug.a.rolling(100).corr(debug.b).tail(3)

12979    7.761921e-07
12980    5.460717e-07
12981            -inf                    # Got -inf, fine
dtype: float64

>>> debug = df.tail(3000)
>>> debug.a.rolling(100).corr(debug.b).tail(3)
 
12979    7.761921e-07
12980    5.460717e-07
12981             inf                     # Got +inf, still acceptable
dtype: float64

This continues up to 9369 rows:

>>> debug = df.tail(9369)
>>> debug.a.rolling(100).corr(debug.b).tail(3)

12979    7.761921e-07
12980    5.460717e-07
12981             inf
dtype: float64

# then
>>> debug = df.tail(9370)
>>> debug.a.rolling(100).corr(debug.b).tail(3)

12979    7.761921e-07
12980    5.460717e-07
12981    4.719615e-10                    # SPOOKY ACTION IN DISTANCE!!!
dtype: float64

>>> debug = df.tail(10000)
>>> debug.a.rolling(100).corr(debug.b).tail(3)
 
12979    7.761921e-07
12980    5.460717e-07
12981    1.198994e-10                    # SPOOKY ACTION IN DISTANCE!!!    
dtype: float64

Current Workaround

>>> df.a.rolling(100).apply(lambda x: x.corr(df.b.reindex(x.index))).tail(3)   # PREDICTABLY, VERY SLOW!

12979    7.761921e-07
12980    5.460717e-07
12981             NaN
Name: a, dtype: float64

# again this checks out using other methods,
>>> df.a.rolling(100).apply(lambda x: np.corrcoef(x, df.b.reindex(x.index))[0, 1]).tail(3)
 
12979    7.761921e-07
12980    5.460717e-07
12981             NaN
Name: a, dtype: float64

>>> df.a.rolling(100).apply(lambda x: pearson(x, df.b.reindex(x.index))).tail(3)

12979    7.761921e-07
12980    5.460717e-07
12981             NaN
Name: a, dtype: float64

As far as I understand, the result of series.rolling(n).corr(other_series) should match the following:

>>> def rolling_corr(series, other_series, n=100):
        return pd.Series(
            [np.nan]*(n-1)
            + [series[i-n:i].corr(other_series[i-n:i]) for i in range(n, series.size+1)]
        )

>>> rolling_corr(df.a, df.b).tail(3)

12979    7.761921e-07
12980    5.460717e-07
12981             NaN

At first I thought this was a floating-point arithmetic issue (initially, in some cases, I could fix it by rounding column 'a' to 5 decimal places, or by casting to float32), but in that case it would show up irrespective of the number of samples used. So there must be some issue with rolling itself, or at least rolling gives rise to floating-point issues depending on the size of the data. I checked the source code of rolling.corr, but could not find anything that would explain such inconsistencies. And now I am worried about how much of my past code is affected by this issue.
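
For reference, this is roughly what those ad-hoc checks look like (a sketch; as noted above, they only helped in some cases, so I don't consider them reliable fixes):

>>> df.a.round(5).rolling(100).corr(df.b).tail(3)                               # round 'a' to 5 decimals
>>> df.a.astype('float32').rolling(100).corr(df.b.astype('float32')).tail(3)    # cast both to float32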

What is the reason behind this? And how do I fix it? If this is happening because pandas perhaps prefers speed over accuracy (as suggested here), does that mean I can never reliably use pandas rolling operations on large samples? How do I know the size beyond which this inconsistency will appear?


sample_corr_data.csv: https://pastebin.com/jXXHSv3r

Tested in

  • Windows 10, python 3.9.1, pandas 1.2.2, (IPython 7.20)
  • Windows 10, python 3.8.2, pandas 1.0.5, (IPython 7.19)
  • Ubuntu 20.04, python 3.7.7, pandas 1.0.5, (GCC 7.3.0, standard REPL)
  • CentOS Linux 7 (Core), Python 2.7.5, pandas 0.23.4, (IPython 5.8.0)

Note: different OSes return different values at the said index, but all are finite and near 0.

Inheritor asked 13/3, 2021 at 15:17. Comments (6):
I think the issue is within corr because it involves std in the calculation. Something similar in numpy: #63773229, github.com/numpy/numpy/issues/9631 – Hailstone
@QuantChristo But why is this behaviour not consistent? As I've shown, rolling corr works up to 9369 rows, and applying corr manually yields the correct result. This inconsistency is bothering me. – Inheritor
According to the docs (and the source) you're right about the expected behaviour of the window through rolling. I did try the _get_corr(a, b) function manually on the tail of your dataframe; it returns nan all right... So maybe this has to do with the window or with "flex_binary_moment" (which I don't know anything about). – Fontes
@Fontes I have checked flex_binary_moment: it just applies the given function to a and b if both are Series, and for corr that function is _get_corr(a, b), which uses cov, var and std. Either the issue is with these functions, which I don't think is the case because they individually return correct results (as you have probably seen), or it lies elsewhere. – Inheritor
My guess is it has to do with the C code behind Rolling, but I need a definitive answer, otherwise I can't ever use rolling.corr again. And the alternatives are really slow. So before resorting to custom functions I need to be sure that there is no other option. – Inheritor
@Fontes I have created a chatroom: chat.stackoverflow.com/rooms/230066/rolling-corr; we can discuss your answer there if you want. – Inheritor

What if you replace the sums in your pearson formula with rolling sums?


def rolling_pearson(a, b, n):
    # Pearson correlation built from rolling sums (same formula as the custom pearson() above)
    a_sum = a.rolling(n).sum()
    b_sum = b.rolling(n).sum()
    ab_sum = (a*b).rolling(n).sum()
    aa_sum = (a**2).rolling(n).sum()
    bb_sum = (b**2).rolling(n).sum()

    num = n * ab_sum - a_sum * b_sum
    den = (n * aa_sum - a_sum**2) * (n * bb_sum - b_sum**2)
    return num / den**0.5

rolling_pearson(df.a, df.b, 100)

             ...     
12977    1.109077e-06
12978    9.555249e-07
12979    7.761921e-07
12980    5.460717e-07
12981             inf
Length: 12982, dtype: float64

Why is this so?

To answer this question I needed to check the implementation, because the variance of the last 100 samples of b is indeed zero, and the rolling correlation is computed as a.cov(b) / (a.var() * b.var())**0.5.
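
One quick way to see this (a sketch of that decomposition using the public rolling API, not the exact internal code path):

>>> cov = df.a.rolling(100).cov(df.b)
>>> (cov / (df.a.rolling(100).std() * df.b.rolling(100).std())).tail(3)   # last entry is finite only because b's rolling std is tiny rather than 0
>>> df.b.rolling(100).std().tail(1)                                       # small but nonzero (cf. the 6.49e-12 rolling variance shown below)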

After some searching I found the rolling variance implementation here; the method they are using is Welford's online algorithm. This algorithm is nice because you can add one sample using only one multiplication (the same as the cumulative-sum methods), and each update of the mean needs only a single division. Here it is rewritten in Python.

def welford_add(existingAggregate, newValue):
    # add one sample to the running (count, mean, M2) aggregate
    if pd.isna(newValue):
        return existingAggregate
    (count, mean, M2) = existingAggregate
    count += 1
    delta = newValue - mean
    mean += delta / count
    delta2 = newValue - mean
    M2 += delta * delta2
    return (count, mean, M2)

def welford_remove(existingAggregate, newValue):
    # remove one sample from the running aggregate (used when the window slides forward)
    if pd.isna(newValue):
        return existingAggregate
    (count, mean, M2) = existingAggregate
    count -= 1
    delta = newValue - mean
    mean -= delta / count
    delta2 = newValue - mean
    M2 -= delta * delta2
    return (count, mean, M2)

def finalize(existingAggregate):
    # turn (count, mean, M2) into (mean, population variance, sample variance)
    (count, mean, M2) = existingAggregate
    return (mean,
            M2 / count if count > 0 else None,
            M2 / (count - 1) if count > 1 else None)

In the pandas implementation they mention Kahan summation; this is important to get better precision in additions, but the results are not improved by it (I didn't check whether it is properly implemented or not).
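
(For reference, a minimal sketch of what Kahan, i.e. compensated, summation does in general; this is not pandas' actual C implementation:)

def kahan_sum(values):
    # Compensated summation: carry the rounding error of each addition forward.
    total = 0.0
    comp = 0.0                     # running compensation for lost low-order bits
    for x in values:
        y = x - comp
        t = total + y              # low-order bits of y may be lost here...
        comp = (t - total) - y     # ...but are recovered into comp
        total = t
    return total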

Applying the Welford algorithm with n=100

n = 100
s = (0, 0, 0)                                  # (count, mean, M2)
for i in range(len(df.b)):
    if i >= n:
        s = welford_remove(s, df.b[i-n])       # drop the sample leaving the window
    s = welford_add(s, df.b[i])                # add the new sample
finalize(s)

It gives

(6.000000000000152, 4.7853099260919405e-12, 4.8336463899918594e-12)

And the df.b.rolling(100).var() gives

0                 NaN
1                 NaN
2                 NaN
3                 NaN
4                 NaN
             ...     
12977    6.206061e-01
12978    4.703030e-01
12979    3.167677e-01
12980    1.600000e-01
12981    6.487273e-12
Name: b, Length: 12982, dtype: float64

The error here, about 6.49e-12, is slightly higher than the 4.83e-12 given by direct application of Welford's method.

On the other hand, the cumulative-sum formulation (df.b**2).rolling(n).sum() - df.b.rolling(n).sum()**2/n (i.e. n times the population variance computed via rolling sums) gives exactly 0.0 for the last entry:

0          NaN
1          NaN
2          NaN
3          NaN
4          NaN
         ...  
12977    61.44
12978    46.56
12979    31.36
12980    15.84
12981     0.00
Name: b, Length: 12982, dtype: float64
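
If you want to guard against such spurious finite values in practice, one option (just a sketch, with an arbitrarily chosen tolerance, not anything pandas provides) is to mask windows whose rolling variance is effectively zero:

n = 100
tol = 1e-8                                              # tolerance, chosen arbitrarily
corr = df.a.rolling(n).corr(df.b)
degenerate = (df.a.rolling(n).var() < tol) | (df.b.rolling(n).var() < tol)
corr = corr.mask(degenerate)                            # treat near-constant windows as undefined (NaN)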

I hope this explanation is satisfactory :)

Hartzell answered 18/3, 2021 at 21:18. Comments (1):
This is faster, but I mainly want to know why the inconsistency shows up in the first place. – Inheritor
