Cross-correlation (time-lag-correlation) with pandas?

I have various time series that I want to correlate - or rather, cross-correlate - with each other, to find out at which time lag the correlation coefficient is greatest.

I found various questions and answers/links discussing how to do it with numpy, but those would mean that I have to turn my dataframes into numpy arrays. And since my time series often cover different periods, I am afraid that I will run into chaos.

Edit

The issue I am having with all the numpy/scipy methods is that they seem to lack awareness of the time-series nature of my data. When I correlate a time series that starts in, say, 1940 with one that starts in 1970, pandas corr knows this, whereas np.correlate just produces an array of 1020 entries (the length of the longer series) full of nan.

The various Q's on this subject indicate that there should be a way to solve the different-length issue, but so far I have seen no indication of how to use it for specific time periods. I just need to shift by up to 12 months in increments of 1, to find the time of maximum correlation within one year.

Edit2

Some minimal sample data:

import pandas as pd
import numpy as np
dfdates1 = pd.date_range('01/01/1980', '01/01/2000', freq='MS')
dfdata1 = np.random.randint(-30, 31, len(dfdates1)) / 10.0  # my real data is from measurements, but random values between -3 and 3 are a fitting stand-in
df1 = pd.DataFrame(dfdata1, index=dfdates1)
dfdates2 = pd.date_range('03/01/1990', '02/01/2013', freq='MS')
dfdata2 = np.random.randint(-30, 31, len(dfdates2)) / 10.0
df2 = pd.DataFrame(dfdata2, index=dfdates2)

Due to various processing steps, those dfs end up reindexed from 1940 to 2015. This should reproduce that:

bigdates = pd.date_range('01/01/1940', '01/01/2015', freq = 'MS')
big1 = pd.DataFrame(index = bigdates)
big2 = pd.DataFrame(index = bigdates)
big1 = pd.concat([big1, df1],axis = 1)
big2 = pd.concat([big2, df2],axis = 1)

This is what I get when I correlate with pandas and shift one dataset:

In [451]: corr_coeff_0 = big1[0].corr(big2[0])
In [452]: corr_coeff_0
Out[452]: 0.030543266378853299
In [453]: big2_shift = big2.shift(1)
In [454]: corr_coeff_1 = big1[0].corr(big2_shift[0])
In [455]: corr_coeff_1
Out[455]: 0.020788314779320523

And trying scipy:

In [456]: scicorr = scipy.signal.correlate(big1,big2,mode="full")
In [457]: scicorr
Out[457]: 
array([[ nan],
       [ nan],
       [ nan],
       ..., 
       [ nan],
       [ nan],
       [ nan]])

which, according to IPython's whos, is

scicorr               ndarray                       1801x1: 1801 elems, type `float64`, 14408 bytes

But I'd just like to have 12 entries. /Edit2

The idea I have come up with is to implement a time-lag correlation myself, like so:

corr_coeff_0 = df1['Data'].corr(df2['Data'])
df1_1month = df1.shift(1)
corr_coeff_1 = df1_1month['Data'].corr(df2['Data'])
df1_6month = df1.shift(6)
corr_coeff_6 = df1_6month['Data'].corr(df2['Data'])
...and so on

But this is probably slow, and I am probably trying to reinvent the wheel here. Edit: The above approach seems to work, and I have put it into a loop to go through all 12 months of a year, but I would still prefer a built-in method.
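For reference, the shift-and-correlate loop sketched above fits in a single list comprehension. The sketch below uses reproducible random data in place of the sample data (np.random.default_rng stands in for the older random-integer call; the variable names are invented):

```python
import numpy as np
import pandas as pd

# Reproducible stand-in for the measurement data on two different periods.
rng = np.random.default_rng(0)
dates1 = pd.date_range('1980-01-01', '2000-01-01', freq='MS')
dates2 = pd.date_range('1990-03-01', '2013-02-01', freq='MS')
df1 = pd.DataFrame(rng.integers(-30, 31, len(dates1)) / 10.0, index=dates1)
df2 = pd.DataFrame(rng.integers(-30, 31, len(dates2)) / 10.0, index=dates2)

# pandas aligns the two series on their DatetimeIndex, so the differing
# periods are handled automatically; the NaN pairs that shift() introduces
# are dropped by corr().
corrs = [df1[0].corr(df2[0].shift(lag)) for lag in range(12)]
best_lag = int(np.argmax(np.abs(corrs)))
```

Because corr() only uses index positions where both series have values, no manual trimming to the overlapping 1990-2000 window is needed.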

Rumph answered 16/10, 2015 at 13:12 Comment(10)
If you haven't seen these already, consider making use of scipy.signal.correlate and scipy.signal.correlate2d. I would say that converting to numpy arrays is probably your best bet. – Samadhi
I have seen those, but I want to avoid going to numpy, because after this step I would have to convert back to a dataframe for further calculations. I guess I will try to reinvent the wheel, then… – Rumph
That is a pretty common workflow as far as I know, converting to numpy and back. I don't see a need to hesitate in doing so. I would recommend writing your arrays to disk, so you don't repeat the conversions in your code. Check out pd.HDFStore and h5py. If you feel up to reinventing the wheel, go for it. – Samadhi
Btw, check into the pandas apply/ufunc object. You've probably found this already, though. You can actually put a numpy function into the pandas apply object, so this could do the trick. – Samadhi
Didn't know series.apply, thanks, that might come in handy later. The issue I am having with all the numpy/scipy methods is that they seem to lack awareness of the time-series nature of my data. When I correlate a time series that starts in say 1940 with one that starts in 1970, pandas corr knows this, whereas np.correlate just produces a 1020-entry array full of nan. I just need to shift to find the max correlation within one year. – Rumph
Could you share some data in a pastebin or gist or any place of preference? – Samadhi
I have edited in some fake sample data, and how/why it is shaped like it is, so that it behaves and looks the same as my real data. – Rumph
@Rumph did you get anywhere with this? Also interested in a pandas-based solution. Thanks in advance! – Squeeze
Funny that you ask… When I asked this question, I just went with the very rough thing I sketched in my question anyway, and then moved on, but as of today I have to do quite a lot of them, so I am working on something. The core is @DanielWatkins' nice little piece of code below. If I get to something useful, I'll add it here. – Rumph
@Rumph, did you have a look at this package? statsmodels.org/devel/examples/index.html (if you don't mind leaving pandas for those calculations) – Hart

As far as I can tell, there isn't a built-in method that does exactly what you are asking. But if you look at the source code for the pandas Series method autocorr, you can see you've got the right idea:

def autocorr(self, lag=1):
    """
    Lag-N autocorrelation

    Parameters
    ----------
    lag : int, default 1
        Number of lags to apply before performing autocorrelation.

    Returns
    -------
    autocorr : float
    """
    return self.corr(self.shift(lag))

So a simple time-lagged cross correlation function would be

def crosscorr(datax, datay, lag=0):
    """
    Lag-N cross correlation.

    Parameters
    ----------
    lag : int, default 0
    datax, datay : pandas.Series objects of equal length

    Returns
    -------
    crosscorr : float
    """
    return datax.corr(datay.shift(lag))

Then if you wanted to look at the cross correlations at each month, you could do

 xcov_monthly = [crosscorr(datax, datay, lag=i) for i in range(12)]
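As a quick sanity check (a made-up toy series, not from the answer): a periodic series correlated against a copy of itself delayed by two periods should peak at lag 2:

```python
import numpy as np
import pandas as pd

def crosscorr(datax, datay, lag=0):
    """Lag-N cross correlation of two pandas Series."""
    return datax.corr(datay.shift(lag))

# Toy data: x repeats with period 4, and y is x delayed by two periods.
x = pd.Series([0.0, 1.0, 0.0, -1.0] * 5)
y = x.shift(2)

xcov_monthly = [crosscorr(y, x, lag=i) for i in range(4)]
best_lag = int(np.argmax(xcov_monthly))  # 2: y trails x by two periods
```

At lag 0 the coefficient is -1.0 (a half-period shift flips the sign), and at lag 2 it is exactly 1.0, which is what picking the argmax recovers.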
Mossbunker answered 13/5, 2016 at 17:17 Comment(3)
Thanks, that helps quite a bit! Totally forgot that the built-in autocorrelation is essentially a time-lag correlation. I will see if I can work with that to produce some more useful output than just a list. – Rumph
Just found this -- great answer! – Greyso
When I apply this solution to my pandas series it gives nan even though the two series are different. – Yuri

To build on Andre's answer - if you only care about (lagged) correlation with the target, but want to test various lags (e.g. to see which lag gives the highest correlation), you can do something like this:

lagged_correlation = pd.DataFrame.from_dict(
    {x: [df[target].corr(df[x].shift(-t)) for t in range(max_lag)] for x in df.columns})

This way, each row corresponds to a different lag value, and each column corresponds to a different variable (one of them is the target itself, giving the autocorrelation).
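A self-contained sketch of this pattern (the column names, the random seed, and max_lag = 6 are invented for illustration; the 'a' column is constructed to trail the target by exactly 3 periods, so its peak should land in row 3):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
max_lag, target = 6, 'y'
df = pd.DataFrame({'y': rng.normal(size=120)})
df['a'] = df['y'].shift(3)  # 'a' trails the target by 3 periods

lagged_correlation = pd.DataFrame.from_dict(
    {x: [df[target].corr(df[x].shift(-t)) for t in range(max_lag)]
     for x in df.columns})

# Each row is a lag, each column a variable; idxmax on the absolute
# values recovers the lag of strongest correlation per column.
best_lags = lagged_correlation.abs().idxmax()
```

Here best_lags reports lag 3 for 'a' (where the coefficient is exactly 1.0) and lag 0 for the target itself, matching the autocorrelation.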

Banwell answered 3/4, 2019 at 8:42 Comment(0)

There is a better approach: you can create a function that shifts your dataframe before calling corr().

Take this dataframe as an example:

d = {'prcp': [0.1,0.2,0.3,0.0], 'stp': [0.0,0.1,0.2,0.3]}
df = pd.DataFrame(data=d)

>>> df
   prcp  stp
0   0.1  0.0
1   0.2  0.1
2   0.3  0.2
3   0.0  0.3

A function to shift the other columns (all except the target):

def df_shifted(df, target=None, lag=0):
    if not lag and not target:
        return df       
    new = {}
    for c in df.columns:
        if c == target:
            new[c] = df[target]
        else:
            new[c] = df[c].shift(periods=lag)
    return  pd.DataFrame(data=new)

Suppose that your goal is comparing prcp (the precipitation variable) with stp (atmospheric pressure).

If you correlate the columns as they are:

>>> df.corr()
      prcp  stp
prcp   1.0 -0.2
stp   -0.2  1.0

But if you shift all other columns by one period while keeping the target (prcp):

df_new = df_shifted(df, 'prcp', lag=-1)

>>> df_new
   prcp  stp
0   0.1  0.1
1   0.2  0.2
2   0.3  0.3
3   0.0  NaN

Note that the stp column is now shifted up by one period, so if you call corr() you get:

>>> df_new.corr()
      prcp  stp
prcp   1.0  1.0
stp    1.0  1.0

So you can repeat this with lag -1, -2, …, -n.
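Putting that into a loop (a sketch reusing the answer's toy data; the coeffs dict, which collects the prcp/stp coefficient for each lag, is an invented name):

```python
import pandas as pd

def df_shifted(df, target=None, lag=0):
    # Shift every column except the target column by `lag` periods.
    if not lag and not target:
        return df
    new = {}
    for c in df.columns:
        if c == target:
            new[c] = df[target]
        else:
            new[c] = df[c].shift(periods=lag)
    return pd.DataFrame(data=new)

d = {'prcp': [0.1, 0.2, 0.3, 0.0], 'stp': [0.0, 0.1, 0.2, 0.3]}
df = pd.DataFrame(data=d)

# prcp/stp correlation at lags 0, -1 and -2; DataFrame.corr() ignores
# the NaN rows that shifting creates on a pairwise basis.
coeffs = {lag: df_shifted(df, 'prcp', lag=lag).corr().loc['prcp', 'stp']
          for lag in (0, -1, -2)}
```

On this toy data the unshifted coefficient is -0.2 (as shown above), while lags -1 and -2 both give 1.0, because the remaining overlapping points line up perfectly.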

Wont answered 13/2, 2018 at 16:56 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.