how to calculate correlation between rows in python pandas data frame

Asked 2/11, 2017 at 12:7 Answered 2/11, 2017 at 12:46

Solved python performance pandas linear-regression correlation

I have large data frame, and I need to calculate efficiently correlation between the data frame rows and given value list. for example:

dfa= DataFrame(np.zeros((1,4)) ,columns=['a','b','c','d'])
dfa.ix[0] = [2,6,8,12]
a   b   c   d
2.0 6.0 8.0 12.0
dfb= DataFrame([[2,6,8,12],[1,3,4,6],[-1,-3,-4,-6]], columns=['a','b','c','d'])
    a   b   c   d
0   2   6   8   12
1   1   3   4   6
2  -1  -3  -4  -6

I expect to get:

0    1
1    0.5
2   -0.5

I tried many version, for example:

dfb.T.corrwith(dfa.T, axis=0)

But ll I get is a lot of Nan's

Want answered 2/11, 2017 at 12:7 Comment(2)

Can you please make clear the rows you are trying to find correlation between? – Rehnberg 2/11, 2017 at 12:12

Be sure to check out NumPy based one for performance - stackoverflow.com/a/47075690. – Reamer 2/11, 2017 at 12:52

First of all, note that the last 2 correlations are 1 and -1 and not 0.5 and -0.5 as you expected.

Solution

dfb.corrwith(dfa.iloc[0], axis=1)

Results

0    1.0
1    1.0
2   -1.0
dtype: float64

Theodosia answered 2/11, 2017 at 12:20 Comment(3)

No need to use for loop, you can dataframe.corrwith(series). I.e.: dfb.corrwith(dfa.iloc[0], axis=1) – Protean 2/11, 2017 at 12:28

remove second solution, first is best – Fossil 2/11, 2017 at 12:30

of course you are right about the correlation numbers. My original task was to find the slope of the linear regression, but that takes too long, so I am looking for a faster solution. – Want 2/11, 2017 at 12:47

I think the number that you are trying to get is not correlation coefficient actually. The correlation between 1st and second row is 1 not 0.5. Correlation is a measure of linear relationship between variables. Here the two lists are strongly correlated with pearson's coefficient 1. If you plot row0 [2,6,8,12] against row1 [1,3,4,6] they all lie on a single line. Mean while if you want to find correlation between rows this should work:

NOTE: the correct correlation is [1,1,-1]

pd.DataFrame(dfb.transpose()).corr()

Rehnberg answered 2/11, 2017 at 12:27 Comment(1)

Thank you it's working. But I need only correlation between one row vs all the rest. Your solution calculates all row pairs correlations. – Want 2/11, 2017 at 12:44

Here's one using the correlation defintion with NumPy tools meant for performance with corr2_coeff_rowwise -

pd.Series(corr2_coeff_rowwise(dfa.values,dfb.values))

Sample run -

In [74]: dfa
Out[74]: 
     a    b    c     d
0  2.0  6.0  8.0  12.0

In [75]: dfb
Out[75]: 
   a  b  c   d
0  2  6  8  12
1  1  3  4   6
2 -1 -3 -4  -6

In [76]: pd.Series(corr2_coeff_rowwise(dfa.values,dfb.values))
Out[76]: 
0    1.0
1    1.0
2   -1.0
dtype: float64

Runtime test

Case #1 : Large number of rows in dfb and 4 columns -

In [77]: dfa = pd.DataFrame(np.random.randint(1,100,(1,4)))

In [78]: dfb = pd.DataFrame(np.random.randint(1,100,(30000,4)))

# @sera's soln
In [79]: %timeit dfb.corrwith(dfa.iloc[0], axis=1)
1 loop, best of 3: 4.09 s per loop

In [80]: %timeit pd.Series(corr2_coeff_rowwise(dfa.values,dfb.values))
1000 loops, best of 3: 1.53 ms per loop

Case #2 : Decent number of rows in dfb and 400 columns -

In [83]: dfa = pd.DataFrame(np.random.randint(1,100,(1,400)))

In [85]: dfb = pd.DataFrame(np.random.randint(1,100,(300,400)))

In [86]: %timeit dfb.corrwith(dfa.iloc[0], axis=1)
10 loops, best of 3: 44.8 ms per loop

In [87]: %timeit pd.Series(corr2_coeff_rowwise(dfa.values,dfb.values))
1000 loops, best of 3: 635 µs per loop

Reamer answered 2/11, 2017 at 12:46 Comment(0)

Recommended topics

Hot tags