how to calculate correlation between rows in python pandas data frame
Asked Answered
W

3

7

I have large data frame, and I need to calculate efficiently correlation between the data frame rows and given value list. for example:

dfa= DataFrame(np.zeros((1,4)) ,columns=['a','b','c','d'])
dfa.ix[0] = [2,6,8,12]
a   b   c   d
2.0 6.0 8.0 12.0
dfb= DataFrame([[2,6,8,12],[1,3,4,6],[-1,-3,-4,-6]], columns=['a','b','c','d'])
    a   b   c   d
0   2   6   8   12
1   1   3   4   6
2  -1  -3  -4  -6

I expect to get:

0    1
1    0.5
2   -0.5

I tried many version, for example:

dfb.T.corrwith(dfa.T, axis=0)

But ll I get is a lot of Nan's

Want answered 2/11, 2017 at 12:7 Comment(2)
Can you please make clear the rows you are trying to find correlation between?Rehnberg
Be sure to check out NumPy based one for performance - stackoverflow.com/a/47075690.Reamer
T
6

First of all, note that the last 2 correlations are 1 and -1 and not 0.5 and -0.5 as you expected.

Solution

dfb.corrwith(dfa.iloc[0], axis=1)

Results

0    1.0
1    1.0
2   -1.0
dtype: float64
Theodosia answered 2/11, 2017 at 12:20 Comment(3)
No need to use for loop, you can dataframe.corrwith(series). I.e.: dfb.corrwith(dfa.iloc[0], axis=1)Protean
remove second solution, first is bestFossil
of course you are right about the correlation numbers. My original task was to find the slope of the linear regression, but that takes too long, so I am looking for a faster solution.Want
R
4

I think the number that you are trying to get is not correlation coefficient actually. The correlation between 1st and second row is 1 not 0.5. Correlation is a measure of linear relationship between variables. Here the two lists are strongly correlated with pearson's coefficient 1. If you plot row0 [2,6,8,12] against row1 [1,3,4,6] they all lie on a single line. Mean while if you want to find correlation between rows this should work:

NOTE: the correct correlation is [1,1,-1]

pd.DataFrame(dfb.transpose()).corr()

Rehnberg answered 2/11, 2017 at 12:27 Comment(1)
Thank you it's working. But I need only correlation between one row vs all the rest. Your solution calculates all row pairs correlations.Want
R
2

Here's one using the correlation defintion with NumPy tools meant for performance with corr2_coeff_rowwise -

pd.Series(corr2_coeff_rowwise(dfa.values,dfb.values))

Sample run -

In [74]: dfa
Out[74]: 
     a    b    c     d
0  2.0  6.0  8.0  12.0

In [75]: dfb
Out[75]: 
   a  b  c   d
0  2  6  8  12
1  1  3  4   6
2 -1 -3 -4  -6

In [76]: pd.Series(corr2_coeff_rowwise(dfa.values,dfb.values))
Out[76]: 
0    1.0
1    1.0
2   -1.0
dtype: float64

Runtime test

Case #1 : Large number of rows in dfb and 4 columns -

In [77]: dfa = pd.DataFrame(np.random.randint(1,100,(1,4)))

In [78]: dfb = pd.DataFrame(np.random.randint(1,100,(30000,4)))

# @sera's soln
In [79]: %timeit dfb.corrwith(dfa.iloc[0], axis=1)
1 loop, best of 3: 4.09 s per loop

In [80]: %timeit pd.Series(corr2_coeff_rowwise(dfa.values,dfb.values))
1000 loops, best of 3: 1.53 ms per loop

Case #2 : Decent number of rows in dfb and 400 columns -

In [83]: dfa = pd.DataFrame(np.random.randint(1,100,(1,400)))

In [85]: dfb = pd.DataFrame(np.random.randint(1,100,(300,400)))

In [86]: %timeit dfb.corrwith(dfa.iloc[0], axis=1)
10 loops, best of 3: 44.8 ms per loop

In [87]: %timeit pd.Series(corr2_coeff_rowwise(dfa.values,dfb.values))
1000 loops, best of 3: 635 µs per loop
Reamer answered 2/11, 2017 at 12:46 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.