Correlation coefficients and p values for all pairs of rows of a matrix

I have a matrix data with m rows and n columns. So far I have computed the correlation coefficients between all pairs of rows using np.corrcoef:

import numpy as np
data = np.array([[0, 1, -1], [0, -1, 1]])
np.corrcoef(data)

Now I would also like to look at the p-values of these coefficients. np.corrcoef doesn't provide these; scipy.stats.pearsonr does. However, scipy.stats.pearsonr does not accept a matrix as input.

Is there a quick way to compute both the coefficient and the p-value for all pairs of rows (arriving, e.g., at two m by m matrices, one with correlation coefficients, the other with the corresponding p-values) without having to go through all pairs manually?

Sought answered 26/6, 2014 at 13:39 Comment(1)
Is there a reason not to just iterate through the row pairs? It is a bit clumsy, but the code is not very long, and most probably it is not going to be a performance problem, as most of the time is spent calculating the Pearson correlations anyway. (I.e., do you mean "quick" as in your programming time or "quick" as in performance?) I suggest you take the trivial route and profile the actual performance.Thordia

I ran into the same problem today.

After half an hour of googling, I couldn't find any code in the numpy/scipy libraries that does this.

So I wrote my own version of corrcoef:

import numpy as np
from scipy.stats import pearsonr
from scipy.special import betainc  # scipy.stats.betai was removed in scipy 1.0; betainc replaces it

def corrcoef(matrix):
    r = np.corrcoef(matrix)
    rf = r[np.triu_indices(r.shape[0], 1)]       # upper-triangular correlations
    df = matrix.shape[1] - 2                     # degrees of freedom
    ts = rf * rf * (df / (1 - rf * rf))          # squared t-statistics
    pf = betainc(0.5 * df, 0.5, df / (df + ts))  # two-sided p-values
    p = np.zeros(shape=r.shape)
    p[np.triu_indices(p.shape[0], 1)] = pf
    p[np.tril_indices(p.shape[0], -1)] = p.T[np.tril_indices(p.shape[0], -1)]  # mirror to lower triangle
    p[np.diag_indices(p.shape[0])] = np.ones(p.shape[0])
    return r, p

def corrcoef_loop(matrix):
    rows = matrix.shape[0]
    r = np.ones(shape=(rows, rows))
    p = np.ones(shape=(rows, rows))
    for i in range(rows):
        for j in range(i + 1, rows):  # only the upper triangle, mirrored below
            r_, p_ = pearsonr(matrix[i], matrix[j])
            r[i, j] = r[j, i] = r_
            p[i, j] = p[j, i] = p_
    return r, p

The first version uses the result of np.corrcoef and then computes the p-values from the upper-triangular values of the correlation matrix.

The second, loop-based version just iterates over the row pairs and runs pearsonr on each pair.
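For reference, the p-value computation in the first version rests on the standard identity between the t-statistic of a correlation coefficient and the regularized incomplete beta function (textbook statistics, not anything scipy-specific): with df = n - 2 degrees of freedom,

$$ t = r\sqrt{\frac{df}{1-r^2}}, \qquad p = I_{\frac{df}{df+t^2}}\left(\tfrac{df}{2}, \tfrac{1}{2}\right) $$

which is exactly what the ts and betainc lines above compute.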

def test_corrcoef():
    a = np.array([
        [1, 2, 3, 4],
        [1, 3, 1, 4],
        [8, 3, 8, 5],
        [2, 3, 2, 1]])

    r1, p1 = corrcoef(a)
    r2, p2 = corrcoef_loop(a)

    assert np.allclose(r1, r2)
    assert np.allclose(p1, p2)

The test passes; the two versions produce the same results.

def test_timing():
    import time
    a = np.random.randn(100, 2500)

    def timing(func, *args, **kwargs):
        t0 = time.time()
        loops = 10
        for _ in range(loops):
            func(*args, **kwargs)
        print('{} takes {} seconds loops={}'.format(
            func.__name__, time.time() - t0, loops))

    timing(corrcoef, a)
    timing(corrcoef_loop, a)


if __name__ == '__main__':
    test_corrcoef()
    test_timing()

Performance on my MacBook for a 100x2500 matrix:

corrcoef takes 0.06608104705810547 seconds loops=10

corrcoef_loop takes 7.585600137710571 seconds loops=10

Squier answered 3/7, 2014 at 7:48 Comment(4)
This code fails with scipy 1.0.0 because the betai function has been removed after deprecation. One should use betainc in the scipy.special module instead.Blocking
Thanks for this solution, helped me a lot! Note that the pvalues in this implementation are set to 0 when you compare the same feature (it returns 0 on the diagonal). However, e.g., scipy.stats.pearsonr would return p=1 for these cases.Evacuee
@MartinBecker Do you mean the opposite? This implementation returns 1 in the diagonal while the pvalue in corr, pvalue = scipy.stats.pearsonr(x, x) where x is any array is equal to 0.0.Catastrophe
@Catastrophe Yes, that's what I meant ;) Thanks.Evacuee
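If you prefer scipy's convention for the self-comparison case discussed in the comments above, one possible tweak (my suggestion, not part of the original answer) is to overwrite the diagonal after calling corrcoef:

r, p = corrcoef(matrix)
p[np.diag_indices_from(p)] = 0.0  # scipy.stats.pearsonr(x, x) reports p = 0.0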

The most concise way of doing it might be the built-in method .corr in pandas, to get r:

In [79]:

import numpy as np
import pandas as pd
m = np.random.random((6, 6))
df = pd.DataFrame(m)
print(df.corr())
          0         1         2         3         4         5
0  1.000000 -0.282780  0.455210 -0.377936 -0.850840  0.190545
1 -0.282780  1.000000 -0.747979 -0.461637  0.270770  0.008815
2  0.455210 -0.747979  1.000000 -0.137078 -0.683991  0.557390
3 -0.377936 -0.461637 -0.137078  1.000000  0.511070 -0.801614
4 -0.850840  0.270770 -0.683991  0.511070  1.000000 -0.499247
5  0.190545  0.008815  0.557390 -0.801614 -0.499247  1.000000

To get the p-values using a t-test:

In [84]:

n = 6
r = df.corr()
t = r * np.sqrt((n - 2) / (1 - r * r))

import scipy.stats as ss
ss.t.cdf(t, n - 2)
Out[84]:
array([[ 1.        ,  0.2935682 ,  0.817826  ,  0.23004382,  0.01585695,
         0.64117917],
       [ 0.2935682 ,  1.        ,  0.04363408,  0.17836685,  0.69811422,
         0.50661121],
       [ 0.817826  ,  0.04363408,  1.        ,  0.39783538,  0.06700715,
         0.8747497 ],
       [ 0.23004382,  0.17836685,  0.39783538,  1.        ,  0.84993082,
         0.02756579],
       [ 0.01585695,  0.69811422,  0.06700715,  0.84993082,  1.        ,
         0.15667393],
       [ 0.64117917,  0.50661121,  0.8747497 ,  0.02756579,  0.15667393,
         1.        ]])
In [85]:

ss.pearsonr(m[:,0], m[:,1])
Out[85]:
(-0.28277983892175751, 0.58713640696703184)
In [86]:
#be careful about the difference between a 1-tailed and a 2-tailed test:
0.58713640696703184/2
Out[86]:
0.2935682034835159 #the value in the [0,1] cell of ss.t.cdf(t, n-2)
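If you want proper two-sided p-values for the whole matrix in one go, a small variation on the above (my addition, not part of the original answer; t and n as computed in In [84]) would be:

import numpy as np
import scipy.stats as ss

# sf is the survival function, 1 - cdf, so this is the usual two-sided p-value
p_two_sided = 2 * ss.t.sf(np.abs(t), n - 2)

On the diagonal t is infinite, so the two-sided p-value comes out as 0 there, matching scipy.stats.pearsonr's convention.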

Also, you can just use the scipy.stats.pearsonr you mentioned in the OP:

In [95]:
#returns a list of tuples of (r, p, index1, index2)
import itertools
[ss.pearsonr(m[:,i],m[:,j])+(i, j) for i, j in itertools.product(range(n), range(n))]
Out[95]:
[(1.0, 0.0, 0, 0),
 (-0.28277983892175751, 0.58713640696703184, 0, 1),
 (0.45521036266021014, 0.36434799921123057, 0, 2),
 (-0.3779357902414715, 0.46008763115463419, 0, 3),
 (-0.85083961671703368, 0.031713908656676448, 0, 4),
 (0.19054495489542525, 0.71764166168348287, 0, 5),
 (-0.28277983892175751, 0.58713640696703184, 1, 0),
 (1.0, 0.0, 1, 1),
#etc, etc
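If you would rather have the two m by m matrices the question asks for instead of a flat list of tuples, one straightforward way (my addition, not part of the original answer) is to collect the pearsonr results into arrays:

import numpy as np
import scipy.stats as ss

nvars = m.shape[1]  # as above, each column of m is one variable
r_mat = np.empty((nvars, nvars))
p_mat = np.empty((nvars, nvars))
for i in range(nvars):
    for j in range(nvars):
        r_mat[i, j], p_mat[i, j] = ss.pearsonr(m[:, i], m[:, j])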
Potbellied answered 28/6, 2014 at 17:3 Comment(1)
Just to clarify, your original function calculates the p-value of the two-sided test, and then you divide it by two to get the p-value of the one-sided test, is that correct? And yes, this is still not implemented in either numpy or scipy, 7 years after your post.Impartial

Sort of hackish and possibly inefficient, but I think this could be what you're looking for:

import numpy as np
import scipy.spatial.distance as dist
import scipy.stats as ss

data = np.array([[0, 1, -1], [0, -1, 1]])  # the data from the question

# Pearson's correlation coefficients
print(dist.squareform(dist.pdist(data, lambda x, y: ss.pearsonr(x, y)[0])))

# p-values
print(dist.squareform(dist.pdist(data, lambda x, y: ss.pearsonr(x, y)[1])))

Scipy's pdist is a very helpful function, primarily meant for finding pairwise distances between observations in n-dimensional space.

But it allows user-defined callable 'distance metrics', which can be exploited to carry out any kind of pairwise operation. The result is returned in condensed distance matrix form, which can easily be converted to the square matrix form using Scipy's 'squareform' function.
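One caveat (my note, not part of the original answer): squareform puts zeros on the diagonal, which is wrong for a correlation matrix, so you may want to fix the diagonal up afterwards:

r = dist.squareform(dist.pdist(data, lambda x, y: ss.pearsonr(x, y)[0]))
np.fill_diagonal(r, 1.0)  # self-correlation is 1, not the 0 that squareform inserts

(and similarly set the diagonal of the p-value matrix to whichever convention you prefer).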

Cardie answered 12/12, 2014 at 20:40 Comment(5)
Rather than passing your own Python function for computing the correlation coefficient, you can use metric='correlation' which is equal to (1 - correlation coefficient), and is coded in C (so should be much more efficient).Thriftless
He's looking for p-values as well. You won't get the p-values if you use the inbuilt correlation metric.Cardie
You can derive p-values from the correlation coefficients fairly easily (see jingchao's answer and here)Thriftless
(also CT Zhu's answer)Thriftless
This approach fulfilled my needs, and it seems straightforward to me. Please use whichever answer suits you best.Cardie

If you do not have to use the Pearson correlation coefficient, you can use the Spearman correlation coefficient: scipy.stats.spearmanr returns both the correlation matrix and the p-values. Note that Pearson's correlation requires your data to be normally distributed, whereas Spearman's correlation is a non-parametric measure and thus does not assume a normal distribution. An example:

from scipy import stats
import numpy as np

data = np.array([[0, 1, -1], [0, -1, 1], [0, 1, -1]])
print('np.corrcoef:', np.corrcoef(data))
cor, pval = stats.spearmanr(data.T)
print('stats.spearmanr - cor:\n', cor)
print('stats.spearmanr - pval:\n', pval)
Zebada answered 9/3, 2018 at 6:37 Comment(0)

This behaves just like corrcoef in MATLAB:

To make this function work, you will need to install pandas as well as scipy.

import numpy as np
from scipy.stats import pearsonr

# Compute the correlation coefficient matrix and the p-value matrix
# Similar to corrcoef in MATLAB
# dframe: pandas dataframe
def corrcoef(dframe):

    fmatrix = dframe.values
    rows, cols = fmatrix.shape

    r = np.ones((cols, cols), dtype=float)
    p = np.ones((cols, cols), dtype=float)

    for i in range(cols):
        for j in range(cols):
            if i == j:
                r_, p_ = 1., 1.
            else:
                r_, p_ = pearsonr(fmatrix[:, i], fmatrix[:, j])

            r[j, i] = r_
            p[j, i] = p_

    return r, p
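A quick usage sketch (with made-up data, just to illustrate the call):

import pandas as pd

df = pd.DataFrame(np.random.randn(50, 4), columns=list('abcd'))
r, p = corrcoef(df)  # r and p are 4x4 numpy arrays, one entry per column pair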
Eldrida answered 14/5, 2019 at 11:20 Comment(0)

Here is a minimal version of @CT Zhu's answer. We do not need pandas, as the correlation can be computed directly with numpy, which should be faster, since we skip the step of converting to a dataframe:

import numpy as np
import scipy.stats as ss

def corr_significance_two_sided(cc, nData):
    # We will divide by 0 if a correlation is exactly 1, but that is no problem:
    # we simply set the test statistic to (negative) infinity wherever it evaluates to NaN
    with np.errstate(divide='ignore', invalid='ignore'):
        t = -np.abs(cc) * np.sqrt((nData - 2) / (1 - cc**2))
        t[np.isnan(t)] = -np.inf  # note: t == np.nan is always False, so np.isnan is needed here
        return ss.t.cdf(t, nData - 2) * 2  # multiply by two to get the two-sided p-value

x = np.random.uniform(0, 1, (8, 1000))
cc = np.corrcoef(x)
pVal = corr_significance_two_sided(cc, 1000)
Impartial answered 1/2, 2021 at 16:48 Comment(0)
M
0

In case anyone has a similar problem but their matrix is a pd.DataFrame object, I wrote the following code:

import pandas as pd
from scipy.stats import pearsonr

def corr_pval(df):
    # each cell holds the (r, p) tuple that pearsonr returns for that column pair
    corr_pval_df = pd.DataFrame(index=df.columns, columns=df.columns)
    for i in range(len(corr_pval_df.index)):
        for c in range(len(corr_pval_df.columns)):
            corr_pval_df.iloc[i, c] = pearsonr(df[corr_pval_df.index[i]], df[corr_pval_df.columns[c]])
    return corr_pval_df

corr_pval(corr_df)  # corr_df being your dataframe
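Since each cell of the result holds an (r, p) tuple, you can split it into the two matrices from the question like this (my addition, not part of the original answer):

result = corr_pval(corr_df)
r_df = result.applymap(lambda rp: rp[0])  # correlation coefficients
p_df = result.applymap(lambda rp: rp[1])  # p-values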
Missile answered 10/1, 2022 at 14:54 Comment(0)

I have the solution! I needed to do this for arrays of size 2000x30,000, and using the above methods or a double loop was just not feasible, especially when it seemed like there was an obvious solution I was missing. So I looked into scipy's implementation of the Pearson correlation to figure out how they calculate the p-value and to see whether I could vectorize it for 2-d arrays. In the notes, they explain that they use the exact null distribution of the Pearson correlation coefficient r and calculate the two-sided p-value from it.

Under the assumption that x and y are drawn from independent normal distributions (so the population correlation coefficient is 0), the probability density function of the sample correlation coefficient r is:

$$ f(r) = \frac{\left ( 1-r^2 \right )^{\frac{n}{2}-2}}{B\left ( \frac{1}{2},\frac{n}{2}-1 \right )}$$

where n is the number of samples, and B is the beta function. This is sometimes referred to as the exact distribution of r. For a given sample with correlation coefficient r, the p-value is the probability that abs(r') of a random sample x' and y' drawn from a population with zero correlation would be greater than or equal to abs(r).

This is easily applied to 2-d arrays; I'm surprised that this functionality isn't built into np.corrcoef.

import numpy as np
from scipy import stats
N, L = 100, 2500
signals = np.random.random((N, L)).astype(np.float64)
R = np.corrcoef(signals)

The following was taken directly from the notes of Scipy's pearsonr (the exact distribution above is a beta distribution on [-1, 1] with both shape parameters equal to n/2 - 1, here L/2 - 1, which is what the loc=-1, scale=2 arguments express):

dist = stats.beta(L/2 - 1, L/2 - 1, loc=-1, scale=2)
P = 2*dist.cdf(-abs(R))

This method is near-instant, about as fast as np.corrcoef. I checked that the values are right by comparing against the double-loop way as well, and got this:

testR = np.empty_like(R)
testP = np.empty_like(P)

for i, sigA in enumerate(signals):
    for j, sigB in enumerate(signals):
        testR[i, j], testP[i, j] = stats.pearsonr(sigA, sigB)

print(np.abs(testP - P).mean(), np.abs(testR - R).mean(), np.abs(R - R.T).mean())
#1.3860685982216431e-14 2.8794441450755465e-17 8.459507779608882e-19
#There's probably some float rounding error that causes the discrepancies in the 14th/17th decimal place. 
Bidentate answered 8/12, 2023 at 18:41 Comment(0)
