Correlation coefficients and p values for all pairs of rows of a matrix

I have a matrix data with m rows and n columns. So far I have computed the correlation coefficients between all pairs of rows using np.corrcoef:

import numpy as np
data = np.array([[0, 1, -1], [0, -1, 1]])
np.corrcoef(data)

Now I would also like to look at the p-values of these coefficients. np.corrcoef doesn't provide these; scipy.stats.pearsonr does. However, scipy.stats.pearsonr does not accept a matrix as input.

Is there a quick way to compute both the coefficient and the p-value for all pairs of rows (arriving, e.g., at two m by m matrices, one with correlation coefficients, the other with the corresponding p-values) without having to go through all pairs manually?

Sought answered 26/6, 2014 at 13:39 Comment(1)
Is there a reason not to just iterate through the row pairs? It is a bit clumsy, but the code is not very long, and most probably it is not going to be a performance problem, as most of the time is spent calculating the Pearson correlations anyway. (I.e., do you mean "quick" as in your programming time or "quick" as in performance?) I suggest you take the trivial route and profile the actual performance.Thordia

I ran into the same problem today.

After half an hour of googling, I couldn't find any code in the numpy/scipy libraries that does this.

So I wrote my own version of corrcoef:

import numpy as np
from scipy.stats import pearsonr
from scipy.special import betainc  # scipy.stats.betai was removed in scipy 1.0; betainc replaces it

def corrcoef(matrix):
    r = np.corrcoef(matrix)
    rf = r[np.triu_indices(r.shape[0], 1)]       # upper-triangular correlations
    df = matrix.shape[1] - 2                     # degrees of freedom
    ts = rf * rf * (df / (1 - rf * rf))          # squared t-statistics
    pf = betainc(0.5 * df, 0.5, df / (df + ts))  # two-sided p-values
    p = np.zeros(shape=r.shape)
    p[np.triu_indices(p.shape[0], 1)] = pf
    p[np.tril_indices(p.shape[0], -1)] = p.T[np.tril_indices(p.shape[0], -1)]  # mirror to lower triangle
    p[np.diag_indices(p.shape[0])] = np.ones(p.shape[0])
    return r, p

def corrcoef_loop(matrix):
    rows = matrix.shape[0]
    r = np.ones(shape=(rows, rows))
    p = np.ones(shape=(rows, rows))
    for i in range(rows):
        for j in range(i + 1, rows):  # only the upper triangle, mirrored below
            r_, p_ = pearsonr(matrix[i], matrix[j])
            r[i, j] = r[j, i] = r_
            p[i, j] = p[j, i] = p_
    return r, p

The first version uses the result of np.corrcoef and then computes the p-values from the upper-triangular values of the correlation matrix.

The second, loop-based version just iterates over the row pairs and runs pearsonr on each pair.
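For reference, the p-value computation in the first version rests on the standard identity between the t-statistic of a correlation coefficient and the regularized incomplete beta function (textbook statistics, not anything scipy-specific): with df = n - 2 degrees of freedom,

$$ t = r\sqrt{\frac{df}{1-r^2}}, \qquad p = I_{\frac{df}{df+t^2}}\left(\tfrac{df}{2}, \tfrac{1}{2}\right) $$

which is exactly what the ts and betainc lines above compute.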

def test_corrcoef():
    a = np.array([
        [1, 2, 3, 4],
        [1, 3, 1, 4],
        [8, 3, 8, 5],
        [2, 3, 2, 1]])

    r1, p1 = corrcoef(a)
    r2, p2 = corrcoef_loop(a)

    assert np.allclose(r1, r2)
    assert np.allclose(p1, p2)

The test passes; the two versions produce the same results.

def test_timing():
    import time
    a = np.random.randn(100, 2500)

    def timing(func, *args, **kwargs):
        t0 = time.time()
        loops = 10
        for _ in range(loops):
            func(*args, **kwargs)
        print('{} takes {} seconds loops={}'.format(
            func.__name__, time.time() - t0, loops))

    timing(corrcoef, a)
    timing(corrcoef_loop, a)


if __name__ == '__main__':
    test_corrcoef()
    test_timing()

Performance on my MacBook for a 100x2500 matrix:

corrcoef takes 0.06608104705810547 seconds loops=10

corrcoef_loop takes 7.585600137710571 seconds loops=10

Squier answered 3/7, 2014 at 7:48 Comment(4)
This code fails with scipy 1.0.0 because the betai function has been removed after deprecation. One should use betainc in the scipy.special module instead.Blocking
Thanks for this solution, helped me a lot! Note that the pvalues in this implementation are set to 0 when you compare the same feature (it returns 0 on the diagonal). However, e.g., scipy.stats.pearsonr would return p=1 for these cases.Evacuee
@MartinBecker Do you mean the opposite? This implementation returns 1 in the diagonal while the pvalue in corr, pvalue = scipy.stats.pearsonr(x, x) where x is any array is equal to 0.0.Catastrophe
@Catastrophe Yes, that's what I meant ;) Thanks.Evacuee
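If you prefer scipy's convention for the self-comparison case discussed in the comments above, one possible tweak (my suggestion, not part of the original answer) is to overwrite the diagonal after calling corrcoef:

r, p = corrcoef(matrix)
p[np.diag_indices_from(p)] = 0.0  # scipy.stats.pearsonr(x, x) reports p = 0.0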

The most concise way of doing it might be the built-in method .corr in pandas, to get r:

In [79]:

import numpy as np
import pandas as pd
m = np.random.random((6, 6))
df = pd.DataFrame(m)
print(df.corr())
          0         1         2         3         4         5
0  1.000000 -0.282780  0.455210 -0.377936 -0.850840  0.190545
1 -0.282780  1.000000 -0.747979 -0.461637  0.270770  0.008815
2  0.455210 -0.747979  1.000000 -0.137078 -0.683991  0.557390
3 -0.377936 -0.461637 -0.137078  1.000000  0.511070 -0.801614
4 -0.850840  0.270770 -0.683991  0.511070  1.000000 -0.499247
5  0.190545  0.008815  0.557390 -0.801614 -0.499247  1.000000

To get the p-values using a t-test:

In [84]:

n = 6
r = df.corr()
t = r * np.sqrt((n - 2) / (1 - r * r))

import scipy.stats as ss
ss.t.cdf(t, n - 2)
Out[84]:
array([[ 1.        ,  0.2935682 ,  0.817826  ,  0.23004382,  0.01585695,
         0.64117917],
       [ 0.2935682 ,  1.        ,  0.04363408,  0.17836685,  0.69811422,
         0.50661121],
       [ 0.817826  ,  0.04363408,  1.        ,  0.39783538,  0.06700715,
         0.8747497 ],
       [ 0.23004382,  0.17836685,  0.39783538,  1.        ,  0.84993082,
         0.02756579],
       [ 0.01585695,  0.69811422,  0.06700715,  0.84993082,  1.        ,
         0.15667393],
       [ 0.64117917,  0.50661121,  0.8747497 ,  0.02756579,  0.15667393,
         1.        ]])
In [85]:

ss.pearsonr(m[:,0], m[:,1])
Out[85]:
(-0.28277983892175751, 0.58713640696703184)
In [86]:
#be careful about the difference between a 1-tailed and a 2-tailed test:
0.58713640696703184/2
Out[86]:
0.2935682034835159 #the value in the [0,1] cell of ss.t.cdf(t, n-2)
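If you want proper two-sided p-values for the whole matrix in one go, a small variation on the above (my addition, not part of the original answer; t and n as computed in In [84]) would be:

import numpy as np
import scipy.stats as ss

# sf is the survival function, 1 - cdf, so this is the usual two-sided p-value
p_two_sided = 2 * ss.t.sf(np.abs(t), n - 2)

On the diagonal t is infinite, so the two-sided p-value comes out as 0 there, matching scipy.stats.pearsonr's convention.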

Also, you can just use the scipy.stats.pearsonr you mentioned in the OP:

In [95]:
#returns a list of tuples of (r, p, index1, index2)
import itertools
[ss.pearsonr(m[:,i],m[:,j])+(i, j) for i, j in itertools.product(range(n), range(n))]
Out[95]:
[(1.0, 0.0, 0, 0),
 (-0.28277983892175751, 0.58713640696703184, 0, 1),
 (0.45521036266021014, 0.36434799921123057, 0, 2),
 (-0.3779357902414715, 0.46008763115463419, 0, 3),
 (-0.85083961671703368, 0.031713908656676448, 0, 4),
 (0.19054495489542525, 0.71764166168348287, 0, 5),
 (-0.28277983892175751, 0.58713640696703184, 1, 0),
 (1.0, 0.0, 1, 1),
#etc, etc
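If you would rather have the two m by m matrices the question asks for instead of a flat list of tuples, one straightforward way (my addition, not part of the original answer) is to collect the pearsonr results into arrays:

import numpy as np
import scipy.stats as ss

nvars = m.shape[1]  # as above, each column of m is one variable
r_mat = np.empty((nvars, nvars))
p_mat = np.empty((nvars, nvars))
for i in range(nvars):
    for j in range(nvars):
        r_mat[i, j], p_mat[i, j] = ss.pearsonr(m[:, i], m[:, j])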
Potbellied answered 28/6, 2014 at 17:3 Comment(1)
Just to clarify, your original function calculates the p-value of the two-sided test, and then you divide it by two to get the p-value of the one-sided test, is that correct? And yes, this is still not implemented in either numpy or scipy, 7 years after your post.Impartial

Sort of hackish and possibly inefficient, but I think this could be what you're looking for:

import numpy as np
import scipy.spatial.distance as dist
import scipy.stats as ss

data = np.array([[0, 1, -1], [0, -1, 1]])  # the data from the question

# Pearson's correlation coefficients
print(dist.squareform(dist.pdist(data, lambda x, y: ss.pearsonr(x, y)[0])))

# p-values
print(dist.squareform(dist.pdist(data, lambda x, y: ss.pearsonr(x, y)[1])))

Scipy's pdist is a very helpful function, primarily meant for finding pairwise distances between observations in n-dimensional space.

But it allows user-defined callable 'distance metrics', which can be exploited to carry out any kind of pairwise operation. The result is returned in condensed distance matrix form, which can easily be converted to the square matrix form using Scipy's 'squareform' function.
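One caveat (my note, not part of the original answer): squareform puts zeros on the diagonal, which is wrong for a correlation matrix, so you may want to fix the diagonal up afterwards:

r = dist.squareform(dist.pdist(data, lambda x, y: ss.pearsonr(x, y)[0]))
np.fill_diagonal(r, 1.0)  # self-correlation is 1, not the 0 that squareform inserts

(and similarly set the diagonal of the p-value matrix to whichever convention you prefer).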

Cardie answered 12/12, 2014 at 20:40 Comment(5)
Rather than passing your own Python function for computing the correlation coefficient, you can use metric='correlation' which is equal to (1 - correlation coefficient), and is coded in C (so should be much more efficient).Thriftless
He's looking for p-values as well. You won't get the p-values if you use the inbuilt correlation metric.Cardie
You can derive p-values from the correlation coefficients fairly easily (see jingchao's answer and here)Thriftless
(also CT Zhu's answer)Thriftless
This approach fulfilled my needs, and it seems straightforward to me. Please use whichever answer suits you best.Cardie

If you do not have to use the Pearson correlation coefficient, you can use the Spearman correlation coefficient: scipy.stats.spearmanr returns both the correlation matrix and the p-values. Note that Pearson's correlation requires your data to be normally distributed, whereas Spearman's correlation is a non-parametric measure and thus does not assume a normal distribution. An example:

from scipy import stats
import numpy as np

data = np.array([[0, 1, -1], [0, -1, 1], [0, 1, -1]])
print('np.corrcoef:', np.corrcoef(data))
cor, pval = stats.spearmanr(data.T)
print('stats.spearmanr - cor:\n', cor)
print('stats.spearmanr - pval:\n', pval)
Zebada answered 9/3, 2018 at 6:37 Comment(0)

This behaves just like corrcoef in MATLAB:

To make this function work, you will need to install pandas as well as scipy.

import numpy as np
from scipy.stats import pearsonr

# Compute the correlation coefficient matrix and the p-value matrix
# Similar to corrcoef in MATLAB
# dframe: pandas dataframe
def corrcoef(dframe):

    fmatrix = dframe.values
    rows, cols = fmatrix.shape

    r = np.ones((cols, cols), dtype=float)
    p = np.ones((cols, cols), dtype=float)

    for i in range(cols):
        for j in range(cols):
            if i == j:
                r_, p_ = 1., 1.
            else:
                r_, p_ = pearsonr(fmatrix[:, i], fmatrix[:, j])

            r[j, i] = r_
            p[j, i] = p_

    return r, p
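A quick usage sketch (with made-up data, just to illustrate the call):

import pandas as pd

df = pd.DataFrame(np.random.randn(50, 4), columns=list('abcd'))
r, p = corrcoef(df)  # r and p are 4x4 numpy arrays, one entry per column pair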
Eldrida answered 14/5, 2019 at 11:20 Comment(0)

Here is a minimal version of @CT Zhu's answer. We do not need pandas, as the correlation can be computed directly with numpy, which should be faster, since we skip the step of converting to a dataframe:

import numpy as np
import scipy.stats as ss

def corr_significance_two_sided(cc, nData):
    # We will divide by 0 if a correlation is exactly 1, but that is no problem:
    # we simply set the test statistic to (negative) infinity wherever it evaluates to NaN
    with np.errstate(divide='ignore', invalid='ignore'):
        t = -np.abs(cc) * np.sqrt((nData - 2) / (1 - cc**2))
        t[np.isnan(t)] = -np.inf  # note: t == np.nan is always False, so np.isnan is needed here
        return ss.t.cdf(t, nData - 2) * 2  # multiply by two to get the two-sided p-value

x = np.random.uniform(0, 1, (8, 1000))
cc = np.corrcoef(x)
pVal = corr_significance_two_sided(cc, 1000)
Impartial answered 1/2, 2021 at 16:48 Comment(0)
M
0

In case anyone has a similar problem but their matrix is a pd.DataFrame object, I wrote the following code:

import pandas as pd
from scipy.stats import pearsonr

def corr_pval(df):
    # each cell holds the (r, p) tuple that pearsonr returns for that column pair
    corr_pval_df = pd.DataFrame(index=df.columns, columns=df.columns)
    for i in range(len(corr_pval_df.index)):
        for c in range(len(corr_pval_df.columns)):
            corr_pval_df.iloc[i, c] = pearsonr(df[corr_pval_df.index[i]], df[corr_pval_df.columns[c]])
    return corr_pval_df

corr_pval(corr_df)  # corr_df being your dataframe
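Since each cell of the result holds an (r, p) tuple, you can split it into the two matrices from the question like this (my addition, not part of the original answer):

result = corr_pval(corr_df)
r_df = result.applymap(lambda rp: rp[0])  # correlation coefficients
p_df = result.applymap(lambda rp: rp[1])  # p-values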
Missile answered 10/1, 2022 at 14:54 Comment(0)

I have the solution! I needed to do this for arrays of size 2000x30,000, and using the above methods or a double loop was just not feasible, especially when it seemed like there was an obvious solution I was missing. So I looked into scipy's implementation of the Pearson correlation to figure out how they calculate the p-value and to see whether I could vectorize it for 2-d arrays. In the notes, they explain that they use the exact null distribution of the Pearson correlation coefficient r and calculate the two-sided p-value from it.

Under the assumption that x and y are drawn from independent normal distributions (so the population correlation coefficient is 0), the probability density function of the sample correlation coefficient r is:

$$ f(r) = \frac{\left ( 1-r^2 \right )^{\frac{n}{2}-2}}{B\left ( \frac{1}{2},\frac{n}{2}-1 \right )}$$

where n is the number of samples, and B is the beta function. This is sometimes referred to as the exact distribution of r. For a given sample with correlation coefficient r, the p-value is the probability that abs(r') of a random sample x' and y' drawn from a population with zero correlation would be greater than or equal to abs(r).

This is easily applied to 2-d arrays; I'm surprised that this functionality isn't built into np.corrcoef.

import numpy as np
from scipy import stats
N, L = 100, 2500
signals = np.random.random((N, L)).astype(np.float64)
R = np.corrcoef(signals)

The following was taken directly from the notes of Scipy's pearsonr (the exact distribution above is a beta distribution on [-1, 1] with both shape parameters equal to n/2 - 1, here L/2 - 1, which is what the loc=-1, scale=2 arguments express):

dist = stats.beta(L/2 - 1, L/2 - 1, loc=-1, scale=2)
P = 2*dist.cdf(-abs(R))

This method is near-instant, about as fast as np.corrcoef. I checked that the values are right by comparing against the double-loop way as well, and got this:

testR = np.empty_like(R)
testP = np.empty_like(P)

for i, sigA in enumerate(signals):
    for j, sigB in enumerate(signals):
        testR[i, j], testP[i, j] = stats.pearsonr(sigA, sigB)

print(np.abs(testP - P).mean(), np.abs(testR - R).mean(), np.abs(R - R.T).mean())
#1.3860685982216431e-14 2.8794441450755465e-17 8.459507779608882e-19
#There's probably some float rounding error that causes the discrepancies in the 14th/17th decimal place. 
Bidentate answered 8/12, 2023 at 18:41 Comment(0)
