Pearson correlation and nan values

Asked 2/2, 2018 at 22:23 Answered 18/7, 2019 at 8:13

python arrays numpy nan pearson-correlation

I have two CSV files with hundreds of columns, and I want to calculate the Pearson correlation coefficient and p-value for each pair of matching columns in the two files. The problem is that a missing value ("NaN") in a column causes an error. When I remove NaN values with .dropna, the shapes of X and Y sometimes differ (because different rows were removed from each), and I get this error:

"ValueError: operands could not be broadcast together with shapes (1020,) (1016,)"

Question: If row #8 in one CSV is NaN, is there a way to remove the same row from the other CSV as well, so that the analysis for every column uses only rows that have values in both files?

import pandas as pd
import scipy
import csv
import numpy as np
from scipy import stats


df = pd.read_csv ("D:/Insitu-Daily.csv",header = None)
dg = pd.read_csv ("D:/Model-Daily.csv",header = None)

pearson_corr_set = []
pearson_p_set = []


for i in range(1,df.shape[1]):
    X= df[i].dropna(axis=0, how='any')
    Y= dg[i].dropna(axis=0, how='any')

    [pearson_corr, pearson_p] = scipy.stats.stats.pearsonr(X, Y)
    pearson_corr_set = np.append(pearson_corr_set,pearson_corr)
    pearson_p_set = np.append(pearson_p_set,pearson_p)

with open('D:/Results.csv','wb') as file:
    str1 = ",".join(str(i) for i in np.asarray(pearson_corr_set))
    file.write(str1)
    file.write('\n')    
    str1 = ",".join(str(i) for i in np.asarray(pearson_p_set))
    file.write(str1)
    file.write('\n')

Eventide answered 2/2, 2018 at 22:23 Comment(0)

Here is one solution. First compute a boolean mask over your two NumPy arrays that flags the positions where either value is NaN, then invert it and use it to keep only the positions that are valid in both arrays.

x = np.array([5, 1, 6, 9, 10, np.nan, 1, 1, np.nan])
y = np.array([4, 4, 5, np.nan, 6, 2, 1, 8, 1])

# Note: despite its name, `bad` is True where BOTH x and y are valid (not NaN).
bad = ~np.logical_or(np.isnan(x), np.isnan(y))

np.compress(bad, x)  # array([  5.,   1.,   6.,  10.,   1.,   1.])
np.compress(bad, y)  # array([ 4.,  4.,  5.,  6.,  1.,  8.])
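
To connect this back to the question, the filtered arrays can be passed straight to scipy.stats.pearsonr. The following is a minimal sketch using the same sample x and y; the helper name pearson_ignoring_nan is made up for illustration and is not part of NumPy or SciPy.

```python
import numpy as np
from scipy import stats


def pearson_ignoring_nan(x, y):
    """Pearson r and p-value over positions where both x and y are not NaN.

    Hypothetical helper for illustration; not a NumPy/SciPy function.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    valid = ~(np.isnan(x) | np.isnan(y))  # True where both values are present
    r, p = stats.pearsonr(x[valid], y[valid])
    return r, p, int(valid.sum())


x = np.array([5, 1, 6, 9, 10, np.nan, 1, 1, np.nan])
y = np.array([4, 4, 5, np.nan, 6, 2, 1, 8, 1])

r, p, n_used = pearson_ignoring_nan(x, y)
print(n_used)  # number of pairs that survived the mask
```

Returning the number of pairs actually used is a design choice: when many rows are dropped, the p-value rests on a much smaller sample than the raw column length suggests.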

Basia answered 2/2, 2018 at 22:40 Comment(1)

I think bad is actually good (it is getting those that are not nans), and you can just do x[bad]. – Breeks 18/3, 2021 at 20:36

Instead of dropna, try using isnan and boolean indexing:

for i in range(1, df.shape[1]):
    df_sub = df[i]
    dg_sub = dg[i]
    mask = ~np.isnan(df_sub) & ~np.isnan(dg_sub)
    # mask is True for rows where neither df nor dg is NaN in column i.
    X = df_sub[mask]  # pandas Series of length mask.sum()
    Y = dg_sub[mask]  # take Y from dg, using the same mask
    # ... your code continues.

Hope that helps!

Essence answered 2/2, 2018 at 22:40 Comment(5)

@Eventide sorry which line? – Essence 5/2, 2018 at 16:18

I tried this and I received this error: TypeError: unsupported operand type(s) for +: 'float' and 'str'. Then I changed my code to: [pearson_corr, pearson_p] = scipy.stats.stats.pearsonr(X[i], Y[i]), this way it can calculate the correlation for the first column and then stop and give me this error: TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''. It seems it does not work in a loop!! – Eventide 5/2, 2018 at 16:29

@Eventide so I did see that there was one problem with how I was slicing the data with the mask, but I don't know if it will help you out. My suggestion is to make sure that your data arrays (dg/df) are actually matrices of only numbers. If you print df[2] and dg[2], does it print a numpy array of float dtype? – Essence 5/2, 2018 at 16:37

I again received an error which says "TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''". It prints the correlation for i=1 and in next turn in loop, stop on this line "mask = ~np.isnan(df_sub) & ~np.isnan(dg_sub)" and cannot continue the loop. – Eventide 6/2, 2018 at 17:22

I had some "nan" in the format of "Nan". I changed all of them to the same format, and surprisingly it works now!!!!! Thanks a lot. – Eventide 14/2, 2018 at 16:49
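
The isnan TypeError in the comments above is what NumPy raises when a column has object dtype, which happens when pandas cannot parse every cell as a number (for example, a spelling such as "Nan" that read_csv did not recognize as missing). One way to guard against that, sketched here on made-up data, is to coerce each column with pandas.to_numeric(errors="coerce") before building the mask. The column values below are invented for illustration.

```python
import numpy as np
import pandas as pd

# Made-up columns for illustration: one arrives as strings because it
# contains an unrecognized missing-value spelling ("Nan").
df_sub = pd.Series(["1.0", "Nan", "3.0", "4.0", "5.0"], dtype=object)
dg_sub = pd.Series([2.0, 4.1, np.nan, 8.2, 9.9])

# Turn anything that is not a number into a real NaN.
df_num = pd.to_numeric(df_sub, errors="coerce")
dg_num = pd.to_numeric(dg_sub, errors="coerce")

# notna() works regardless of dtype, unlike np.isnan on object arrays.
mask = df_num.notna() & dg_num.notna()
X = df_num[mask]
Y = dg_num[mask]
print(len(X), len(Y))
```

Note that errors="coerce" silently turns every unparseable cell into NaN, so it may be worth counting how many values were coerced before trusting the result.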

Why not combine them into a single DataFrame and call dropna on it? Every row that has a NaN in either file is then removed from both.

newdf = pd.concat([df, dg], axis=1, keys=["df", "dg"], sort=False)
newdf = newdf.dropna()  # dropna returns a new DataFrame, so assign the result

Note that this drops a row if any column in either file is NaN, not just the pair of columns being compared.

I suggest getting the list of column names of both DataFrames and using it in the for loop:

dfnames = list(df.columns.values)
dgnames = list(dg.columns.values)
for i in range(len(dfnames)):
    # keys= above keeps the two files apart even when their column labels are identical
    X = newdf["df"][dfnames[i]]
    Y = newdf["dg"][dgnames[i]]

    pearson_corr, pearson_p = stats.pearsonr(X, Y)
    pearson_corr_set = np.append(pearson_corr_set, pearson_corr)
    pearson_p_set = np.append(pearson_p_set, pearson_p)

Also, you can write the CSV without that for loop; read up on pandas.DataFrame.to_csv.
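
As a sketch of that last point, the two result lists can be placed in a DataFrame and written with to_csv in one call. The numbers and the in-memory buffer below are placeholders; in practice the target would be a file path such as the Results.csv from the question.

```python
import io
import pandas as pd

# Placeholder results standing in for pearson_corr_set and pearson_p_set.
pearson_corr_set = [0.91, -0.12, 0.47]
pearson_p_set = [0.001, 0.63, 0.04]

# One row of coefficients and one row of p-values, as in the question.
results = pd.DataFrame([pearson_corr_set, pearson_p_set])

buffer = io.StringIO()  # replace with a file path to write to disk
results.to_csv(buffer, header=False, index=False)
print(buffer.getvalue())
```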

Topside answered 18/7, 2019 at 8:13 Comment(0)
