Pearson correlation and nan values
E

3

11

I have two CSV files with hundreds of columns, and I want to calculate the Pearson correlation coefficient and p-value for each pair of matching columns in the two files. The problem is that a missing value ("NaN") in a column causes an error. When I remove the NaN values with .dropna, the shapes of X and Y are sometimes no longer equal (depending on which values were removed), and I get this error:

"ValueError: operands could not be broadcast together with shapes (1020,) (1016,)"

Question: If row #8 in one CSV is NaN, is there a way to remove the same row from the other CSV too, so that the analysis for every column uses only the rows that have values in both CSV files?

import pandas as pd
import scipy
import csv
import numpy as np
from scipy import stats


df = pd.read_csv ("D:/Insitu-Daily.csv",header = None)
dg = pd.read_csv ("D:/Model-Daily.csv",header = None)

pearson_corr_set = []
pearson_p_set = []


for i in range(1,df.shape[1]):
    X= df[i].dropna(axis=0, how='any')
    Y= dg[i].dropna(axis=0, how='any')

    [pearson_corr, pearson_p] = scipy.stats.stats.pearsonr(X, Y)
    pearson_corr_set = np.append(pearson_corr_set,pearson_corr)
    pearson_p_set = np.append(pearson_p_set,pearson_p)

with open('D:/Results.csv','wb') as file:
    str1 = ",".join(str(i) for i in np.asarray(pearson_corr_set))
    file.write(str1)
    file.write('\n')    
    str1 = ",".join(str(i) for i in np.asarray(pearson_p_set))
    file.write(str1)
    file.write('\n') 
Eventide asked 2/2, 2018 at 22:23
B
13

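Here is one solution. First compute a boolean mask over your two NumPy arrays that is False wherever either value is NaN (the variable is named bad below, although it is actually True for the usable positions). Then apply the mask to both arrays so the NaN positions are ignored.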

import numpy as np

x = np.array([5, 1, 6, 9, 10, np.nan, 1, 1, np.nan])
y = np.array([4, 4, 5, np.nan, 6, 2, 1, 8, 1])

# True where neither x nor y is NaN
bad = ~np.logical_or(np.isnan(x), np.isnan(y))

np.compress(bad, x)  # array([ 5.,  1.,  6., 10.,  1.,  1.])
np.compress(bad, y)  # array([4., 4., 5., 6., 1., 8.])
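
The filtered arrays can then go straight into scipy.stats.pearsonr. A minimal sketch, reusing the x and y above; the helper name pearson_ignoring_nan is made up for illustration and is not part of NumPy or SciPy:

```python
import numpy as np
from scipy import stats


def pearson_ignoring_nan(x, y):
    """Pearson r and p-value using only positions where both inputs are non-NaN.

    Hypothetical helper for illustration; also returns how many pairs were used.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    good = ~(np.isnan(x) | np.isnan(y))  # True where both values are present
    r, p = stats.pearsonr(x[good], y[good])
    return r, p, int(good.sum())


x = np.array([5, 1, 6, 9, 10, np.nan, 1, 1, np.nan])
y = np.array([4, 4, 5, np.nan, 6, 2, 1, 8, 1])
r, p, n_used = pearson_ignoring_nan(x, y)
print(n_used)  # 6 pairs survive the mask
```

Returning the number of pairs used is a design choice, not a requirement: it makes it easy to spot columns where so many rows were dropped that the correlation is not very meaningful.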
Basia answered 2/2, 2018 at 22:40 Comment(1)
I think bad is actually good (it selects the entries that are not NaN), and you can just do x[bad]. - Breeks
E
2

Instead of dropna, try using isnan and boolean indexing:

for i in range(1, df.shape[1]):
    df_sub = df[i]
    dg_sub = dg[i]
    # mask is True for rows where both df and dg have a value (not NaN) in column i
    mask = ~np.isnan(df_sub) & ~np.isnan(dg_sub)
    X = df_sub[mask]  # Series of length mask.sum()
    Y = dg_sub[mask]
    # ... your code continues.

Hope that helps!

Essence answered 2/2, 2018 at 22:40 Comment(5)
@Eventide Sorry, which line? - Essence
I tried this and received this error: TypeError: unsupported operand type(s) for +: 'float' and 'str'. Then I changed my code to: [pearson_corr, pearson_p] = scipy.stats.stats.pearsonr(X[i], Y[i]). This way it calculates the correlation for the first column and then stops with this error: TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''. It seems it does not work in a loop! - Eventide
@Eventide I did see that there was one problem with how I was slicing the data with the mask, but I don't know if it will help you out. My suggestion is to make sure that your data arrays (dg/df) actually contain only numbers. If you print df[2] and dg[2], does it print a numpy array of float dtype? - Essence
I received the error again: "TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''". It prints the correlation for i=1, and on the next iteration of the loop it stops on the line "mask = ~np.isnan(df_sub) & ~np.isnan(dg_sub)" and cannot continue. - Eventide
Some of my missing values were written as "Nan" instead of "nan". I changed all of them to the same format, and surprisingly it works now! Thanks a lot. - Eventide
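
The TypeError from np.isnan in the comments above is what you typically see when a column has been read as text rather than floats, for example because a missing value is spelled "Nan". Rather than editing the files by hand, one option is to coerce each column to numbers before building the mask. A sketch with made-up data (the inline CSV text stands in for one of the real files), using pd.to_numeric with errors="coerce":

```python
import io

import pandas as pd

# Made-up data standing in for one of the CSV files. As far as I know, "Nan"
# is not among the spellings read_csv treats as missing by default, so
# column 1 is loaded as text instead of floats.
raw = io.StringIO("2018-01-01,1.5\n2018-01-02,Nan\n2018-01-03,2.5\n")
df = pd.read_csv(raw, header=None)
print(df[1].dtype)  # a non-numeric dtype, which is what trips up np.isnan

# Coerce to numbers; anything that cannot be parsed ends up as NaN.
col = pd.to_numeric(df[1], errors="coerce")
print(col.isna().tolist())  # [False, True, False]
```

Alternatively, the extra spellings can be declared when reading the file, e.g. pd.read_csv(path, header=None, na_values=["Nan"]), so the columns come out as floats from the start.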
T
0

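Why not combine them into a single DataFrame and use dropna on it? A row with a NaN in either file is then removed from both at once. Because both files are read with header=None, their column labels collide, so pass keys to concat to keep them apart:

newdf = pd.concat([df, dg], axis=1, keys=['df', 'dg'], sort=False)

Note that calling newdf.dropna() on the whole frame would drop a row as soon as any of the hundreds of columns has a NaN, so I suggest applying dropna to one pair of columns at a time in the for loop (skipping the first column, as in the question):

for col in df.columns[1:]:
    # keep only the rows where this column has a value in both files
    pair = newdf[[('df', col), ('dg', col)]].dropna()
    X = pair[('df', col)]
    Y = pair[('dg', col)]

    pearson_corr, pearson_p = stats.pearsonr(X, Y)
    pearson_corr_set = np.append(pearson_corr_set, pearson_corr)
    pearson_p_set = np.append(pearson_p_set, pearson_p)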

Also, you can write the results to CSV without joining the strings by hand; read up on pandas.DataFrame.to_csv.
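
For instance, the two result arrays can be put into a small DataFrame and written in one call. A sketch with made-up numbers; the file name Results.csv and the row labels pearson_r and p_value are illustrative only:

```python
import numpy as np
import pandas as pd

# Placeholder values standing in for the arrays built in the loop.
pearson_corr_set = np.array([0.91, -0.12, 0.47])
pearson_p_set = np.array([0.001, 0.64, 0.03])

# One row per statistic, one column per analysed CSV column,
# mirroring the two-line layout the question writes by hand.
results = pd.DataFrame(
    [pearson_corr_set, pearson_p_set],
    index=["pearson_r", "p_value"],
)
results.to_csv("Results.csv", header=False)
```

Unlike the hand-written version, this puts a label at the start of each row; pass index=False as well if you want only the numbers.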

Topside answered 18/7, 2019 at 8:13 Comment(0)
