DataFrame correlation produces NaN although its values are all integers

Asked 26/3, 2014 at 8:51 Answered 12/4, 2023 at 21:50

Solved python pandas dataframe nan correlation

I have a dataframe df that looks like:

                          0              1       2  3  4  5   6  7   8
0  2014-03-19T12:44:32.695Z  1395233072695  703425  0  2  1  13  5  21
1  2014-03-19T12:44:32.727Z  1395233072727  703425  0  2  1  13  5  21

The columns are all type int (except the first one):

0     object
1      int64
2      int64
3      int64
4      int64
5      int64
6      int64
7      int64
8      int64

But in my correlation, some columns seem to be NaN. When I call df.corr(), I get the following output:

          1    2         3          4    5    6          7         8
1  1.000000  NaN  0.018752  -0.550307  NaN  NaN   0.075191  0.775725
2       NaN  NaN       NaN        NaN  NaN  NaN        NaN       NaN
3  0.018752  NaN  1.000000  -0.067293  NaN  NaN  -0.579651  0.004593
...

Hime answered 26/3, 2014 at 8:51 Comment(2)

Can you provide some more data? Because you have a lot the same values in one column, then it is logical that you get NaN for the correlation – Overfly 26/3, 2014 at 9:40

Those columns do not change in value right now, yes... – Hime 26/3, 2014 at 9:50

Those columns do not change in value right now, yes

As, Joris points out you would expected NaN if the values do not vary. To see why take a look at correlation formula:

cor(i,j) = cov(i,j)/[stdev(i)*stdev(j)]

If the values of the ith or jth variable do not vary, then the respective standard deviation will be zero and so will the denominator of the fraction. Thus, the correlation will be NaN.

Diagnosis answered 26/3, 2014 at 9:58 Comment(2)

Hey, how would you check then that the two are related. By related, I mean if they move together in any way? Like if you would scatter them on a plot, there would be a lot of overlap. – Rambouillet 31/12, 2022 at 12:38

If one of the variables does not vary, it can't be related to a random variable. The variable with always have a standard deviation of zero and a zero covariance with any other variable. – Diagnosis 1/1, 2023 at 5:7

As @Karl D. mentioned, if the values in a column are the same, then there's no variance in that column, i.e. the values are constant and the correlation between a random variable and a constant is undefined. Another way to look at it is if a column with zero variance is plotted in a graph, it will be a horizontal line, which means it doesn't "move together" with any other plot on the same graph.

If we're interested in a correlation matrix, then one way is to simply drop the columns where correlation is not defined:

corr = df.corr()
corr_notna = ~corr.isna().all()
corr = corr.loc[corr_notna, corr_notna]

# or as a one-liner
corr = df.corr().dropna(how='all', axis=1).dropna(how='all')

The difference between a constant column and another random variable will have correlation = 1 or -1. For example,

df = pd.DataFrame({'A': [1]*5, 'B': [*[1]*4, 0]})
df.corr()

df.corr()

     A    B
A  NaN  NaN
B  NaN  1.0



df['B-A'] = df['B'] - df['A']
df[['B', 'B-A']].corr()

       B  B-A
B    1.0  1.0
B-A  1.0  1.0

Another way to get NaN is if there are not enough data to compute correlation especially if min_periods= was set to a number. An example:

df = pd.DataFrame({'A': [*range(9), float('nan')], 'B': range(10)})
df.corr(min_periods=10)


      A   B
A   NaN NaN
B   NaN 1.0

Georgeannageorgeanne answered 12/4, 2023 at 21:50 Comment(0)

Recommended topics

Hot tags