Pandas corr() method does not use all features

I have a dataframe with shape (335539, 26), so I have 26 features. But when I use

data.corr()

I get a 12 x 12 matrix.

What could be wrong?

Nemesis answered 16/8, 2018 at 12:37 Comment(8)
I don't think there is any way for us to help you with the details you've given. Please add an MCVE. - Fossil
You should add a dataframe to illustrate your issue. First, are you sure your dataframe is filled with numerical values? - Vachill
First guess is that some of your features are categorical, and they are not used for the correlation calculation... - Florineflorio
@Florineflorio Yes, I have categorical features. How can I transform them to get the right correlation matrix? - Nemesis
@IlyaLinetski Transform them to what? - Fossil
The "right" correlation matrix is the one you get; non-numerical features are automatically left out. - Florineflorio
@Fossil To numbers, I think. Can I just number them from 1 to n? Is that a good approach? How do you get a correlation matrix when you have categorical features? - Nemesis
@IlyaLinetski If you pick arbitrary integers to categorise your data, is it logical that you will get meaningful correlations? I'm not aware of any way to get a result other than what you have. - Fossil

Pearson correlation is meant for continuous data. Numbering the categories of a categorical feature from 1 to n makes little sense, because it imposes an arbitrary order and spacing on them. You can convert such features to numbers with one-hot encoding (dummy variables) instead. It is not clear which types of features you want to correlate. Between a nominal variable and a continuous variable, the result is better called a measure of association, and you can compute it with ANOVA, which has a built-in implementation in the scipy library. Between an ordinal variable and a continuous variable, you can use Spearman's correlation.
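As a rough illustration of the two tests mentioned above, here is a minimal sketch using scipy. The toy dataframe, its column names (city, size, price) and the rank mapping are all hypothetical and only stand in for a nominal, an ordinal and a continuous feature.

```python
import pandas as pd
from scipy import stats

# Hypothetical toy data: "city" is nominal, "size" is ordinal, "price" is continuous.
df = pd.DataFrame({
    "city": ["A", "A", "B", "B", "C", "C"],
    "size": ["small", "medium", "large", "small", "medium", "large"],
    "price": [10.0, 12.0, 20.0, 22.0, 30.0, 33.0],
})

# Nominal vs. continuous: one-way ANOVA across the groups defined by "city".
groups = [g["price"].to_numpy() for _, g in df.groupby("city")]
f_stat, p_anova = stats.f_oneway(*groups)

# Ordinal vs. continuous: map categories to their rank order, then use Spearman.
size_rank = df["size"].map({"small": 0, "medium": 1, "large": 2})
rho, p_spearman = stats.spearmanr(size_rank, df["price"])

print(f"ANOVA: F={f_stat:.2f}, p={p_anova:.4f}")
print(f"Spearman: rho={rho:.2f}, p={p_spearman:.4f}")
```

The rank mapping is only meaningful when the categories really have a natural order; for purely nominal features the ANOVA route is the safer choice.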

If you still want to use corr(), try converting your data with the methods mentioned above, although I am not sure you will get meaningful results.
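A minimal sketch of that conversion, assuming a small made-up dataframe (the column names age, income and color are hypothetical). It selects the numeric columns explicitly with select_dtypes, because newer pandas versions may raise an error on non-numeric columns instead of silently dropping them.

```python
import pandas as pd

# Hypothetical data: two numeric features and one categorical feature.
data = pd.DataFrame({
    "age": [25, 32, 47, 51],
    "income": [30.0, 45.0, 80.0, 82.0],
    "color": ["red", "blue", "red", "green"],
})

# Correlation over the numeric columns only: "color" is left out.
numeric_corr = data.select_dtypes(include="number").corr()

# One-hot encode "color" so that every column is numeric, then correlate.
encoded = pd.get_dummies(data, columns=["color"], dtype=float)
encoded_corr = encoded.corr()

print(numeric_corr.shape)   # 2 numeric features
print(encoded_corr.shape)   # 2 numeric features + 3 dummy columns
```

Each dummy column is a 0/1 indicator, so its Pearson correlation with a continuous feature is a point-biserial correlation; whether that answers the original question depends on what you want to measure.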

It is better to first formulate your question properly and then look for the specific test that fits your sample space.

corr() only uses numerical data, so you only get the correlation between your numerical features.

Tardiff answered 16/8, 2018 at 14:27 Comment(0)

It appears that some columns in 'data' contain non-numeric values and therefore have the 'object' data type; such columns do not show up in the corr() output. You can check the column types with:

data.dtypes
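As a sketch of this check, the snippet below lists the columns that are not numeric and would therefore be missing from the correlation matrix. The dataframe and its column names (a, b, c) are made up for illustration.

```python
import pandas as pd

# Hypothetical data: "b" holds numbers stored as strings, "c" is categorical.
data = pd.DataFrame({
    "a": [1, 2, 3],
    "b": ["4", "5", "x"],
    "c": ["u", "v", "w"],
})

# Columns whose dtype is not numeric are the ones corr() cannot use.
non_numeric = data.select_dtypes(exclude="number").columns.tolist()
print(non_numeric)
```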

To solve this, handle the categorical features with get_dummies or another one-hot encoding approach. Additionally, convert numerical features that are stored with the 'object' data type using the following code:

data['x'] = pd.to_numeric(data['x'], errors='coerce')

Keep in mind to convert to numeric before replacing any invalid values with np.nan:

# In 'Tenure', negative and non-numeric values both end up as NaN
data['x'] = pd.to_numeric(data['x'], errors='coerce').astype('float64')
data['Tenure'] = data['x'].apply(lambda x: x if x >= 0 else np.nan)
Wend answered 29/7, 2023 at 0:26 Comment(0)
