I have some categorical features in my data along with continuous ones. Is it a good or absolutely bad idea to one-hot encode the categorical features in order to find their correlation to the labels, along with the other continuous features?
There is a way to calculate a correlation coefficient without one-hot encoding the categorical variables. Cramér's V statistic is one method for measuring the association between two categorical variables. It can be calculated as follows, and this link is helpful: Using pandas, calculate Cramér's coefficient matrix. For the continuous variables, you can first bin them into categories using pandas' cut.
import numpy as np
import pandas as pd
import scipy.stats as ss
import seaborn as sns

print('Pandas version:', pd.__version__)
# Pandas version: 1.3.0

tips = sns.load_dataset("tips")

# Bin the continuous total_bill column into 5-unit-wide categories
tips["total_bill_cut"] = pd.cut(tips["total_bill"],
                                np.arange(0, 55, 5),
                                include_lowest=True,
                                right=False)

def cramers_v(confusion_matrix):
    """Calculate Cramér's V statistic for categorical-categorical association.
    Uses the bias correction from Bergsma and Wicher,
    Journal of the Korean Statistical Society 42 (2013): 323-328.
    """
    chi2 = ss.chi2_contingency(confusion_matrix)[0]
    n = confusion_matrix.sum()  # total observations (pass a numpy array)
    phi2 = chi2 / n
    r, k = confusion_matrix.shape
    # Bias-corrected phi^2 and effective numbers of rows/columns
    phi2corr = max(0, phi2 - ((k - 1) * (r - 1)) / (n - 1))
    rcorr = r - ((r - 1) ** 2) / (n - 1)
    kcorr = k - ((k - 1) ** 2) / (n - 1)
    return np.sqrt(phi2corr / min((kcorr - 1), (rcorr - 1)))
confusion_matrix = pd.crosstab(tips["day"], tips["time"])
cramers_v(confusion_matrix.values)
# Out[2]: 0.9386619340722221

confusion_matrix = pd.crosstab(tips["total_bill_cut"], tips["time"])
cramers_v(confusion_matrix.values)
# Out[3]: 0.1649870749498837
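Building on the function above, a full Cramér's V matrix over several categorical columns can be assembled along the lines of the linked article. A minimal sketch (the column list here is just the categorical columns of the tips dataset):

cat_cols = ["sex", "smoker", "day", "time", "total_bill_cut"]
cramers_matrix = pd.DataFrame(index=cat_cols, columns=cat_cols, dtype=float)
for col1 in cat_cols:
    for col2 in cat_cols:
        # Build the contingency table for each pair and score it
        confusion = pd.crosstab(tips[col1], tips[col2])
        cramers_matrix.loc[col1, col2] = cramers_v(confusion.values)
print(cramers_matrix)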
Please note: .as_matrix() has been deprecated in pandas since version 0.23.0; use .values instead. Also, in newer versions of pandas, if you pass the crosstab DataFrame itself rather than its .values array, use n = confusion_matrix.sum().sum() instead of n = confusion_matrix.sum(), since DataFrame.sum() returns a per-column Series. Maybe that's why the downvoting. – Magnetostriction

I found the phik library quite useful for calculating the correlation between categorical and interval features. It is also useful for binning numerical features. Try it once: phik documentation
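As a minimal sketch of typical phik usage, assuming the phik package is installed (importing it registers a phik_matrix() accessor on DataFrames, and interval_cols names the numeric columns to be binned):

import pandas as pd
import phik  # noqa: F401 -- registers the .phik_matrix() DataFrame accessor
import seaborn as sns

tips = sns.load_dataset("tips")
# Pairwise phik correlation; interval (numeric) columns are binned internally
print(tips.phik_matrix(interval_cols=["total_bill", "tip"]))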
I was looking to do the same thing in BigQuery. For numeric features you can use the built-in CORR(x, y) function. For categorical features, you can calculate it as cardinality(cat1 x cat2) / max(cardinality(cat1), cardinality(cat2)), which translates to the following SQL:
SELECT
COUNT(DISTINCT(CONCAT(cat1, cat2))) / GREATEST (COUNT(DISTINCT(cat1)), COUNT(DISTINCT(cat2))) as cat1_2,
COUNT(DISTINCT(CONCAT(cat1, cat3))) / GREATEST (COUNT(DISTINCT(cat1)), COUNT(DISTINCT(cat3))) as cat1_3,
....
FROM ...
A higher number means lower correlation: the ratio is 1 when one variable fully determines the other (the pair has no more distinct values than the larger column alone), and it approaches the smaller of the two cardinalities when the variables are independent.
I used the following Python script to generate the SQL:

import itertools

arr = range(1, 10)  # column suffixes: cat1 .. cat9
query = ',\n'.join(
    'COUNT(DISTINCT(CONCAT(cat{a}, cat{b}))) / GREATEST(COUNT(DISTINCT(cat{a})), COUNT(DISTINCT(cat{b}))) AS cat{a}_{b}'.format(a=a, b=b)
    for (a, b) in itertools.combinations(arr, 2)
)
query = 'SELECT\n  ' + query + '\nFROM `...`;'
print(query)
It should be straightforward to do the same thing in numpy or pandas.
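For example, a minimal pandas sketch of the same cardinality ratio (df, col_a, and col_b are hypothetical placeholders):

import pandas as pd

def cardinality_ratio(df, col_a, col_b):
    # Distinct (col_a, col_b) pairs divided by the larger single-column
    # cardinality: 1 means one column determines the other, and larger
    # values mean weaker association.
    n_pairs = len(df[[col_a, col_b]].drop_duplicates())
    return n_pairs / max(df[col_a].nunique(), df[col_b].nunique())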