I have some categorical features in my data along with continuous ones. Is it a good or absolutely bad idea to one-hot encode the categorical features in order to find their correlation to the labels, along with the other continuous features?
There is a way to calculate a correlation coefficient without one-hot encoding the categorical variables. Cramér's V statistic is one method for measuring the association between two categorical variables. It can be calculated as follows, and this link is helpful: Using pandas, calculate Cramér's coefficient matrix. For the continuous variables, you can first bin them into categories using pandas' cut.
import numpy as np
import pandas as pd
import scipy.stats as ss
import seaborn as sns

print('Pandas version:', pd.__version__)
# Pandas version: 1.3.0

tips = sns.load_dataset("tips")

# Bin the continuous total_bill column into 5-unit-wide categories
tips["total_bill_cut"] = pd.cut(tips["total_bill"],
                                np.arange(0, 55, 5),
                                include_lowest=True,
                                right=False)

def cramers_v(confusion_matrix):
    """Calculate Cramér's V statistic for categorical-categorical association.
    Uses the bias correction from Bergsma and Wicher,
    Journal of the Korean Statistical Society 42 (2013): 323-328.
    """
    chi2 = ss.chi2_contingency(confusion_matrix)[0]
    n = confusion_matrix.sum()  # total observations (pass a numpy array)
    phi2 = chi2 / n
    r, k = confusion_matrix.shape
    # Bias-corrected phi^2 and effective numbers of rows/columns
    phi2corr = max(0, phi2 - ((k - 1) * (r - 1)) / (n - 1))
    rcorr = r - ((r - 1) ** 2) / (n - 1)
    kcorr = k - ((k - 1) ** 2) / (n - 1)
    return np.sqrt(phi2corr / min((kcorr - 1), (rcorr - 1)))
confusion_matrix = pd.crosstab(tips["day"], tips["time"])
cramers_v(confusion_matrix.values)
# Out[2]: 0.9386619340722221

confusion_matrix = pd.crosstab(tips["total_bill_cut"], tips["time"])
cramers_v(confusion_matrix.values)
# Out[3]: 0.1649870749498837
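Building on the function above, a full Cramér's V matrix over several categorical columns can be assembled along the lines of the linked article. A minimal sketch (the column list here is just the categorical columns of the tips dataset):

cat_cols = ["sex", "smoker", "day", "time", "total_bill_cut"]
cramers_matrix = pd.DataFrame(index=cat_cols, columns=cat_cols, dtype=float)
for col1 in cat_cols:
    for col2 in cat_cols:
        # Build the contingency table for each pair and score it
        confusion = pd.crosstab(tips[col1], tips[col2])
        cramers_matrix.loc[col1, col2] = cramers_v(confusion.values)
print(cramers_matrix)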
Please note: .as_matrix() has been deprecated in pandas since version 0.23.0; use .values instead. Also, in newer versions of pandas, if you pass the crosstab DataFrame itself rather than its .values array, use n = confusion_matrix.sum().sum() instead of n = confusion_matrix.sum(), since DataFrame.sum() returns a per-column Series. Maybe that's why the downvoting. – Magnetostriction

I found the phik library quite useful for calculating the correlation between categorical and interval features. It is also useful for binning numerical features. Try it once: phik documentation
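As a minimal sketch of typical phik usage, assuming the phik package is installed (importing it registers a phik_matrix() accessor on DataFrames, and interval_cols names the numeric columns to be binned):

import pandas as pd
import phik  # noqa: F401 -- registers the .phik_matrix() DataFrame accessor
import seaborn as sns

tips = sns.load_dataset("tips")
# Pairwise phik correlation; interval (numeric) columns are binned internally
print(tips.phik_matrix(interval_cols=["total_bill", "tip"]))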
I was looking to do the same thing in BigQuery. For numeric features you can use the built-in CORR(x, y) function. For categorical features, you can calculate it as cardinality(cat1 x cat2) / max(cardinality(cat1), cardinality(cat2)), which translates to the following SQL:
SELECT
COUNT(DISTINCT(CONCAT(cat1, cat2))) / GREATEST (COUNT(DISTINCT(cat1)), COUNT(DISTINCT(cat2))) as cat1_2,
COUNT(DISTINCT(CONCAT(cat1, cat3))) / GREATEST (COUNT(DISTINCT(cat1)), COUNT(DISTINCT(cat3))) as cat1_3,
....
FROM ...
A higher number means lower correlation: the ratio is 1 when one variable fully determines the other (the pair has no more distinct values than the larger column alone), and it approaches the smaller of the two cardinalities when the variables are independent.
I used the following Python script to generate the SQL:

import itertools

arr = range(1, 10)  # column suffixes: cat1 .. cat9
query = ',\n'.join(
    'COUNT(DISTINCT(CONCAT(cat{a}, cat{b}))) / GREATEST(COUNT(DISTINCT(cat{a})), COUNT(DISTINCT(cat{b}))) AS cat{a}_{b}'.format(a=a, b=b)
    for (a, b) in itertools.combinations(arr, 2)
)
query = 'SELECT\n  ' + query + '\nFROM `...`;'
print(query)
It should be straightforward to do the same thing in numpy or pandas.
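For example, a minimal pandas sketch of the same cardinality ratio (df, col_a, and col_b are hypothetical placeholders):

import pandas as pd

def cardinality_ratio(df, col_a, col_b):
    # Distinct (col_a, col_b) pairs divided by the larger single-column
    # cardinality: 1 means one column determines the other, and larger
    # values mean weaker association.
    n_pairs = len(df[[col_a, col_b]].drop_duplicates())
    return n_pairs / max(df[col_a].nunique(), df[col_b].nunique())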