As @JAgustinBarrachina pointed out, the accepted answer introduces a bias because it uses the Pearson correlation method under the hood.
Encoding each column as category codes may produce a mapping such as:
- media lawyer --> 0
- student --> 1
- professor --> 2
Because the Pearson method measures linear correlation, it treats these codes as numbers and the distance between them matters. From the algorithm's point of view, a media lawyer is more different from a professor (their distance is 2 - 0 = 2) than from a student (1 - 0 = 1). That is not true in this case: the categories have no inherent order, so the resulting correlation is biased.
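
To make the bias concrete, here is a minimal sketch on made-up data (the `hours_online` column and both encodings are assumptions for illustration only). The same professions are encoded with two equally arbitrary code assignments, and the Pearson coefficient changes drastically:

```python
import pandas as pd

# Hypothetical toy data, not taken from the question
df = pd.DataFrame({
    "profession": ["media lawyer", "student", "professor",
                   "student", "professor", "media lawyer"],
    "hours_online": [2, 8, 3, 9, 4, 1],
})

def pearson_with_encoding(mapping):
    """Encode professions with the given codes and return Pearson's r."""
    codes = df["profession"].map(mapping)
    return codes.corr(df["hours_online"])  # Series.corr defaults to Pearson

# Two arbitrary encodings of the same unordered categories
encoding_a = {"media lawyer": 0, "student": 1, "professor": 2}
encoding_b = {"media lawyer": 0, "professor": 1, "student": 2}

r_a = pearson_with_encoding(encoding_a)
r_b = pearson_with_encoding(encoding_b)
print(f"{r_a:.2f} {r_b:.2f}")  # roughly 0.27 vs 0.96
```

Nothing about the data changed between the two runs, only the labels' numbering, which is why a code-based Pearson coefficient cannot be trusted for unordered categories.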
According to the pandas docs, two other correlation methods are available: Kendall and Spearman. Both are rank-based, however, so they assume that the categories are ordered.
For example, a category such as revenue: ["low", "medium", "high"] could be considered ordered.
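
For such an ordered column, a possible approach is to declare the order explicitly so that the codes follow it, then use a rank-based method. This is only a sketch: the `spending` column and its values are invented for illustration.

```python
import pandas as pd

# Hypothetical data: an ordered category and a numeric column
df = pd.DataFrame({
    "revenue": ["low", "medium", "high", "medium", "low", "high"],
    "spending": [10, 25, 60, 30, 12, 55],
})

# Declare the order explicitly; otherwise pandas sorts the labels
# alphabetically ("high" < "low" < "medium"), which is not the intended order.
revenue_dtype = pd.CategoricalDtype(
    categories=["low", "medium", "high"], ordered=True
)
revenue_codes = df["revenue"].astype(revenue_dtype).cat.codes  # low=0, medium=1, high=2

# Spearman only uses the ranks of the codes, not their numeric distances
encoded = df.assign(revenue=revenue_codes)
rho = encoded.corr(method="spearman").loc["revenue", "spending"]
print(rho)
```

Because Spearman works on ranks, any monotonic relabeling of the codes (0, 1, 2 or 1, 5, 100) would give the same coefficient.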
If there is no order between the categories of a column, a method using Chi² and Cramér's V is more appropriate:

    import numpy as np
    import pandas as pd
    import scipy.stats as ss
    from pandas import DataFrame, Series

    profession_and_media = DataFrame(data={
        # Repeat the rows ten times to simulate significance
        "profession": ["media lawyer", "student", "student", "professor", "media lawyer"] * 10,
        "media": ["print", "online", "print", "online", "online"] * 10,
    })


    def cramers_corrected_stat(columnA: Series, columnB: Series) -> float:
        """Calculate Cramér's V statistic for categorical-categorical association.

        Uses the bias correction from Wicher Bergsma,
        Journal of the Korean Statistical Society 42 (2013): 323-328.
        """
        confusion_matrix = pd.crosstab(columnA, columnB)
        chi2 = ss.chi2_contingency(confusion_matrix)[0]
        n = confusion_matrix.to_numpy().sum()
        phi2 = chi2 / n
        r, k = confusion_matrix.shape
        # Bias-corrected phi² and table dimensions
        phi2corr = max(0, phi2 - ((k - 1) * (r - 1)) / (n - 1))
        rcorr = r - ((r - 1) ** 2) / (n - 1)
        kcorr = k - ((k - 1) ** 2) / (n - 1)
        return np.sqrt(phi2corr / min(kcorr - 1, rcorr - 1))


    def compute_category_correlation(df: DataFrame):
        """Compute the correlation between string columns of a DataFrame."""
        # DataFrame.corr needs numeric input, so encode each column as integer
        # category codes on a copy; the codes only act as labels for crosstab.
        encoded = df.apply(lambda column: column.astype("category").cat.codes)
        result = encoded.corr(method=cramers_corrected_stat)
        return result.style.background_gradient(cmap="Reds")


    compute_category_correlation(profession_and_media)

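This returns a styled table of the pairwise Cramér's V values, shaded in red according to the strength of the association.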