Correlation among multiple categorical variables

Asked 30/12, 2017 at 15:43 Answered 31/10, 2023 at 21:37

Solved python pandas heatmap correlation categorical-data

I have a data set made of 22 categorical variables (non-ordered). I would like to visualize their correlation in a nice heatmap. Since the Pandas built-in function

DataFrame.corr(method='pearson', min_periods=1)

only implements correlation coefficients for numerical variables (Pearson, Kendall, Spearman), I have to aggregate it myself to perform a chi-square test or something like it, and I am not quite sure which function to use to do it in one elegant step (rather than iterating through all the cat1*cat2 pairs). To be clear, this is what I would like to end up with (a dataframe):

         cat1  cat2  cat3  
  cat1|  coef  coef  coef  
  cat2|  coef  coef  coef
  cat3|  coef  coef  coef

Any ideas with pd.pivot_table or something in the same vein?

Davina answered 30/12, 2017 at 15:43 Comment(0)

You can use pd.factorize

df.apply(lambda x : pd.factorize(x)[0]).corr(method='pearson', min_periods=1)
Out[32]: 
     a    c    d
a  1.0  1.0  1.0
c  1.0  1.0  1.0
d  1.0  1.0  1.0

Data input

df=pd.DataFrame({'a':['a','b','c'],'c':['a','b','c'],'d':['a','b','c']})

Update

from scipy.stats import chisquare

df=df.apply(lambda x : pd.factorize(x)[0])+1

pd.DataFrame([chisquare(df[x].values,f_exp=df.values.T,axis=1)[0] for x in df])

Out[123]: 
     0    1    2    3
0  0.0  0.0  0.0  0.0
1  0.0  0.0  0.0  0.0
2  0.0  0.0  0.0  0.0
3  0.0  0.0  0.0  0.0

df=pd.DataFrame({'a':['a','d','c'],'c':['a','b','c'],'d':['a','b','c'],'e':['a','b','c']})

Oystercatcher answered 30/12, 2017 at 15:49 Comment(6)

sounds like a good plan but, from what I understood, I can't use pearson on categorical data. Would it be possible to modify this code to end up with chi-squared? – Davina 30/12, 2017 at 16:5

@DavidZarebski docs.scipy.org/doc/scipy/reference/generated/… – Oystercatcher 30/12, 2017 at 16:8

I saw it but I end up with a (8124, 22) matrix instead of the (22,22) I am looking for. (I have 8124 observation). If you see what I mean – Davina 30/12, 2017 at 17:3

@DavidZarebski you can check this one : -) codereview.stackexchange.com/questions/96761/… – Oystercatcher 30/12, 2017 at 17:22

@DavidZarebski isn't the Pearson test's full name the Pearson's chi-squared test ? I think you might be overly complicating things by avoiding it. – Gesner 12/7, 2018 at 13:20

@robertotomás There is something called the Pearson's chi-squared test (which leads to some confusions sometimes (en.wikipedia.org/wiki/Pearson%27s_chi-squared_test)) and yes, it is intended to measure the correlation between categorical variables (like a regular Chi2). However, it seems to me that it differs from the so called Pearson correlation (resp. Kendall, Spearman) (see (en.wikipedia.org/wiki/Pearson_correlation_coefficient)) intended to apply to numerical variables. Calling the .corr(method='pearson') method in pandas involves the latter. – Davina 12/7, 2018 at 13:33
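
To make that distinction concrete, here is a minimal sketch on made-up toy data (the series x and y and all variable names are hypothetical, not taken from the thread). It computes the Pearson correlation of integer codes under two different, equally arbitrary code assignments, and the chi-squared statistic from the contingency table of the raw labels.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical toy data: y is a deterministic relabelling of x.
x = pd.Series(list("abcabcabc"))
y = x.map({"a": "u", "b": "v", "c": "w"})

# Integer codes for x, and two arbitrary encodings of the same y labels.
codes_x = pd.Series(pd.Categorical(x, categories=["a", "b", "c"]).codes)
codes_y1 = pd.Series(pd.Categorical(y, categories=["u", "v", "w"]).codes)
codes_y2 = pd.Series(pd.Categorical(y, categories=["v", "u", "w"]).codes)

# Pearson correlation coefficient on the integer codes.
r1 = codes_x.corr(codes_y1)
r2 = codes_x.corr(codes_y2)

# Pearson's chi-squared test on the contingency table of the raw labels.
chi2_stat, p_value, dof, expected = chi2_contingency(pd.crosstab(x, y))

print(r1, r2)            # the two coefficients differ
print(chi2_stat, dof)    # does not depend on any integer encoding
```

The Pearson coefficient changes with the encoding even though the data are the same, while the chi-squared statistic only looks at the cell counts, so it is unaffected by how the labels are numbered.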

It turns out the only solution I found is to iterate through all the factor*factor pairs.

import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

factors_paired = [(i, j) for i in df.columns.values for j in df.columns.values]

chi2, p_values = [], []

for f in factors_paired:
    if f[0] != f[1]:
        chitest = chi2_contingency(pd.crosstab(df[f[0]], df[f[1]]))
        chi2.append(chitest[0])      # chi-squared statistic
        p_values.append(chitest[1])  # p-value
    else:      # for same factor pair
        chi2.append(0)
        p_values.append(0)

n = len(df.columns)  # 23 columns in my data set
chi2 = np.array(chi2).reshape((n, n)) # shape it as a matrix
chi2 = pd.DataFrame(chi2, index=df.columns.values, columns=df.columns.values) # then a df for convenience
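
Since the test is symmetric in its two variables, the same loop can be sketched over unordered pairs only, filling both halves of the matrix at once. This is a minimal illustration on a made-up three-column frame (the column names and values are hypothetical), and it leaves the diagonal as NaN instead of 0, which is a design choice rather than anything required by scipy.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical toy frame standing in for the real data set.
df = pd.DataFrame({
    "cat1": list("aabbaabb"),
    "cat2": list("xxyyxxyy"),
    "cat3": list("mnmnmnmn"),
})

cols = df.columns
# Diagonal stays NaN: a variable is not tested against itself here.
chi2_mat = pd.DataFrame(np.nan, index=cols, columns=cols)
pval_mat = pd.DataFrame(np.nan, index=cols, columns=cols)

# Visit each unordered pair once and mirror the result.
for a, b in combinations(cols, 2):
    stat, p, _, _ = chi2_contingency(pd.crosstab(df[a], df[b]))
    chi2_mat.loc[a, b] = chi2_mat.loc[b, a] = stat
    pval_mat.loc[a, b] = pval_mat.loc[b, a] = p

print(chi2_mat)
print(pval_mat)
```

Either matrix is a square DataFrame indexed by column name, so it could be handed to a plotting function such as seaborn.heatmap to get the heatmap the question asks for.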

Davina answered 31/12, 2017 at 15:20 Comment(4)

zar3bski Where is Chitest defined here? – Manns 24/4, 2020 at 15:58

here I would say (it's been a while). I must have included something like from scipy.stats import chi2_contingency in the beginning of the script (which I do not have anymore) – Davina 24/4, 2020 at 16:4

why do you use chitest – Hafler 4/2, 2021 at 18:24

why did you use 23, 23 to reshape the array, is it because OP has mentioned he has 22 categorical columns? – Aragonite 19/4, 2021 at 4:0

Calculating a Cramér's V coefficient matrix from a pandas.DataFrame object is quite simple with the association-metrics Python package; let me show you:

First install association_metrics using:

pip install association-metrics

Then, you can use the following code:

# Import association_metrics
import association_metrics as am
# Convert your str columns to Category columns
df = df.apply(
        lambda x: x.astype("category") if x.dtype == "O" else x)

# Initialize a CramersV object using your pandas.DataFrame
cramersv = am.CramersV(df)
# will return a pairwise matrix filled with Cramer's V, where columns and index are
# the categorical variables of the passed pandas.DataFrame
cramersv.fit()

Piste answered 18/7, 2022 at 19:52 Comment(0)

You can use scipy.stats.chi2_contingency to compute the chi-squared statistic, and then apply the Cramér's V formula as in the cramers_V function below.

This solution has only one for loop to iterate through rows and it uses apply to iterate through columns so it might be somewhat efficient.

import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

df = pd.DataFrame({'A':['a1','a2','a1','a1','a1'],
                 'B':['b1','b2','b2','b2','b2'],
                 'C':['c1','c2','c1','c1','c1'],
                 'D':['d1','d2','d1','d2','d1']})

def cramers_V(var1, var2):
    crosstab = np.array(pd.crosstab(var1, var2))
    stats = chi2_contingency(crosstab)[0]
    cram_V = stats / (np.sum(crosstab) * (min(crosstab.shape) - 1))
    return cram_V

def cramers_col(column_name):
    col = pd.Series(np.empty(df.columns.shape), index=df.columns, name=column_name)
    for row in df:
        cram = cramers_V(df[column_name], df[row])
        col[row] = round(cram, 2)
    return col

df.apply(lambda column: cramers_col(column.name))

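The result for each pair of features ranges from 0 to 1: the higher the value, the stronger the association. Note that cramers_V as written returns chi2 / (n * (min(rows, cols) - 1)), which is the square of Cramér's V, and that chi2_contingency applies Yates' continuity correction to 2x2 tables by default, which is why the diagonal entries are below 1.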

Output:

    A       B       C       D
A   0.14    0.00    0.14    0.01
B   0.00    0.14    0.00    0.00
C   0.14    0.00    0.14    0.01
D   0.01    0.00    0.01    0.34

As expected, among pairs of distinct variables, the A-C association is the strongest.

Also, if the chi-squared test is calculated by hand for this example (using degrees of freedom and all), the A-C association is the strongest there as well.
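
For comparison, here is a minimal sketch of the textbook form, V = sqrt(chi2 / (n * (min(rows, cols) - 1))), with the continuity correction switched off through the correction=False argument of chi2_contingency. The helper name cramers_v_uncorrected is made up for this illustration, and the data frame is the same toy example as above.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

df = pd.DataFrame({'A': ['a1', 'a2', 'a1', 'a1', 'a1'],
                   'B': ['b1', 'b2', 'b2', 'b2', 'b2'],
                   'C': ['c1', 'c2', 'c1', 'c1', 'c1'],
                   'D': ['d1', 'd2', 'd1', 'd2', 'd1']})

def cramers_v_uncorrected(var1, var2):
    """Hypothetical helper: sqrt(chi2 / (n * (min(rows, cols) - 1))), no Yates correction."""
    crosstab = pd.crosstab(var1, var2).to_numpy()
    chi2_stat = chi2_contingency(crosstab, correction=False)[0]
    n = crosstab.sum()
    return np.sqrt(chi2_stat / (n * (min(crosstab.shape) - 1)))

# Pairwise matrix: one row and one column per categorical variable.
v_matrix = pd.DataFrame(
    [[cramers_v_uncorrected(df[r], df[c]) for c in df.columns] for r in df.columns],
    index=df.columns, columns=df.columns)

print(v_matrix.round(2))
```

Under these assumptions a variable paired with itself, or with a column that is a pure relabelling of it (A and C here), scores 1. With only five observations the values are of course illustrative only.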

Thermograph answered 31/10, 2023 at 21:37 Comment(0)
