I recently found this answer, which provides code for a bias-corrected version of Cramer's V for measuring the association between two categorical variables:
import numpy as np
import scipy.stats as ss

def cramers_corrected_stat(confusion_matrix):
    """ calculate Cramers V statistic for categorical-categorical association.
        uses correction from Bergsma, Wicher,
        Journal of the Korean Statistical Society 42 (2013): 323-328
    """
    # Accept a DataFrame (e.g. from pd.crosstab) or an array, so that .sum() is the grand total
    confusion_matrix = np.asarray(confusion_matrix)
    chi2 = ss.chi2_contingency(confusion_matrix)[0]
    n = confusion_matrix.sum()
    phi2 = chi2/n
    r,k = confusion_matrix.shape
    phi2corr = max(0, phi2 - ((k-1)*(r-1))/(n-1))
    rcorr = r - ((r-1)**2)/(n-1)
    kcorr = k - ((k-1)**2)/(n-1)
    return np.sqrt(phi2corr / min( (kcorr-1), (rcorr-1)))
However, if the number of samples, n, is equal to the number of categories of the first feature, r, then rcorr = r - (r-1)**2/(n-1) = n - (n-1) = 1, so rcorr-1 = 0. Whenever (kcorr-1) is non-negative, min( (kcorr-1), (rcorr-1)) is then 0, which yields a division by zero in np.sqrt(phi2corr / min( (kcorr-1), (rcorr-1))). I confirmed this with a simple example:
import pandas as pd
data = [
    {'name': 'Alice', 'occupation': 'therapist', 'favorite_color': 'red'},
    {'name': 'Bob', 'occupation': 'fisherman', 'favorite_color': 'blue'},
    {'name': 'Carol', 'occupation': 'scientist', 'favorite_color': 'orange'},
    {'name': 'Doug', 'occupation': 'scientist', 'favorite_color': 'red'},
]
df = pd.DataFrame(data)
# n = 4 (number of samples), r = 4 (number of unique names), k = 3 (number of unique occupations)
confusion_matrix = pd.crosstab(df['name'], df['occupation'])
print(cramers_corrected_stat(confusion_matrix))
Output:
/tmp/ipykernel_227998/749514942.py:45: RuntimeWarning: invalid value encountered in scalar divide
return np.sqrt(phi2corr / min( (kcorr-1), (rcorr-1)))
nan
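To see where the nan comes from, the intermediate terms can be worked out by hand. The following is only a minimal sketch and not part of the linked answer: the helper name corrected_terms is hypothetical, the chi-squared statistic is computed directly as Pearson's statistic without continuity correction (rather than through scipy), and exact fractions are used so that zeros show up as exact zeros. The table is the name/occupation crosstab from above, with the occupation columns assumed to be in alphabetical order.

```python
from fractions import Fraction


def corrected_terms(table):
    """Return intermediate terms of the bias-corrected Cramer's V.

    `table` is a contingency table given as a list of rows of counts.
    Hypothetical helper for illustration only; assumes n > 1 and no
    empty rows or columns.
    """
    n = sum(sum(row) for row in table)
    r, k = len(table), len(table[0])
    row_sums = [sum(row) for row in table]
    col_sums = [sum(row[j] for row in table) for j in range(k)]

    # Pearson's chi-squared statistic, computed by hand with exact fractions.
    chi2 = Fraction(0)
    for i in range(r):
        for j in range(k):
            expected = Fraction(row_sums[i] * col_sums[j], n)
            chi2 += (table[i][j] - expected) ** 2 / expected

    phi2 = chi2 / n
    phi2corr = max(Fraction(0), phi2 - Fraction((k - 1) * (r - 1), n - 1))
    rcorr = r - Fraction((r - 1) ** 2, n - 1)
    kcorr = k - Fraction((k - 1) ** 2, n - 1)
    return {
        "chi2": chi2,
        "phi2corr": phi2corr,                        # numerator under the square root
        "denominator": min(kcorr - 1, rcorr - 1),    # denominator under the square root
    }


# Rows: Alice, Bob, Carol, Doug; columns: fisherman, scientist, therapist.
table = [
    [0, 0, 1],  # Alice: therapist
    [1, 0, 0],  # Bob: fisherman
    [0, 1, 0],  # Carol: scientist
    [0, 1, 0],  # Doug: scientist
]
terms = corrected_terms(table)
print(terms)
```

For this table chi2 is 8, so phi2 = 2 and the correction term (k-1)*(r-1)/(n-1) = 6/3 = 2 cancels it exactly. Both the numerator and the denominator are therefore zero, which would explain why the warning is "invalid value" (0/0) rather than "divide by zero".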
Is this expected behavior?
If so, how should I use the corrected Cramer's V in cases where n = r (or, by symmetry, n = k), e.g., when all samples have a unique value for some feature?
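One workaround I could imagine is to guard the division explicitly, so that the degenerate case is reported as "undefined" without a warning. This is only a sketch under my own assumptions: the function name cramers_v_from_chi2 and its signature are hypothetical, it takes an already computed chi-squared statistic rather than the table, and returning nan is just one possible convention, not something I know to be statistically justified.

```python
import math


def cramers_v_from_chi2(chi2, n, r, k):
    """Bias-corrected Cramer's V from a precomputed chi-squared statistic.

    Hypothetical helper: returns nan when the corrected denominator is
    not positive (e.g. when n == r or n == k) instead of dividing by zero.
    """
    if n <= 1:
        return math.nan  # the correction terms divide by (n - 1)

    phi2 = chi2 / n
    phi2corr = max(0.0, phi2 - ((k - 1) * (r - 1)) / (n - 1))
    rcorr = r - ((r - 1) ** 2) / (n - 1)
    kcorr = k - ((k - 1) ** 2) / (n - 1)
    denominator = min(kcorr - 1, rcorr - 1)

    if denominator <= 0:
        # Degenerate case: the statistic is undefined for this table shape.
        return math.nan
    return math.sqrt(phi2corr / denominator)


# The example from the question: chi2 = 8 for n = 4, r = 4, k = 3.
degenerate = cramers_v_from_chi2(8.0, 4, 4, 3)

# A non-degenerate, made-up case: chi2 = 10 for a 2x2 table with n = 100.
regular = cramers_v_from_chi2(10.0, 100, 2, 2)
print(degenerate, regular)
```

This only makes the failure explicit; it does not answer whether a different estimator should be used when every sample has its own category.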
n = r = k = 3. The correlation between 'name' and 'occupation' should be 1 as each name is matched with a different occupation and vice-versa. However, the denominator is zero and the function will return 0. – Mireille