Corrected Cramer's V results in division by zero when n = r

I recently found this answer which provides the code of an unbiased version of Cramer's V for computing the correlation of two categorical variables:

import numpy as np
import scipy.stats as ss

def cramers_corrected_stat(confusion_matrix):
    """ calculate Cramers V statistic for categorical-categorical association.
        uses correction from Wicher Bergsma,
        Journal of the Korean Statistical Society 42 (2013): 323-328
    """
    chi2 = ss.chi2_contingency(confusion_matrix)[0]
    n = np.asarray(confusion_matrix).sum()  # total count; also works for a DataFrame
    phi2 = chi2/n
    r,k = confusion_matrix.shape
    phi2corr = max(0, phi2 - ((k-1)*(r-1))/(n-1))
    rcorr = r - ((r-1)**2)/(n-1)
    kcorr = k - ((k-1)**2)/(n-1)
    return np.sqrt(phi2corr / min( (kcorr-1), (rcorr-1)))

However, if the number of samples, n, equals the number of categories of the first feature, r, then rcorr = n - (n-1)**2/(n-1) = n - (n-1) = 1, so rcorr-1 = 0. Whenever (kcorr-1) is non-negative, min((kcorr-1), (rcorr-1)) is therefore 0, and np.sqrt(phi2corr / min((kcorr-1), (rcorr-1))) divides by zero. I confirmed this with a simple example:

import pandas as pd

data = [
    {'name': 'Alice', 'occupation': 'therapist', 'favorite_color': 'red'},
    {'name': 'Bob', 'occupation': 'fisherman', 'favorite_color': 'blue'},
    {'name': 'Carol', 'occupation': 'scientist', 'favorite_color': 'orange'},
    {'name': 'Doug', 'occupation': 'scientist', 'favorite_color': 'red'},
    ]

df = pd.DataFrame(data) 

confusion_matrix = pd.crosstab(df['name'], df['occupation']) # n = 4 (number of samples), r = 4 (number of unique names), k = 3 (number of unique occupations)
print(cramers_corrected_stat(confusion_matrix))

Output:

/tmp/ipykernel_227998/749514942.py:45: RuntimeWarning: invalid value encountered in scalar divide
  return np.sqrt(phi2corr / min( (kcorr-1), (rcorr-1)))
nan
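
The arithmetic behind that nan can be checked in isolation. The snippet below is only a minimal sketch: `corrected_dims` is a hypothetical helper name (not part of the original function) that recomputes rcorr and kcorr for the example's values r = 4, k = 3, n = 4.

```python
def corrected_dims(r, k, n):
    """Return (rcorr, kcorr) using the same formulas as cramers_corrected_stat."""
    rcorr = r - ((r - 1) ** 2) / (n - 1)
    kcorr = k - ((k - 1) ** 2) / (n - 1)
    return rcorr, kcorr


# Values from the example: 4 samples, 4 unique names, 3 unique occupations.
rcorr, kcorr = corrected_dims(r=4, k=3, n=4)
denominator = min(kcorr - 1, rcorr - 1)
print(rcorr, denominator)  # prints: 1.0 0.0
```

Since phi2corr is also 0 for this table, the final expression evaluates 0 / 0, which is why NumPy reports an invalid value rather than a plain division by zero.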

Is this expected behavior?

If so, how should I use the corrected Cramer's V in cases where n = r (or, by symmetry, n = k), e.g., when all samples have a unique value for some feature?

Mireille asked 12/4 at 22:20

You can handle the division by zero when n = r by checking the denominator before dividing and returning a fixed value when it is not positive. I modified your function this way:

Your original function:

def cramers_corrected_stat(confusion_matrix):
    """ calculate Cramers V statistic for categorical-categorical association.
        uses correction from Wicher Bergsma,
        Journal of the Korean Statistical Society 42 (2013): 323-328
    """
    chi2 = ss.chi2_contingency(confusion_matrix)[0]
    n = np.asarray(confusion_matrix).sum()  # total count; also works for a DataFrame
    phi2 = chi2/n
    r,k = confusion_matrix.shape
    phi2corr = max(0, phi2 - ((k-1)*(r-1))/(n-1))
    rcorr = r - ((r-1)**2)/(n-1)
    kcorr = k - ((k-1)**2)/(n-1)
    return np.sqrt(phi2corr / min( (kcorr-1), (rcorr-1)))

becomes

def cramers_corrected_stat(confusion_matrix):
    """Calculate Cramers V statistic for categorical-categorical association.
       Uses correction from Wicher Bergsma,
       Journal of the Korean Statistical Society 42 (2013): 323-328.
    """
    chi2 = ss.chi2_contingency(confusion_matrix)[0]
    n = confusion_matrix.sum().sum()  
    phi2 = chi2 / n
    r, k = confusion_matrix.shape
    phi2corr = max(0, phi2 - ((k-1)*(r-1))/(n-1))
    rcorr = r - ((r-1)**2)/(n-1)
    kcorr = k - ((k-1)**2)/(n-1)
    
    denominator = min((kcorr-1), (rcorr-1))
    if denominator <= 0:
        return 0
    else:
        return np.sqrt(phi2corr / denominator)

With your sample data (n = 4, r = 4, k = 3):

data = [
    {'name': 'Alice', 'occupation': 'therapist', 'favorite_color': 'red'},
    {'name': 'Bob', 'occupation': 'fisherman', 'favorite_color': 'blue'},
    {'name': 'Carol', 'occupation': 'scientist', 'favorite_color': 'orange'},
    {'name': 'Doug', 'occupation': 'scientist', 'favorite_color': 'red'},
]

df = pd.DataFrame(data)

confusion_matrix = pd.crosstab(df['name'], df['occupation']) 
result = cramers_corrected_stat(confusion_matrix)
print(f"Cramer's V Result: {result}")

you'd get

Cramer's V Result: 0

To handle the corner case where n = k = r, I updated the function as follows:

import numpy as np
import pandas as pd
import scipy.stats as ss

def cramers_corrected_stat(confusion_matrix):
    """Calculate Cramers V statistic for categorical-categorical association.
       Uses correction from Wicher Bergsma,
       Journal of the Korean Statistical Society 42 (2013): 323-328.
    """
    chi2 = ss.chi2_contingency(confusion_matrix)[0]
    n = confusion_matrix.sum().sum()  
    phi2 = chi2 / n
    r, k = confusion_matrix.shape
    phi2corr = max(0, phi2 - ((k-1)*(r-1))/(n-1))
    rcorr = r - ((r-1)**2)/(n-1)
    kcorr = k - ((k-1)**2)/(n-1)
    
    if rcorr <= 1 and kcorr <= 1: 
        return 0
    
    denominator = min((kcorr-1), (rcorr-1))
    if denominator <= 0:
        return 1
    else:
        return np.sqrt(phi2corr / denominator)

# Sample data
data = [
    {'name': 'Alice', 'occupation': 'therapist', 'favorite_color': 'red'},
    {'name': 'Bob', 'occupation': 'fisherman', 'favorite_color': 'blue'},
    {'name': 'Carol', 'occupation': 'scientist', 'favorite_color': 'orange'},
    {'name': 'Doug', 'occupation': 'scientist', 'favorite_color': 'red'},
]

df = pd.DataFrame(data)

confusion_matrix = pd.crosstab(df['name'], df['occupation']) 
result = cramers_corrected_stat(confusion_matrix)
print(f"Cramer's V Result: {result}")

Crosspiece answered 13/4 at 5:16 Comment(3)
Thanks for the answer. In your code, the correlation function will be zero whenever the denominator is non-positive. I wonder if this could be a problem in specific cases. For instance, say we remove Doug from the data so that now, n = r = k = 3. The correlation between 'name' and 'occupation' should be 1 as each name is matched with a different occupation and vice-versa. However, the denominator is zero and the function will return 0. - Mireille
@GabrielRebello Yes, you are right. I updated my answer with a function handling these corner cases. - Crosspiece
I don't think the update solved the problem. Note that whenever n = r = k, then rcorr = kcorr = 1, which triggers if rcorr <= 1 and kcorr <= 1: return 0 in your code. As n = r = k implies every unique name matches a unique occupation (otherwise, r > k or k > r), I think we can safely assume a correlation of 1 in this case. Also, when n = r != k or n = k != r, the classical version of Cramer's V yields a correlation of 1 regardless of n. Thus, I suspect correcting the line if denominator <= 0: return 0 to if denominator <= 0: return 1 is enough to address all corner cases. - Mireille
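
The proposal in the last comment could be sketched roughly as follows. This is only an illustration of that suggestion, not something taken from the Bergsma paper: returning 1 when the corrected denominator is non-positive is the commenter's assumption, and the names `cramers_v_classical` and `cramers_corrected_stat_guarded` are hypothetical. The classical (uncorrected) statistic is included purely as a reference value for the two corner cases discussed in the thread.

```python
import numpy as np
import pandas as pd
import scipy.stats as ss


def cramers_v_classical(table):
    """Uncorrected Cramer's V, used here only as a reference value."""
    observed = np.asarray(table)
    chi2 = ss.chi2_contingency(observed)[0]
    n = observed.sum()
    r, k = observed.shape
    return np.sqrt(chi2 / (n * min(r - 1, k - 1)))


def cramers_corrected_stat_guarded(table):
    """Bias-corrected Cramer's V with the guard proposed in the comment.

    Assumption (from the comment thread, not from the paper): when the
    corrected denominator is non-positive, return 1.
    """
    observed = np.asarray(table)
    chi2 = ss.chi2_contingency(observed)[0]
    n = observed.sum()
    phi2 = chi2 / n
    r, k = observed.shape
    phi2corr = max(0, phi2 - ((k - 1) * (r - 1)) / (n - 1))
    rcorr = r - ((r - 1) ** 2) / (n - 1)
    kcorr = k - ((k - 1) ** 2) / (n - 1)
    denominator = min(kcorr - 1, rcorr - 1)
    if denominator <= 0:
        return 1.0
    return float(np.sqrt(phi2corr / denominator))


# Sample data from the question.
df = pd.DataFrame([
    {'name': 'Alice', 'occupation': 'therapist'},
    {'name': 'Bob', 'occupation': 'fisherman'},
    {'name': 'Carol', 'occupation': 'scientist'},
    {'name': 'Doug', 'occupation': 'scientist'},
])

# n = r = 4, k = 3 (the case from the question).
table_4x3 = pd.crosstab(df['name'], df['occupation'])
# n = r = k = 3 (Doug removed, the case from the first comment).
no_doug = df.iloc[:3]
table_3x3 = pd.crosstab(no_doug['name'], no_doug['occupation'])

for table in (table_4x3, table_3x3):
    print(cramers_v_classical(table), cramers_corrected_stat_guarded(table))
```

Whether 1 is the right value for every table with a non-positive denominator is a judgment call; the guard simply encodes the reasoning given in the comment for tables where every sample has a unique value for one of the features.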
