Using pandas, calculate Cramér's coefficient matrix

I have a dataframe in pandas which contains metrics calculated on Wikipedia articles. It has two categorical variables: nation, which nation the article is about, and lang, which language Wikipedia it was taken from. For a single metric, I would like to see how closely the nation and lang variables correlate; I believe this is done using Cramér's statistic.

index   qid     subj    nation  lang    metric          value
5   Q3488399    economy     cdi     fr  informativeness 0.787117
6   Q3488399    economy     cdi     fr  referencerate   0.000945
7   Q3488399    economy     cdi     fr  completeness    43.200000
8   Q3488399    economy     cdi     fr  numheadings     11.000000
9   Q3488399    economy     cdi     fr  articlelength   3176.000000
10  Q7195441    economy     cdi     en  informativeness 0.626570
11  Q7195441    economy     cdi     en  referencerate   0.008610
12  Q7195441    economy     cdi     en  completeness    6.400000
13  Q7195441    economy     cdi     en  numheadings     7.000000
14  Q7195441    economy     cdi     en  articlelength   2323.000000

I would like to generate a matrix that displays Cramér's coefficient between all combinations of four nations (France, USA, Côte d'Ivoire, and Uganda) ['fra','usa','cdi','uga'] and three languages ['fr','en','sw']. So there would be a resulting 4 by 3 matrix like:

       en         fr          sw
usa    Cramer11   Cramer12    ... 
fra    Cramer21   Cramer22    ... 
cdi    ...
uga    ...

Eventually I will do this over all the different metrics I am tracking:

for subject in list_of_subjects:
    for metric in list_of_metrics:
        cramer_matrix(metric, df)

Then I can test my hypothesis that metrics will be higher for articles whose nation's language matches the language of the Wikipedia edition.

Plea answered 2/1, 2014 at 22:6 Comment(0)

Cramér's V seems pretty overoptimistic in a few tests that I did. Wikipedia recommends a bias-corrected version.

import numpy as np
import scipy.stats as ss

def cramers_corrected_stat(confusion_matrix):
    """ Calculate Cramér's V statistic for categorical-categorical association.
        Uses correction from Bergsma and Wicher,
        Journal of the Korean Statistical Society 42 (2013): 323-328.
        Expects a 2D numpy array of counts (for a pandas crosstab, pass .to_numpy()).
    """
    chi2 = ss.chi2_contingency(confusion_matrix)[0]
    n = confusion_matrix.sum()                       # total number of observations
    phi2 = chi2/n
    r, k = confusion_matrix.shape
    phi2corr = max(0, phi2 - ((k-1)*(r-1))/(n-1))    # bias-corrected phi^2
    rcorr = r - ((r-1)**2)/(n-1)                     # corrected row count
    kcorr = k - ((k-1)**2)/(n-1)                     # corrected column count
    return np.sqrt(phi2corr / min((kcorr-1), (rcorr-1)))

Also note that the confusion matrix can be computed with a built-in pandas method for categorical columns:

import pandas as pd
confusion_matrix = pd.crosstab(df[column1], df[column2])
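For the question's data, an end-to-end usage sketch might look like this (assuming df has the metric, nation and lang columns from the question; note the .to_numpy(), see the comments below):

import pandas as pd

sub = df[df['metric'] == 'completeness']   # pick one metric
confusion_matrix = pd.crosstab(sub['nation'], sub['lang']).to_numpy()
print(cramers_corrected_stat(confusion_matrix))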
Branscum answered 1/9, 2016 at 8:15 Comment(4)
This was great @Ziggy! One heads up: if the confusion matrix is calculated using pd.crosstab(df[column1], df[column2]), then n = confusion_matrix.sum() needs to be n = confusion_matrix.sum().sum() (numpy sums along all dimensions, pandas along one only). Great answer and very readable code. – Rayerayfield
I think the above function expects a 2D numpy array as input, not a pandas object. It might work with confusion_matrix = pd.crosstab(df[column1], df[column2]).to_numpy() – Hinny
A quick hack is .to_numpy().sum() instead of .sum().sum(); it should be faster too. – Comedy
Does this work if n = k? See #78319455 – Bridgeman

A slightly modified version of the function from Ziggy Eunicien's answer, with two modifications added:

  1. Check whether one of the variables is constant.
  2. Pass correction=False to ss.chi2_contingency when the confusion matrix is 2x2 (Yates' continuity correction is only applied to 2x2 tables; this disables it).

Python

import scipy.stats as ss
import pandas as pd
import numpy as np

def cramers_corrected_stat(x, y):
    """ Calculate Cramér's V statistic for categorical-categorical association.
        Uses correction from Bergsma and Wicher,
        Journal of the Korean Statistical Society 42 (2013): 323-328.
    """
    result = -1
    if len(x.value_counts()) == 1:
        print("First variable is constant")
    elif len(y.value_counts()) == 1:
        print("Second variable is constant")
    else:
        conf_matrix = pd.crosstab(x, y)

        # Yates' continuity correction only applies to 2x2 tables; disable it there
        correct = conf_matrix.shape[0] != 2

        chi2 = ss.chi2_contingency(conf_matrix, correction=correct)[0]

        n = conf_matrix.sum().sum()                      # total observations
        phi2 = chi2/n
        r, k = conf_matrix.shape
        phi2corr = max(0, phi2 - ((k-1)*(r-1))/(n-1))    # bias-corrected phi^2
        rcorr = r - ((r-1)**2)/(n-1)                     # corrected row count
        kcorr = k - ((k-1)**2)/(n-1)                     # corrected column count
        result = np.sqrt(phi2corr / min((kcorr-1), (rcorr-1)))
    return round(result, 6)
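Usage on the question's columns might look like this (a sketch, assuming df as in the question):

# pass the two categorical columns directly
v = cramers_corrected_stat(df['nation'], df['lang'])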
Disciplinarian answered 11/2, 2019 at 7:16 Comment(2)
Hi, why do you need to add these lines? phi2corr = max(0, phi2 - ((k-1)*(r-1))/(n-1)); rcorr = r - ((r-1)**2)/(n-1); kcorr = k - ((k-1)**2)/(n-1); result = np.sqrt(phi2corr / min((kcorr-1), (rcorr-1))). According to the wiki, we can just use phi2 / min(k-1, r-1). – Repetend
@R_abcdefg: it is the bias correction, according to en.wikipedia.org/wiki/Cram%C3%A9r%27s_V – Saied

Cramér's V statistic measures the association between two categorical features in one data set, so it fits your case.

To calculate Cramér's V statistic you need to calculate the confusion matrix. So, the solution steps are:
1. Filter data for a single metric
2. Calculate the confusion matrix
3. Calculate Cramér's V statistic

Of course, you can do those steps in the loop nest provided in your post. But in your opening paragraph you mention only metrics as an outer parameter, so I am not sure that you need both loops. I will provide code for steps 2-3, because filtering is simple and, as I mentioned, I am not sure what exactly you need.
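For completeness, step 1 could be as simple as this (a sketch assuming the column layout from the question):

# keep only the rows for one metric, e.g. 'informativeness'
data = df[df['metric'] == 'informativeness']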

Step 2. In the code below, data is a pandas.DataFrame filtered by whatever you want on step 1.

import numpy as np

confusions = []
for nation in list_of_nations:
    for language in list_of_languages:
        # count rows matching this (nation, language) pair
        cond = (data['nation'] == nation) & (data['lang'] == language)
        confusions.append(cond.sum())
confusion_matrix = np.array(confusions).reshape(len(list_of_nations), len(list_of_languages))
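Equivalently, step 2 can be done with pd.crosstab (a sketch; note that a nation or language entirely absent from the filtered data will not produce a zero row or column the way the explicit loops do):

import pandas as pd

# rows: observed nations, columns: observed languages, cells: counts
confusion_matrix = pd.crosstab(data['nation'], data['lang']).to_numpy()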

Step 3. In the code below confusion_matrix is a numpy.ndarray obtained on step 2.

import numpy as np
import scipy.stats as ss

def cramers_stat(confusion_matrix):
    chi2 = ss.chi2_contingency(confusion_matrix)[0]   # chi-squared test statistic
    n = confusion_matrix.sum()                        # total observations
    return np.sqrt(chi2 / (n * (min(confusion_matrix.shape) - 1)))

result = cramers_stat(confusion_matrix)

This code was tested on my data set, but I hope it works without changes in your case.

Sepalous answered 25/6, 2016 at 19:16 Comment(0)

Using the association-metrics Python package, calculating a Cramér's V matrix from a pandas.DataFrame object is quite simple. Let me show you.

First, install association_metrics using:

pip install association-metrics

Then, you can use the following code:

# Import association_metrics
import association_metrics as am

# Convert your str columns to Category columns
df = df.apply(
        lambda x: x.astype("category") if x.dtype == "O" else x)

# Initialize a CramersV object using your pandas.DataFrame
cramersv = am.CramersV(df)

# fit() will return a pairwise matrix filled with Cramér's V, where columns and
# index are the categorical variables of the passed pandas.DataFrame
cramersv.fit()

Package info

Ednaedny answered 3/2, 2021 at 9:29 Comment(0)

Let's not reinvent the wheel! Scipy already has a function.

https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.contingency.association.html


import numpy as np
from scipy.stats.contingency import association
obs4x2 = np.array([[100, 150], [203, 322], [420, 700], [320, 210]])  # a 4x2 contingency table

association(obs4x2, method="cramer")
# 0.18617813077483678
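Applied to the question's data, this might look like the following sketch (assuming df with the nation and lang columns from the question):

import pandas as pd
from scipy.stats.contingency import association

# build the contingency table of counts, then compute Cramér's V
obs = pd.crosstab(df['nation'], df['lang']).to_numpy()
association(obs, method="cramer")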
Armory answered 27/6, 2023 at 17:16 Comment(0)

There is a far simpler answer. The question is about Cramér's V, so I will stick to answering that.

For your pandas DataFrame data, if you're only interested in the language and nation columns, you can easily get a correlation heatmap using the few lines below. Note that this plots pairwise Pearson correlations between the one-hot encoded columns (for a pair of binary indicators, Pearson correlation is the phi coefficient, whose absolute value equals Cramér's V on a 2x2 table), not Cramér's V between the original multi-category variables:

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# first choose your category columns of interest
df = data[['nation', 'lang']]

# now change this to dummy variables, one-hot encoded:
DataMatrix = pd.get_dummies(df)

# plot as simply as:
plt.figure(figsize=(15, 12))  # for large datasets
plt.title('Correlation of one-hot encoded nation and language columns')
sns.heatmap(DataMatrix.corr('pearson'), cmap='coolwarm', center=0)

Alternatives I can recommend are: 2 by 2 chi-squared tests of proportions, or asymmetric normalised mutual information (NMI or Theil's U).
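If you do want a heatmap of actual Cramér's V values, here is one sketch using scipy's association from the answer above; cols is a hypothetical list of categorical column names in data:

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats.contingency import association

cols = ['nation', 'lang']  # hypothetical: the categorical columns to compare
V = pd.DataFrame(np.eye(len(cols)), index=cols, columns=cols)
for i, a in enumerate(cols):
    for b in cols[i + 1:]:
        # Cramér's V from the contingency table of each column pair
        v = association(pd.crosstab(data[a], data[b]).to_numpy(), method="cramer")
        V.loc[a, b] = V.loc[b, a] = v

sns.heatmap(V, annot=True, vmin=0, vmax=1, cmap='coolwarm')
plt.title("Pairwise Cramér's V")
plt.show()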

Grassplot answered 10/12, 2020 at 14:7 Comment(0)
