I have a dataframe in Pandas which contains metrics calculated on Wikipedia articles. There are two categorical variables: nation, which nation the article is about, and lang, which language Wikipedia it was taken from. For a single metric, I would like to see how closely the nation and language variables correlate. I believe this is done using Cramér's V statistic.
index qid subj nation lang metric value
5 Q3488399 economy cdi fr informativeness 0.787117
6 Q3488399 economy cdi fr referencerate 0.000945
7 Q3488399 economy cdi fr completeness 43.200000
8 Q3488399 economy cdi fr numheadings 11.000000
9 Q3488399 economy cdi fr articlelength 3176.000000
10 Q7195441 economy cdi en informativeness 0.626570
11 Q7195441 economy cdi en referencerate 0.008610
12 Q7195441 economy cdi en completeness 6.400000
13 Q7195441 economy cdi en numheadings 7.000000
14 Q7195441 economy cdi en articlelength 2323.000000
I would like to generate a matrix that displays Cramér's coefficient between all combinations of the four nations (France, USA, Côte d'Ivoire, and Uganda), ['fra','usa','cdi','uga'], and the three languages ['fr','en','sw']. So there would be a resulting 4 by 3 matrix like:
     en        fr        sw
usa  Cramer11  Cramer12  ...
fra  Cramer21  Cramer22  ...
cdi  ...
uga  ...
Eventually I will do this over all the different metrics I am tracking.
for subject in list_of_subjects:
    for metric in list_of_metrics:
        cramer_matrix(metric, df)
Then I can test my hypothesis that metrics will be higher for articles whose language is the language of the Wikipedia.
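For reference, here is a minimal sketch of how I understand Cramér's V for two categorical columns, in case it helps frame an answer. The function name cramers_v and the toy data are made up for illustration, the chi-square statistic is computed by hand with NumPy rather than through a statistics library, and no bias correction is applied. Note that it returns a single coefficient for a pair of columns.

```python
import numpy as np
import pandas as pd


def cramers_v(x: pd.Series, y: pd.Series) -> float:
    """Cramér's V (no bias correction) for two categorical Series.

    Hypothetical helper for illustration only.
    """
    # Contingency table of observed counts, as a plain float array
    observed = pd.crosstab(x, y).to_numpy(dtype=float)
    n = observed.sum()  # NumPy sums over every cell, giving the grand total

    # Expected counts under independence: (row total * column total) / n
    row_totals = observed.sum(axis=1, keepdims=True)
    col_totals = observed.sum(axis=0, keepdims=True)
    expected = row_totals @ col_totals / n

    chi2 = ((observed - expected) ** 2 / expected).sum()

    # V is undefined when either variable has only one observed category
    k = min(observed.shape) - 1
    if k == 0:
        return float("nan")
    return float(np.sqrt(chi2 / (n * k)))


# Toy data (made up, not my real dataframe): nation fully determines lang
toy = pd.DataFrame({
    "nation": ["fra", "fra", "usa", "usa", "uga", "uga"],
    "lang":   ["fr",  "fr",  "en",  "en",  "sw",  "sw"],
})
perfect = cramers_v(toy["nation"], toy["lang"])
print(perfect)
```

In this toy case the two columns are perfectly associated, so the coefficient should come out at (approximately) 1.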
If you use pd.crosstab(df[column1], df[column2]), then n = confusion_matrix.sum() needs to be n = confusion_matrix.sum().sum() (numpy sums along all dimensions; pandas, along one only). Great answer and very readable code. – Rayerayfield
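A small sketch of the difference described in this comment, using made-up data (the column values here are assumptions, not taken from the question's dataframe). On a pandas DataFrame, a single .sum() gives one total per column, so a second .sum() or a conversion to a NumPy array is needed to get the grand total.

```python
import numpy as np
import pandas as pd

# Made-up data for illustration
df = pd.DataFrame({
    "nation": ["fra", "fra", "usa", "uga"],
    "lang":   ["fr",  "en",  "en",  "sw"],
})
confusion_matrix = pd.crosstab(df["nation"], df["lang"])

# pandas: DataFrame.sum() reduces along one axis, returning a Series
per_column = confusion_matrix.sum()

# Two ways to get the scalar grand total
n_pandas = confusion_matrix.sum().sum()
n_numpy = confusion_matrix.to_numpy().sum()  # ndarray.sum() covers all axes

print(per_column)
print(n_pandas, n_numpy)
```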