Inter-rater reliability calculation for multi-rater data

I have the following list of lists:

[[1, 1, 1, 1, 3, 0, 0, 1],
 [1, 1, 1, 1, 3, 0, 0, 1],
 [1, 1, 1, 1, 2, 0, 0, 1],
 [1, 1, 0, 2, 3, 1, 0, 1]]

I want to calculate an inter-rater reliability score; there are multiple raters (rows). I cannot use Fleiss' kappa, since the rows do not sum to the same number. What is a good approach in this case?

On asked 6/6, 2019 at 15:59 Comment(1)
One more thing: "The condition of random sampling among raters makes Fleiss' kappa not suited for cases where all raters rate all patients" (wiki.com/fleiss) – in other words, the first subject must be rated by a different set of raters (A, B, C) than the second subject (A, Y, Z). However, this seems to be forgotten in most tutorials and is nearly impossible to verify in published research. en.wikipedia.org/wiki/Fleiss%27_kappa # statistics.laerd.com/spss-tutorials/… – Hardball

Yes, data preparation is key here. Let's walk through it together.

While Krippendorff's alpha may be superior for any number of reasons, numpy and statsmodels provide everything you need to get Fleiss' kappa from the above-mentioned table. Fleiss' kappa is more prevalent in medical research, even though Krippendorff's alpha delivers mostly the same result if used correctly. If they deliver substantially different results, this is usually due to user error, most importantly the format of the input data and the level of measurement (e.g. ordinal vs. nominal) – skip ahead for the solution (transpose & aggregate): Fleiss' kappa 0.845

Pay close attention to which axis represents subjects, raters or categories!
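
(A condensed preview of the walkthrough below, so the whole transpose-and-aggregate recipe is visible in one place; the step-by-step reasoning follows.)

import numpy as np
from statsmodels.stats import inter_rater as irr

orig = [[1, 1, 1, 1, 3, 0, 0, 1],   # raters as rows, subjects as columns
        [1, 1, 1, 1, 3, 0, 0, 1],
        [1, 1, 1, 1, 2, 0, 0, 1]]
table, _ = irr.aggregate_raters(np.array(orig).transpose())   # (subject, category) counts
irr.fleiss_kappa(table, method='fleiss')                      # 0.845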

Fleiss' kappa

import numpy as np
from statsmodels.stats import inter_rater as irr

The original data had raters as rows and subjects as columns with the integers representing the assigned categories (if I'm not mistaken).

I removed one row because there were 4 rows and 4 categories, which could be confusing – so now we have 4 categories [0, 1, 2, 3] and 3 rows.

orig = [[1, 1, 1, 1, 3, 0, 0, 1],
        [1, 1, 1, 1, 3, 0, 0, 1],
        [1, 1, 1, 1, 2, 0, 0, 1]] 

From the documentation of the aggregate_raters() function

"convert raw data with shape (subject, rater) to (subject, cat_counts)"

irr.aggregate_raters(orig)

This returns:

(array([[2, 5, 0, 1],
        [2, 5, 0, 1],
        [2, 5, 1, 0]]),
array([0, 1, 2, 3]))

Now the number of rows in the orig array equals the number of rows in the first returned array (3). The number of columns now equals the number of categories ([0,1,2,3] -> 4). The contents of each row add up to 8, which equals the number of columns in the orig input data – assuming every rater rated every subject. This aggregation shows how the raters are distributed across the categories (columns) for each subject (row). (If agreement were perfect on category 2 we would see [0,0,8,0]; for category 0, [8,0,0,0].)

The function expects the rows to be subjects. Note how the number of subjects has not changed (3 rows). For each subject it counted how many times each category was assigned, by 'looking at' how many times that category (number) appears in the row. For the first row, category 0 was assigned twice, category 1 five times, category 2 never, and category 3 once:

[1, 1, 1, 1, 3, 0, 0, 1] -> [2, 5, 0, 1]
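
The same tally can be reproduced by hand with collections.Counter – just an illustration of what aggregate_raters() is counting, not part of the statsmodels workflow:

from collections import Counter

row = [1, 1, 1, 1, 3, 0, 0, 1]
counts = Counter(row)                          # Counter({1: 5, 0: 2, 3: 1})
[counts.get(cat, 0) for cat in [0, 1, 2, 3]]   # -> [2, 5, 0, 1]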

The second array contains the category values. If we replace both 3s in the input array with 9s, the distribution looks the same but the last category changes.

ori9 = [[1, 1, 1, 1, 9, 0, 0, 1],
        [1, 1, 1, 1, 9, 0, 0, 1],
        [1, 1, 1, 1, 2, 0, 0, 1]] 
(array([[2, 5, 0, 1],
        [2, 5, 0, 1],
        [2, 5, 1, 0]]),
 array([0, 1, 2, 9]))      <- categories

aggregate_raters() returns a tuple of ([data], [categories])

In the [data] the rows stay subjects; aggregate_raters() turns the columns from raters into categories. Fleiss' kappa expects the 'table' data to be in this (subject, category) format: https://en.wikipedia.org/wiki/Fleiss'_kappa#Data
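
A quick sanity check on that tuple: every row of the data part should sum to the number of raters per subject – here 8, because aggregate_raters() is (so far) treating the 8 columns of orig as raters:

irr.aggregate_raters(orig)[0].sum(axis=1)   # -> array([8, 8, 8])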

Now to the solution of the problem:

What happens if we plug the original data into Fleiss' kappa? (We just use the data 'dats', not the category list 'cats'.)

dats, cats = irr.aggregate_raters(orig)
irr.fleiss_kappa(dats, method='fleiss')

-0.12811059907834096

But... why? Well, look at the orig data – aggregate_raters() assumes the columns are raters! This means we have perfect disagreement, e.g. between the first column and the second-to-last column – Fleiss thinks: "the first rater always rated 1 and the second-to-last always rated 0" -> perfect disagreement on all three subjects.
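
To see where that negative value comes from, here is a hand-rolled sketch of the textbook Fleiss formula (fleiss_by_hand is just an illustrative name, not part of statsmodels). Applied to the mis-oriented table it reproduces the -0.128:

def fleiss_by_hand(table):
    # table: (n_subjects, n_categories) counts, every row summing to the same number of raters
    table = np.asarray(table, dtype=float)
    n = table.sum(axis=1)[0]                       # raters per subject
    p_j = table.sum(axis=0) / table.sum()          # overall proportion of each category
    P_i = (np.square(table).sum(axis=1) - n) / (n * (n - 1))   # per-subject agreement
    P_bar, P_e = P_i.mean(), np.square(p_j).sum()
    return (P_bar - P_e) / (1 - P_e)

fleiss_by_hand(dats)   # -> -0.128..., same as above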

So what we need to do is the following (sorry, I'm a noob – this might not be the most elegant way):

giro = np.array(orig).transpose()
giro
array([[1, 1, 1],
       [1, 1, 1],
       [1, 1, 1],
       [1, 1, 1],
       [3, 3, 2],
       [0, 0, 0],
       [0, 0, 0],
       [1, 1, 1]]) 

Now we have subjects as rows and raters as columns (three raters assigning 4 categories). What happens if we plug this into aggregate_raters() and feed the resulting data into fleiss_kappa()? (Using index 0 to grab the first part of the returned tuple.)

irr.fleiss_kappa(irr.aggregate_raters(giro)[0], method='fleiss')

0.8451612903225807

Finally… this makes more sense: all three raters agreed perfectly except on subject 5 [3, 3, 2].
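
Running the hand-rolled sketch from above on the correctly oriented table shows the mechanics: mean observed agreement is about 0.917 (7 of 8 subjects are unanimous, subject 5 contributes 1/3), expected agreement is about 0.462, hence (0.917 - 0.462) / (1 - 0.462) ≈ 0.845.

fleiss_by_hand(irr.aggregate_raters(giro)[0])   # -> 0.845...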

Krippendorff's alpha

The current krippendorff implementation expects the data in the orig format, with raters as rows and subjects as columns – no aggregation function is needed to prepare the data. So I can see how this was the simpler solution. Fleiss' kappa is still very prevalent in medical research, so let's see how it compares:

import krippendorff as kd
kd.alpha(orig)

0.9359

Wow… that's a lot higher than Fleiss' kappa... Well, we need to tell Krippendorff the "Steven's level of measurement of the variable. It must be one of 'nominal', 'ordinal', 'interval', 'ratio' or a callable." – this is for the 'difference function' of Krippendorff's alpha. https://repository.upenn.edu/cgi/viewcontent.cgi?article=1043&context=asc_papers

kd.alpha(orig, level_of_measurement='nominal')

0.8516
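
For completeness, the same transpose-and-aggregate recipe applied to the asker's full four-rater table would look like this (a sketch – no number is quoted here, run it to see how the fourth rater changes both scores):

full = [[1, 1, 1, 1, 3, 0, 0, 1],
        [1, 1, 1, 1, 3, 0, 0, 1],
        [1, 1, 1, 1, 2, 0, 0, 1],
        [1, 1, 0, 2, 3, 1, 0, 1]]
irr.fleiss_kappa(irr.aggregate_raters(np.array(full).transpose())[0], method='fleiss')
kd.alpha(full, level_of_measurement='nominal')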

Hope this helps, I learned a lot writing this.

Hardball answered 25/11, 2021 at 20:41 Comment(0)

One answer to this problem is to use Krippendorff's alpha:

Wikipedia Description

Python Library

import krippendorff

arr = [[1, 1, 1, 1, 3, 0, 0, 1],
       [1, 1, 1, 1, 3, 0, 0, 1],
       [1, 1, 1, 1, 2, 0, 0, 1],
       [1, 1, 0, 2, 3, 1, 0, 1]]    
res = krippendorff.alpha(arr)
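
As the longer answer above shows (0.9359 vs 0.8516), the library's default difference function is not the nominal one, so for categorical codes like these it may be safer to set the level of measurement explicitly (a sketch):

res_nominal = krippendorff.alpha(arr, level_of_measurement='nominal')
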
On answered 7/6, 2019 at 8:37 Comment(2)
I have a similar situation and want to make sure I understand your data representation correctly. Here each row represents one rater's scores, right? For example, the first row [1, 1, 1, 1, 3, 0, 0, 1] represents the first rater. – Likelihood
That is my exact situation, yes. So the solution should be valid in your case too! – On

The basic problem here is that you have not laid out the data properly. See here for the proper organization. You have four categories (ratings 0-3) and eight subjects. Thus, your table must have eight rows and four columns, regardless of the number of raters. For instance, the top row is the tally of ratings given to the first item:

[0, 4, 0, 0]   ... since everyone rated it a `1`.
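
For reference, that full 8 x 4 tally can be built from the question's data with numpy and statsmodels' aggregate_raters (a sketch; the counts follow directly from the ratings):

import numpy as np
from statsmodels.stats import inter_rater as irr

raw = [[1, 1, 1, 1, 3, 0, 0, 1],
       [1, 1, 1, 1, 3, 0, 0, 1],
       [1, 1, 1, 1, 2, 0, 0, 1],
       [1, 1, 0, 2, 3, 1, 0, 1]]
table, _ = irr.aggregate_raters(np.array(raw).transpose())
# table (one row per subject, columns = counts of ratings 0, 1, 2, 3):
# [[0 4 0 0]
#  [0 4 0 0]
#  [1 3 0 0]
#  [0 3 1 0]
#  [0 0 1 3]
#  [3 1 0 0]
#  [4 0 0 0]
#  [0 4 0 0]]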

Your -inf value is from dividing by 0 on the P[j] score for the penultimate column.


My earlier answer, normalizing the scores, was based on my misinterpretation of Fleiss; I had a different reliability in mind. There are many ways to compute such a metric; one is consistency of relative rating points (which you can get with normalization); another is to convert each rater's row into a graph of relative rankings, and compute a similarity among those graphs.

Note that Fleiss is not perfectly applicable to a rating situation with a relative metric: it assumes that this is a classification task, not a ranking. Fleiss is not sensitive to how far apart the ratings are; it knows only that the ratings differed: a (0,1) pairing is just as damaging as a (0,3) pairing.
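
If the distance between ratings does matter for your use case, Krippendorff's alpha (see the other answers) lets you plug in a distance-aware difference function instead of a purely nominal one; a sketch:

import krippendorff
ratings = [[1, 1, 1, 1, 3, 0, 0, 1],
           [1, 1, 1, 1, 3, 0, 0, 1],
           [1, 1, 1, 1, 2, 0, 0, 1],
           [1, 1, 0, 2, 3, 1, 0, 1]]
# 'ordinal' (or 'interval') penalizes a (0, 3) disagreement more than a (0, 1) one,
# unlike the nominal treatment implicit in Fleiss' kappa.
krippendorff.alpha(ratings, level_of_measurement='ordinal')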

Holloway answered 6/6, 2019 at 16:43 Comment(2)
Running this through statsmodels.stats.inter_rater.fleiss_kappa gives a score of -inf; would you know what's going wrong? – On
Yes – you still have to pre-process your data. I gave you only one trivial scaling; you still have to handle that zero-rating case: you have a 0.00 denominator for P[j] when j=6. – Holloway
