How to deal with multiple class ROC analysis in R (pROC package)?
I am using the multiclass.roc function from the pROC package in R. As an example, I trained a random forest on the iris data set. Here is my code:

# randomForest & pROC packages should be installed:
# install.packages(c('randomForest', 'pROC'))
data(iris)
library(randomForest)
library(pROC)
set.seed(1000)
# 3-class in response variable
rf = randomForest(Species~., data = iris, ntree = 100)
# predict(.., type = 'prob') returns a probability matrix
multiclass.roc(iris$Species, predict(rf, iris, type = 'prob'))

And the result is:

Call:
multiclass.roc.default(response = iris$Species, predictor = predict(rf,     
iris, type = "prob"))
Data: predict(rf, iris, type = "prob") with 3 levels of iris$Species: setosa,   
versicolor, virginica.
Multi-class area under the curve: 0.5142

Is this right? Thanks!!!

"pROC" reference: http://www.inside-r.org/packages/cran/pROC/docs/multiclass.roc

Sweetheart answered 11/12, 2013 at 10:29 Comment(0)
A
10

As you saw in the reference, multiclass.roc expects a "numeric vector (...)", and the documentation of roc that is linked from there (for some reason not in the link you provided) further says "of the same length than response". You are passing a numeric matrix with 3 columns, which is clearly wrong, and isn't supported any more since pROC 1.6. I have no idea what it was doing before, probably not what you were expecting.

This means you must summarize your predictions in one single atomic vector of numeric mode. In the case of your model, you could use the following, although it generally doesn't really make sense to convert a factor into a numeric:

predictions <- as.numeric(predict(rf, iris, type = 'response'))
multiclass.roc(iris$Species, predictions)

What this code really does is to compute 3 ROC curves on your predictions (one with setosa vs. versicolor, one with versicolor vs. virginica, and one with setosa vs. virginica) and average their AUC.
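To make the pairwise averaging concrete, here is a minimal sketch that computes the three pairwise ROC curves by hand and averages their AUCs. It assumes the same rf model as in the question; the names class_pairs and pairwise_auc are only illustrative, and the mean is not guaranteed to match multiclass.roc to the last digit in every pROC version.

```r
library(randomForest)
library(pROC)

data(iris)
set.seed(1000)
rf <- randomForest(Species ~ ., data = iris, ntree = 100)

# Predicted class coded as 1, 2, 3 following the factor level order
predictions <- as.numeric(predict(rf, iris, type = "response"))

# All unordered pairs of class labels (3 pairs for 3 classes)
class_pairs <- combn(levels(iris$Species), 2, simplify = FALSE)

# One binary ROC curve per pair, restricted to rows of those two classes
pairwise_auc <- sapply(class_pairs, function(pair) {
  keep <- iris$Species %in% pair
  r <- suppressMessages(
    roc(response = droplevels(iris$Species[keep]),
        predictor = predictions[keep],
        levels = pair)
  )
  as.numeric(auc(r))
})
names(pairwise_auc) <- sapply(class_pairs, paste, collapse = " vs. ")

pairwise_auc        # the individual pairwise AUCs
mean(pairwise_auc)  # their unweighted average
```

Looking at the individual pairwise values is often more informative than the single averaged number.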

Three more comments:

  • I say converting a factor to numeric doesn't make sense because you'll get different results if you don't have a perfect classification and you reorder the levels. This is why it isn't done automatically in pROC: you must think about it in your setup.
  • In general, this multiclass averaging doesn't really make sense, and you're better off rethinking your question in terms of binary classification. There are more advanced multiclass methods (with a ROC surface, etc.) that aren't implemented yet in pROC.
  • As was stated by @cbeleites, it is not correct to evaluate a model with its training data (resubstitution) so in a real example you must keep a test set aside or use cross-validation.
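The first and last comments can be illustrated together. The following is a rough sketch, assuming an arbitrary 70/30 split and a hypothetical reordering of the levels: it evaluates on held-out rows and compares two numeric codings of the predicted class. Whether the two AUCs actually differ depends on the misclassifications in the particular split.

```r
library(randomForest)
library(pROC)

data(iris)
set.seed(1000)

# Hold out about 30% of the rows as a test set (the ratio is an arbitrary choice)
test_idx <- sample(nrow(iris), size = round(0.3 * nrow(iris)))
train <- iris[-test_idx, ]
test  <- iris[test_idx, ]

rf <- randomForest(Species ~ ., data = train, ntree = 100)
pred_class <- predict(rf, test, type = "response")

# Coding 1: default level order (setosa = 1, versicolor = 2, virginica = 3)
auc_default <- as.numeric(suppressMessages(
  multiclass.roc(test$Species, as.numeric(pred_class))$auc
))

# Coding 2: hypothetical reordering (versicolor = 1, setosa = 2, virginica = 3)
reordered <- factor(pred_class, levels = c("versicolor", "setosa", "virginica"))
auc_reordered <- as.numeric(suppressMessages(
  multiclass.roc(test$Species, as.numeric(reordered))$auc
))

c(default = auc_default, reordered = auc_reordered)
```

If the two numbers disagree, that is the level-order dependence described above, not a bug in the code.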
Amphioxus answered 29/12, 2013 at 9:57 Comment(0)
L
1

Assuming that you did the resubstitution estimate only for the sake of a minimal working example, your code looks good to me.

I quickly tried to get an out-of-bag (OOB) prediction with type "prob" but didn't succeed. Thus, you'll need to do a validation external to the randomForest function.
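For what it's worth, the randomForest documentation describes the votes component of a fitted classification forest as the OOB vote fractions, so the following sketch may be one way to obtain OOB class scores without a separate validation loop. It is an assumption that this matches what was attempted here. The one-vs-rest setup and the name is_virginica are illustrative choices, not part of the original question.

```r
library(randomForest)
library(pROC)

data(iris)
set.seed(1000)
rf <- randomForest(Species ~ ., data = iris, ntree = 100)

# Per the randomForest docs: one row per observation, one column per class,
# holding the fraction of OOB votes for each class
oob_votes <- rf$votes

# One-vs-rest binary problem: virginica against the other two species
is_virginica <- factor(iris$Species == "virginica",
                       levels = c(FALSE, TRUE),
                       labels = c("other", "virginica"))

# Higher vote fraction for virginica should indicate the "virginica" class
r <- roc(response = is_virginica,
         predictor = oob_votes[, "virginica"],
         levels = c("other", "virginica"),
         direction = "<")
auc(r)
```

This yields one binary ROC curve per class of interest rather than a single multiclass summary.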

Personally, I'd not try to summarize a whole multiclass model into one unconditional number. But that's an entirely different question.

Lorusso answered 11/12, 2013 at 11:40 Comment(2)
I am so sorry, I did not quite understand what you mean. - Sweetheart
There is an example showing the use of multiclass.auc (link), but I don't know what s100b means. It can't be a probability, so what is it? Thanks! - Sweetheart
Y
0

I copied your code and got an AUC of .83. Not sure what is different.

You are right, the s100b column is not a probability. The aSAH (aneurysmal subarachnoid hemorrhage) data set is a clinical data set. S100B is a protein found in glial cells in the brain. According to the research paper on the data set, the s100b column seems to represent the concentration of the S100B protein (µg/l), likely in a blood sample.
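A quick look at the data makes this visible. The sketch below loads the aSAH data set shipped with pROC and passes the s100b measurement as a single numeric predictor, which is the form the multiclass.roc help page describes. The column names gos6 and s100b are the ones used in the pROC documentation example.

```r
library(pROC)

data(aSAH)

# s100b is a continuous measurement, not a probability
summary(aSAH$s100b)

# One numeric value per patient, the same length as the response
length(aSAH$s100b) == length(aSAH$gos6)

# Multiclass ROC with the outcome scale gos6 as the response
mc <- suppressMessages(multiclass.roc(aSAH$gos6, aSAH$s100b))
mc$auc
```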

Yachting answered 11/12, 2013 at 15:36 Comment(4)
You mean the result of your experiment is 0.83? Also, do you mean s100b is a measured variable rather than a 'score'? When I used the multiclass.roc function, the predictor was a probability matrix, so that must be wrong. What is the correct answer to this question? - Sweetheart
Sorry if I made matters more confusing. The pROC help documentation just states that the predictor argument is a numeric vector, so I believe your initial example on the iris data is correct. I was commenting on the example that you linked to in the pROC documentation (which uses the aSAH data set). After data(aSAH), I looked at the research paper linked in the help (?aSAH). If you search for s100b, you will see that it looks like the concentration of a protein. - Yachting
Actually, my initial iris example used a probability matrix as the predictor in the multiclass.roc function. The problem is that the resulting AUC is barely above chance (0.5142)! So I think my code was wrong, though I don't know how to correct it. Thanks! - Sweetheart
By the way, I modified this code; please have a look at it: here. Thanks! - Sweetheart
