TPR & FPR Curve for different classifiers - kNN, NaiveBayes, Decision Trees in R

I'm trying to understand and plot TPR/FPR for different types of classifiers. I'm using kNN, NaiveBayes and Decision Trees in R. With kNN I'm doing the following:

library(class)   # knn()
library(ROCR)    # prediction(), performance()

# class labels as a numeric vector
clnum <- as.vector(diabetes.trainingLabels[,1], mode = "numeric")
# k-nearest-neighbour classification, keeping the vote proportions in the "prob" attribute
dpknn <- knn(train = diabetes.training, test = diabetes.testing, cl = clnum, k=11, prob = TRUE)
prob <- attr(dpknn, "prob")
tstnum <- as.vector(diabetes.testingLabels[,1], mode = "numeric")
pred_knn <- prediction(prob, tstnum)
pred_knn <- performance(pred_knn, "tpr", "fpr")
plot(pred_knn, avg= "threshold", colorize=TRUE, lwd=3, main="ROC curve for Knn=11")

where diabetes.trainingLabels[,1] is a vector of labels (class) I want to predict, diabetes.training is the training data and diabetes.testing is the testing data.

The plot looks like the following (image: ROC curve for kNN, k = 11):

The values stored in the prob attribute form a numeric vector (decimals between 0 and 1). I convert the class-label factor into numbers and can then use it with the prediction/performance functions from the ROCR library. I'm not 100% sure I'm doing it correctly, but at least it works.

For NaiveBayes and Decision Trees, though, with the prob/raw parameter specified in the predict function I don't get a single numeric vector but a matrix with a probability for each class (I guess), e.g.:

library(e1071)   # naiveBayes()

diabetes.model <- naiveBayes(class ~ ., data = diabetesTrainset)
diabetes.predicted <- predict(diabetes.model, diabetesTestset, type="raw")

and diabetes.predicted is:

tested_negative tested_positive
[1,]    5.787252e-03       0.9942127
[2,]    8.433584e-01       0.1566416
[3,]    7.880800e-09       1.0000000
[4,]    7.568920e-01       0.2431080
[5,]    4.663958e-01       0.5336042

The question is: how do I use this to plot a ROC curve, and why does kNN give me a single vector while the other classifiers give a separate probability for each class?

Ofori answered 17/12, 2015 at 12:51 Comment(0)

ROC curve

The ROC curve you provided for the kNN (k = 11) classifier looks off: it is below the diagonal, indicating that your classifier assigns class labels correctly less than 50% of the time. Most likely you passed wrong class labels or wrong probabilities. If in training you used class labels of 0 and 1, those same class labels should be passed to the ROC curve in the same order (without flipping 0 and 1).
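
One thing worth checking in the knn code specifically (assuming knn() comes from the class package, as its arguments suggest): the prob attribute holds the vote proportion for the winning class of each test case, not for one fixed class, so it has to be converted into a score for the positive class before being handed to ROCR. A minimal sketch, assuming the positive class is coded as 1:

library(class)
library(ROCR)

# attr(dpknn, "prob") is the vote fraction for whichever class won each test case,
# so convert it to P(class == 1) (assumption: 1 is the positive/event class)
prob_win <- attr(dpknn, "prob")
prob_pos <- ifelse(dpknn == "1", prob_win, 1 - prob_win)

pred_knn2 <- prediction(prob_pos, tstnum)
perf_knn2 <- performance(pred_knn2, "tpr", "fpr")
plot(perf_knn2, colorize = TRUE, lwd = 3, main = "ROC curve for kNN, k = 11")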

Another less likely possibility is that you have a very weird dataset.

Probabilities for other classifiers

The ROC curve was originally developed for calling events from radar readings. Technically it is tied to predicting a single event: the probability that you correctly call the event of a plane approaching on the radar. So it uses one probability. This can be confusing when someone does two-class classification where the "hit" probabilities are not evident, as in your case with cases and controls.

However, any two-class classification can be framed in terms of "hits" and "misses": you just have to select the class you will call an "event". In your case, having diabetes might be called the event.

So from this table:

 tested_negative tested_positive
 [1,]    5.787252e-03       0.9942127
 [2,]    8.433584e-01       0.1566416
 [3,]    7.880800e-09       1.0000000
 [4,]    7.568920e-01       0.2431080
 [5,]    4.663958e-01       0.5336042

You only have to select one probability, that of the event, probably "tested_positive". The other one, "tested_negative", is just 1 - tested_positive, because when the classifier thinks a particular person has diabetes with a 79% chance, it at the same time "thinks" there is a 21% chance of that person not having diabetes. You only need one number to express this idea, so knn returns one, while other classifiers can return two.
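
A minimal sketch of how that one column can be fed into ROCR (assuming the test set keeps its true labels in a class column, as in the training formula, and that "tested_positive" is the event):

library(ROCR)

# probability of the event ("tested_positive") for every test case
prob_pos <- diabetes.predicted[, "tested_positive"]
# 1 where the person actually tested positive, 0 otherwise
# (assumption: the label column in the test set is called "class")
labels_pos <- as.numeric(diabetesTestset$class == "tested_positive")

pred_nb <- prediction(prob_pos, labels_pos)
perf_nb <- performance(pred_nb, "tpr", "fpr")
plot(perf_nb, colorize = TRUE, lwd = 3, main = "ROC curve for Naive Bayes")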

I don't know which library you used for the decision trees, so I cannot help with the output of that classifier.

Cinnamon answered 20/3, 2016 at 12:2 Comment(0)

It looks like you are doing something fundamentally wrong. (Image: an example ROC curve for a kNN classifier.)

Ideally, a kNN ROC curve looks like the one above. Here are a few points you can use.

  1. Calculate the distance in your code.
  2. Use the code below for prediction in Python:

# Predicted class
print(model_name.predict(test))

# Indices of the 3 nearest neighbors
print(model_name.kneighbors(test)[1])
Faradmeter answered 10/2, 2021 at 6:56 Comment(0)
