Error in Confusion Matrix : the data and reference factors must have the same number of levels

I've trained a Linear Regression model with R caret. I'm now trying to generate a confusion matrix and keep getting the following error:

Error in confusionMatrix.default(pred, testing$Final) : the data and reference factors must have the same number of levels

EnglishMarks <- read.csv("E:/Subject Wise Data/EnglishMarks.csv", 
header=TRUE)
inTrain<-createDataPartition(y=EnglishMarks$Final,p=0.7,list=FALSE)
training<-EnglishMarks[inTrain,]
testing<-EnglishMarks[-inTrain,]
predictionsTree <- predict(treeFit, testdata)
confusionMatrix(predictionsTree, testdata$catgeory)
modFit<-train(Final~UT1+UT2+HalfYearly+UT3+UT4,method="lm",data=training)
pred<-format(round(predict(modFit,testing)))              
confusionMatrix(pred,testing$Final)

The error occurs when generating the confusion matrix. The levels are the same on both objects. I can't figure out what the problem is. Their structure and levels are given below. They should be the same. Any help would be greatly appreciated, as this is driving me crazy!

> str(pred)
chr [1:148] "85" "84" "87" "65" "88" "84" "82" "84" "65" "78" "78" "88" "85"  
"86" "77" ...
> str(testing$Final)
int [1:148] 88 85 86 70 85 85 79 85 62 77 ...

> levels(pred)
NULL
> levels(testing$Final)
NULL
Babel answered 2/5, 2015 at 11:57 Comment(2)
The clue is right in your str output. See how they are different? pred is of class character and testing$Final is of class integer. When you call format in pred<-format(round(predict(modFit,testing))), it converts the result to character, because format always returns character strings. Why are you calling format? Also, you should probably be calculating the RMSE or MAE of your model; have a look at this: heuristically.wordpress.com/2013/07/12/… - Medievalist
@Medievalist I have now converted the character result to integer using pred<-as.integer(format(round(predict(modFit,testing)))), but the same error still persists. I don't know where I am going wrong. - Babel
P
24

I had the same issue. I guess it happened because one of the arguments was not cast to a factor, as I had expected. Try:

confusionMatrix(pred,as.factor(testing$Final))

Hope it helps.

Psychiatry answered 8/4, 2019 at 1:8 Comment(1)
It did the trick for me. Thanks for sharing :)) - Lisk
T
15
confusionMatrix(pred,testing$Final)

Whenever you build a confusion matrix, make sure that both the true values and the predicted values are of the factor data type.

Here, both pred and testing$Final must be factors. Instead of checking the levels, check the type of both variables and convert them to factors if they are not.

Here, testing$Final is of type int. Convert it to a factor and then build the confusion matrix.
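
As a minimal sketch of that check, assuming made-up marks (the short vectors pred and actual below are hypothetical stand-ins for pred and testing$Final), the following base R code inspects the class of each vector and converts both to factors over one shared set of levels:

```r
# Hypothetical stand-ins for pred (character, as returned by format())
# and testing$Final (integer)
pred   <- c("85", "84", "87", "65")
actual <- c(88L, 85L, 86L, 65L)

# Check the types first: neither vector is a factor yet
class(pred)
class(actual)
is.factor(pred)

# Build one shared set of levels from every value seen in either vector
all_levels <- sort(union(as.integer(pred), actual))

pred_f   <- factor(as.integer(pred), levels = all_levels)
actual_f <- factor(actual, levels = all_levels)

# Both factors now carry identical levels
identical(levels(pred_f), levels(actual_f))

# With caret loaded, the two factors could then be passed on as
# confusionMatrix(pred_f, actual_f)
```

Converting each vector on its own with as.factor would only give each factor the values it happens to contain, so the shared levels are what make the two sides comparable.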

Tallboy answered 31/7, 2018 at 9:36 Comment(0)
G
13

Run table(pred) and table(testing$Final). You will see that at least one number in the testing set is never predicted (i.e. never present in pred). This is what is meant by "different number of levels". There is an example of a custom-made function to get around this problem here.

However, I found that this trick works fine:

table(factor(pred, levels=min(test):max(test)), 
      factor(test, levels=min(test):max(test)))

It should give you exactly the same confusion matrix as with the function.
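
To see the effect on a small scale, here is a sketch with invented vectors (pred and test are hypothetical, not the asker's data). It first finds the values that occur in only one of the two vectors, then applies the shared-levels trick so that the resulting table is square:

```r
# Invented data: 86, 70 and 62 occur in test but are never predicted
test <- c(88, 85, 86, 70, 85, 62)
pred <- c(85, 84, 85, 65, 88, 65)

# Values present in the true marks but absent from the predictions
missing_in_pred <- setdiff(unique(test), unique(pred))
missing_in_pred

# Without shared levels the table is not square
dim(table(pred, test))

# With shared levels every value gets both a row and a column
lv <- min(c(pred, test)):max(c(pred, test))
cm <- table(factor(pred, levels = lv), factor(test, levels = lv))
dim(cm)
```

The range here spans both vectors, a small deviation from the snippet above, so that a prediction outside the range of test is not silently turned into NA.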

Gilbert answered 10/5, 2015 at 4:25 Comment(0)
F
6

Something like the following seems to work for me. The idea is similar to that of @nayriz:

confusionMatrix(
  factor(pred, levels = 1:148),
  factor(testing$Final, levels = 1:148)
)

The key is to make sure the factor levels match.

Fungi answered 30/4, 2018 at 20:57 Comment(0)
A
5

On a similar error, I forced the GLM predictions to have the same class as the dependent variable.

For example, a GLM will predict a "numeric" class. But with the target variable being a "factor" class, I ran into an error.

erroneous code:

# Predicting using the logistic model
glm.probs = predict(model_glm, newdata = test, type = "response")
test$pred_glm = ifelse(glm.probs > 0.5, "1", "0")

# Checking the accuracy of the logistic model
confusionMatrix(test$default, test$pred_glm)

Result:

Error: `data` and `reference` should be factors with the same levels.

corrected code:

# Predicting using the logistic model
glm.probs = predict(model_glm, newdata = test, type = "response")
test$pred_glm = ifelse(glm.probs > 0.5, "1", "0")
test$pred_glm = as.factor(test$pred_glm)

# Checking the accuracy of the logistic model
confusionMatrix(test$default, test$pred_glm)

Result:

Confusion Matrix and Statistics

          Reference
Prediction     0     1
         0   182  1317
         1   122 22335
                                          
               Accuracy : 0.9399          
                 95% CI : (0.9368, 0.9429)
    No Information Rate : 0.9873          
    P-Value [Acc > NIR] : 1          
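
One caveat with as.factor, sketched here under assumptions (the target factor and the probabilities are simulated, and the 0.5 cutoff is carried over from the code above): if every probability falls on one side of the cutoff, as.factor produces a single level and the levels no longer match. Reusing the levels of the target variable avoids that:

```r
# Simulated target and predicted probabilities (hypothetical values)
default   <- factor(c("0", "1", "1", "1"), levels = c("0", "1"))
glm_probs <- c(0.91, 0.97, 0.88, 0.99)  # all above the 0.5 cutoff

labels <- ifelse(glm_probs > 0.5, "1", "0")

# as.factor() only knows the labels that actually occur
nlevels(as.factor(labels))

# factor() with explicit levels keeps both classes
pred_glm <- factor(labels, levels = levels(default))
identical(levels(pred_glm), levels(default))

# A base R cross-tabulation of predictions against the target
table(Prediction = pred_glm, Reference = default)
```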
Aerie answered 28/1, 2021 at 17:26 Comment(0)
F
0

I had this problem due to NAs for the target variable in the dataset. If you're using the tidyverse, you can use the drop_na function to remove rows that contain NAs. Like this:

library(tidyr) # provides drop_na()
iris %>% drop_na(Species) # Removes rows where Species column has NA
iris %>% drop_na() # Removes rows where any column has NA

For base R, it might look something like:

iris[! is.na(iris$Species), ] # Removes rows where Species column has NA
na.omit(iris) # Removes rows where any column has NA
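
When the predictions and the true labels live in separate vectors rather than in one data frame, a minimal base R sketch (with hypothetical vectors truth and pred) is to drop every position where either one is NA before tabulating:

```r
# Hypothetical labels and predictions, each with one missing value
truth <- factor(c("a", "b", NA, "a", "b"))
pred  <- factor(c("a", "b", "a", NA, "a"), levels = levels(truth))

# Keep only the positions where both values are present
keep <- !is.na(truth) & !is.na(pred)

cm <- table(Prediction = pred[keep], Reference = truth[keep])
cm
```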
Figurine answered 11/12, 2020 at 17:0 Comment(0)
D
0

We get this error when creating the confusion matrix. When creating a confusion matrix, we need to make sure that the predicted values and the actual values are both of the "factor" data type. If they are of any other data type, we must convert them to factors before generating the confusion matrix. After this conversion, build the confusion matrix.

predicted <- factor(predict(treeFit, testdata))
real <- factor(testdata$catgeory)
my_data1 <- data.frame(data = predicted, type = "prediction")
my_data2 <- data.frame(data = real, type = "real")
# rbind() on data frames expands the factor levels of the "data" column
# so that both parts share one set of levels
my_data3 <- rbind(my_data1, my_data2)
# Check if the levels are identical
identical(levels(my_data3[my_data3$type == "prediction", 1]),
          levels(my_data3[my_data3$type == "real", 1]))
confusionMatrix(my_data3[my_data3$type == "prediction", 1],
                my_data3[my_data3$type == "real", 1], dnn = c("Prediction", "Reference"))
Decant answered 8/12, 2021 at 14:25 Comment(1)
C
-4

You are using regression and trying to generate a confusion matrix. I believe a confusion matrix is used for classification tasks. For regression, people generally use metrics such as R^2 and RMSE.
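
For illustration, these metrics can be computed in a few lines of base R. This is only a sketch: the vectors actual and predicted are invented stand-ins for testing$Final and the model's predictions.

```r
# Hypothetical true marks and predicted marks
actual    <- c(88, 85, 86, 70, 85)
predicted <- c(85, 84, 87, 65, 88)

errors <- actual - predicted

# Root mean squared error
rmse <- sqrt(mean(errors^2))

# Mean absolute error
mae <- mean(abs(errors))

# R^2 as 1 minus residual sum of squares over total sum of squares
r2 <- 1 - sum(errors^2) / sum((actual - mean(actual))^2)

c(RMSE = rmse, MAE = mae, R2 = r2)
```

caret also ships postResample(pred, obs), which reports comparable metrics for a fitted model.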

Columbite answered 1/1, 2019 at 2:32 Comment(2)
Regression can be used for classification tasks as well. - Poundal
As long as it has 2 classes. - Lanti
