Warning message: "missing values in resampled performance measures" in caret train() using rpart

I am using the caret package to train a model with method = "rpart":

tr = train(y ~ ., data = trainingDATA, method = "rpart")

The data has no missing values or NAs, but running the command produces a warning:

    Warning message:
In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,  :
  There were missing values in resampled performance measures.

Does anyone know (or could point me to where to find an answer) what this warning means? I understand it is telling me that there were missing values in the resampled performance measures, but what exactly does that mean, and how can such a situation arise? BTW, the predict() function works fine with the fitted model, so this is just curiosity.

Gladiolus answered 9/11, 2014 at 13:51 Comment(0)

It's hard to say definitively without more data.

If this is regression, the most likely cause is that the tree did not find a good split and used the average of the outcome as the predictor. That's fine, but you cannot calculate R^2 because the variance of the predictions is zero.

If it's classification, it's harder to say. You could have a resample where one of the outcome classes has zero samples, so sensitivity or specificity would be undefined and therefore NA.
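The regression case can be reproduced in a few lines of base R (the numbers below are made up for illustration): a tree that makes no split predicts the outcome mean everywhere, and R^2 computed from constant predictions comes out NA.

```r
# A tree with no split predicts the mean of the outcome for every row.
obs   <- c(3, 4, 5, 6, 7, 5, 4, 6, 5, 5)
preds <- rep(mean(obs), length(obs))   # zero-variance predictions

# caret computes R^2 as the squared correlation of predictions and
# observations; cor() of a constant vector is NA, hence the warning.
rsq <- suppressWarnings(cor(preds, obs)^2)
is.na(rsq)   # TRUE
```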

Cetology answered 14/11, 2014 at 20:36 Comment(3)
Thanks @Cetology. It is regression, so "no good split" is a plausible reason. BTW, do you know of any good book explaining regression with random forests?Gladiolus
@topepo, I have been experiencing the same problem with rpart and nnet. For the latter I simply had to set linout = TRUE to get rid of the warning message and obtain proper cross-validation predictions. However, I could not find a solution for rpart yet: cross-validation predictions were perfectly fine. I have the feeling that rpart is expecting some argument which we cannot pass using train such as method = "anova". The help page of rpart says that "it is wisest to specify the method directly".Riha
I have the same problem with random forest (method = rf), but only if the number of rows in the data set is too small. With a bigger data set (same structure as the smaller one) the warning doesn't occur.Enrika

The Problem

The problem is that rpart uses a tree-based algorithm, and tree-based implementations can only handle a limited number of levels in a single factor feature. So you may have a variable that has been set to a factor with more than 53 categories:

> rf.1 <- randomForest(x = rf.train.2, 
+                      y = rf.label, 
+                      ntree = 1000)
Error in randomForest.default(x = rf.train.2, y = rf.label, ntree = 1000) : 
Can not handle categorical predictors with more than 53 categories.

Under the hood, caret is running that function, so make sure to fix any categorical variables with more than 53 levels.
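A quick way to check for this before training is to count the levels of every factor column; this is a sketch with a toy df_train (the column names and the 53-level cutoff mirror the example above):

```r
# Toy data frame with one factor that exceeds randomForest's 53-level limit.
df_train <- data.frame(zipcode = factor(rep(10000:10099, 2)),  # 100 levels
                       price   = runif(200))

# Count levels per column; non-factor columns count as 0.
n_levels <- vapply(df_train,
                   function(col) if (is.factor(col)) nlevels(col) else 0L,
                   integer(1))
names(n_levels[n_levels > 53])   # columns to fix before training
```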

Here is where my problem lay before (notice zipcode coming in as a factor):

# ------------------------------- #
# RANDOM FOREST WITH CV 10 FOLDS  #
# ------------------------------- #
rf.train.2 <- df_train[, c("v1",
                           "v2",
                           "v3",
                           "v4",
                           "v5",
                           "v6",
                           "v7",
                           "v8",
                           "zipcode",
                           "price",
                           "made_purchase")]
rf.train.2 <- data.frame(v1 = as.factor(rf.train.2$v1),
                         v2 = as.factor(rf.train.2$v2),
                         v3 = as.factor(rf.train.2$v3),
                         v4 = as.factor(rf.train.2$v4),
                         v5 = as.factor(rf.train.2$v5),
                         v6 = as.factor(rf.train.2$v6),
                         v7 = as.factor(rf.train.2$v7),
                         v8 = as.factor(rf.train.2$v8),
                         zipcode = as.factor(rf.train.2$zipcode),
                         price = rf.train.2$price,
                         made_purchase = as.factor(rf.train.2$made_purchase))
rf.label <- rf.train.2[, "made_purchase"]

The Solution

Remove all categorical variables that have more than 53 levels.

Here is my fixed-up code, adjusting the categorical variable zipcode. Alternatively, you could have wrapped it in a numeric conversion like this: as.numeric(rf.train.2$zipcode).

# ------------------------------- #
# RANDOM FOREST WITH CV 10 FOLDS  #
# ------------------------------- #
rf.train.2 <- df_train[, c("v1",
                           "v2",
                           "v3",
                           "v4",
                           "v5",
                           "v6",
                           "v7",
                           "v8",
                           "zipcode",
                           "price",
                           "made_purchase")]
rf.train.2 <- data.frame(v1 = as.factor(rf.train.2$v1),
                         v2 = as.factor(rf.train.2$v2),
                         v3 = as.factor(rf.train.2$v3),
                         v4 = as.factor(rf.train.2$v4),
                         v5 = as.factor(rf.train.2$v5),
                         v6 = as.factor(rf.train.2$v6),
                         v7 = as.factor(rf.train.2$v7),
                         v8 = as.factor(rf.train.2$v8),
                         zipcode = rf.train.2$zipcode,
                         price = rf.train.2$price,
                         made_purchase = as.factor(rf.train.2$made_purchase))
rf.label <- rf.train.2[, "made_purchase"]
Cavendish answered 21/1, 2017 at 4:5 Comment(2)
I have only male/female levels and still got the same error message.Hedgepeth
Perhaps I'm wrong, but it's unclear to me that it would be wise to convert something like a zip code into an integer. If it's an integer, the algorithm will treat it as a numeric covariate instead of a factor, so zipcode 55105 is one unit greater than 55104, when really they don't have that kind of relationship. I think you'd be better off reducing the precision of the zipcode, perhaps down to just the first two digits. I realize this discussion is kind of stale, but I thought it was worth discussing anyway.Wellgrounded

This warning happens when the model doesn't converge in some cross-validation folds, so the predictions have zero variance. As a result, metrics like RMSE or R-squared can't be calculated and become NA. Sometimes there are parameters you can tune for better convergence; for example, the neuralnet library lets you increase threshold, which almost always leads to convergence. However, I'm not sure about the rpart library.

Another reason this can happen is that you already have NAs in your training data. Then the obvious cure is to remove them before training, e.g. train(..., data = na.omit(training.data)).
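As a base-R sketch (hypothetical toy data), na.omit() drops every row that contains at least one NA:

```r
training.data <- data.frame(y = c(1, 2, NA, 4),
                            x = c(5, NA, 7, 8))
clean <- na.omit(training.data)   # keeps only complete rows
nrow(clean)                       # 2
```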

Hope that enlightens a bit.

Calderon answered 27/1, 2019 at 10:8 Comment(0)

I was hitting the same warning when fitting training data to a single decision tree, but it got resolved once I removed the NA values from the raw data before splitting into training and test sets. I guess there was a mismatch of data when splitting and fitting the model. Steps: 1. Remove NAs from the raw data. 2. Split into training and test sets. 3. Train the model; hopefully the warning is gone.

Reece answered 7/11, 2019 at 11:19 Comment(1)
This could be the problem, but the OP said the data has no missing values or NAs.Mcleod

In my case, aided by bmc's answer, I discovered it was because the outcome column was numeric (as provided by the dataset). Converting it to a factor and then running train succeeded with no error.
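A sketch of the idea (toy 0/1 outcome, not the original dataset): caret decides between regression and classification based on the class of the outcome column, so a numeric 0/1 column is treated as regression until you convert it.

```r
y <- c(0, 1, 1, 0)                       # numeric outcome -> regression
y <- factor(y, labels = c("no", "yes"))  # factor outcome -> classification
is.factor(y)   # TRUE
levels(y)      # "no"  "yes"
```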

Pigsty answered 31/8, 2020 at 22:15 Comment(0)

My problem was that I accidentally called createDataPartition() (or its friends: createFolds(), createMultiFolds(), etc.) on the full data set before it was divided into training and validation sets.

The result was that some of the indexes in the cross-validation lists exceeded the number of rows in the training data.
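A minimal sketch of the mismatch (the counts are made up): if the index vector is built from the full data but train() only sees the training subset, some indexes necessarily point past its last row.

```r
full_n  <- 150   # rows in the full (undivided) data
train_n <- 100   # rows actually passed to train()

idx <- sample(full_n, 120)   # stand-in for createDataPartition() output
any(idx > train_n)           # TRUE: 120 distinct indexes can't all fit in 1..100
```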

Damico answered 17/1, 2022 at 13:22 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.