predict.lm() with an unknown factor level in test data
Asked Answered
M

7

41

I am fitting a model to factor data and predicting. If the newdata in predict.lm() contains a single factor level that is unknown to the model, all of predict.lm() fails and returns an error.

Is there a good way to have predict.lm() return a prediction for those factor levels the model knows and NA for unknown factor levels, instead of only an error?

Example code:

foo <- data.frame(response=rnorm(3),predictor=as.factor(c("A","B","C")))
model <- lm(response~predictor,foo)
foo.new <- data.frame(predictor=as.factor(c("A","B","C","D")))
predict(model,newdata=foo.new)

I would like the very last command to return three "real" predictions corresponding to factor levels "A", "B" and "C" and an NA corresponding to the unknown level "D".

Mexico answered 26/11, 2010 at 12:15 Comment(0)
R
6

Tidied and extended the function by MorgenBall. It is also implemented in sperrorest now.

Additional features

  • drops unused factor levels rather than just setting the missing values to NA.
  • issues a message to the user that factor levels have been dropped
  • checks for existence of factor variables in test_data and returns original data.frame if non are present
  • works not only for lm, glm and but also for glmmPQL

Note: The function shown here may change (improve) over time.

#' @title remove_missing_levels
#' @description Accounts for missing factor levels present only in test data
#' but not in train data by setting values to NA
#'
#' @import magrittr
#' @importFrom gdata unmatrix
#' @importFrom stringr str_split
#'
#' @param fit fitted model on training data
#'
#' @param test_data data to make predictions for
#'
#' @return data.frame with matching factor levels to fitted model
#'
#' @keywords internal
#'
#' @export
remove_missing_levels <- function(fit, test_data) {

  # https://mcmap.net/q/388735/-predict-lm-with-an-unknown-factor-level-in-test-data

  # drop empty factor levels in test data
  test_data %>%
    droplevels() %>%
    as.data.frame() -> test_data

  # 'fit' object structure of 'lm' and 'glmmPQL' is different so we need to
  # account for it
  if (any(class(fit) == "glmmPQL")) {
    # Obtain factor predictors in the model and their levels
    factors <- (gsub("[-^0-9]|as.factor|\\(|\\)", "",
                     names(unlist(fit$contrasts))))
    # do nothing if no factors are present
    if (length(factors) == 0) {
      return(test_data)
    }

    map(fit$contrasts, function(x) names(unmatrix(x))) %>%
      unlist() -> factor_levels
    factor_levels %>% str_split(":", simplify = TRUE) %>%
      extract(, 1) -> factor_levels

    model_factors <- as.data.frame(cbind(factors, factor_levels))
  } else {
    # Obtain factor predictors in the model and their levels
    factors <- (gsub("[-^0-9]|as.factor|\\(|\\)", "",
                     names(unlist(fit$xlevels))))
    # do nothing if no factors are present
    if (length(factors) == 0) {
      return(test_data)
    }

    factor_levels <- unname(unlist(fit$xlevels))
    model_factors <- as.data.frame(cbind(factors, factor_levels))
  }

  # Select column names in test data that are factor predictors in
  # trained model

  predictors <- names(test_data[names(test_data) %in% factors])

  # For each factor predictor in your data, if the level is not in the model,
  # set the value to NA

  for (i in 1:length(predictors)) {
    found <- test_data[, predictors[i]] %in% model_factors[
      model_factors$factors == predictors[i], ]$factor_levels
    if (any(!found)) {
      # track which variable
      var <- predictors[i]
      # set to NA
      test_data[!found, predictors[i]] <- NA
      # drop empty factor levels in test data
      test_data %>%
        droplevels() -> test_data
      # issue warning to console
      message(sprintf(paste0("Setting missing levels in '%s', only present",
                             " in test data but missing in train data,",
                             " to 'NA'."),
                      var))
    }
  }
  return(test_data)
}

We can apply this function to the example in the question as follows:

predict(model,newdata=remove_missing_levels (fit=model, test_data=foo.new))

While trying to improve this function, I came across the fact that SL learning methods like lm, glm etc. need the same levels in train & test while ML learning methods (svm, randomForest) fail if the levels are removed. These methods need all levels in train & test.

A general solution is quite hard to achieve since every fitted model has a different way of storing their factor level component (fit$xlevels for lm and fit$contrasts for glmmPQL). At least it seems to be consistent across lm related models.

Recor answered 1/6, 2017 at 20:4 Comment(3)
While you've coded up a pretty handy function, I just noticed that this code won't work for a data set with variable names ending in numbers.Chiles
sperrorest has now been subsumed by mlr. Where in mlr is this method?Ligneous
@Ligneous use fix.factor.prediction in makeLearner(), e.g. makeLearner("regr.lm", fix.factors.prediction = TRUE)Recor
S
32

You have to remove the extra levels before any calculation, like:

> id <- which(!(foo.new$predictor %in% levels(foo$predictor)))
> foo.new$predictor[id] <- NA
> predict(model,newdata=foo.new)
         1          2          3          4 
-0.1676941 -0.6454521  0.4524391         NA 

This is a more general way of doing it, it will set all levels that do not occur in the original data to NA. As Hadley mentioned in the comments, they could have chosen to include this in the predict() function, but they didn't

Why you have to do that becomes obvious if you look at the calculation itself. Internally, the predictions are calculated as :

model.matrix(~predictor,data=foo) %*% coef(model)
        [,1]
1 -0.1676941
2 -0.6454521
3  0.4524391

At the bottom you have both model matrices. You see that the one for foo.new has an extra column, so you can't use the matrix calculation any more. If you would use the new dataset to model, you would also get a different model, being one with an extra dummy variable for the extra level.

> model.matrix(~predictor,data=foo)
  (Intercept) predictorB predictorC
1           1          0          0
2           1          1          0
3           1          0          1
attr(,"assign")
[1] 0 1 1
attr(,"contrasts")
attr(,"contrasts")$predictor
[1] "contr.treatment"

> model.matrix(~predictor,data=foo.new)
  (Intercept) predictorB predictorC predictorD
1           1          0          0          0
2           1          1          0          0
3           1          0          1          0
4           1          0          0          1
attr(,"assign")
[1] 0 1 1 1
attr(,"contrasts")
attr(,"contrasts")$predictor
[1] "contr.treatment"

You can't just delete the last column from the model matrix either, because even if you do that, both other levels are still influenced. The code for level A would be (0,0). For B this is (1,0), for C this (0,1) ... and for D it is again (0,0)! So your model would assume that A and D are the same level if it would naively drop the last dummy variable.

On a more theoretical part: It is possible to build a model without having all the levels. Now, as I tried to explain before, that model is only valid for the levels you used when building the model. If you come across new levels, you have to build a new model to include the extra information. If you don't do that, the only thing you can do is delete the extra levels from the dataset. But then you basically lose all information that was contained in it, so it's generally not considered good practice.

Swanskin answered 26/11, 2010 at 12:38 Comment(10)
I'm not entirely sure why this would be impossible in theory... if (if! I should have specified this in advance) I use a contr.treatment model matrix, the other factor levels should not be influenced, should they?Mexico
I very much appreciate your explanations, but I still don't get it... Yes, of course a 3-level factor and a 4-level factor don't carry the same information. But why shouldn't one make predictions for the factor levels one has already seen? Yes, the model matrix for the 4-level factor won't fit the coefficients for the 3-level factor, but one could simply remove the column that corresponds to the unknown level. My application is forecasting sales depending on day of week - and shouldn't one be able to forecast sales for a Monday (which we have seen), even if the store was never open on Sunday?Mexico
@Stephan : Off course. But not if you have sale data on sunday that you didn't bring into the original model. Because a shop that sells on sunday will not sell the same amount on monday as a shop that didn't open on sunday. Hence, model and new data are not compatible because they don't talk about exactly the same thing. That's the thing with statistics: it's math, it's not some general theory.Swanskin
@Stephan: added another angle to look at it, maybe that clears things up.Swanskin
I think you're off base here - there are many situations in which you might not know all of the possible values in advance, and when encountering a new value returning a missing value is a sensible choice. The fact that the model matrix would have a different representation is a red herring.Zacharia
@hadley: off course there are many situations where you don't know all of the possible values. But you can't possibly expect your model to use them in any sense. Whether your function cuts them out for you before calculating the prediction and returns NA, or you do that yourself before calling the function, is merely a decision of function design. And for predict(), they chose to give you an error. But bottom line is that you can't use those levels in the calculation, they have to be removed before you calculate anything.Swanskin
@hadley: I hope it's a bit more clear now what I mean. I'm not a native english speaker, so I might have missed some of the subtleties of the english language. Sorry if that's the case.Swanskin
One of the assumptions of Linear/Logistic Regressions is to little or no multi-collinearity; so if the predictor variables are ideally independent of each other, then the model does not need to see all the possible variety of factor levels. A new factor level (D) is a new predictor, and can be set to NA without affecting the predicting ability of the remaining factors A,B,C. This is why the model should still be able to make predictions. But addition of the new level D throws off the expected schema. That's the whole issue. Setting NA fixes that.Sayette
I'd like to know why this suddenly received two downvotes. Not minding the downvote, but I'd like to know where I'm supposed to be wrong.Swanskin
I'll un-accept because @pat-s gave an answer that actually does what I asked for. To be honest, I don't see how the discussion here answered my key argument: suppose I have historical sales data for (many) Mondays through Saturdays, but none for a Sunday, and I fit a model containing a day of week factor. Then I try to predict sales for a full week. I'm good with getting an NA prediction for Sunday, but I still don't see why predict should fail on Monday-Saturday. I suspect that some downvotes followed a similar logic. (I didn't downvote, BTW)Mexico
I
7

If you want to deal with the missing levels in your data after creating your lm model but before calling predict (given we don't know exactly what levels might be missing beforehand) here is function I've built to set all levels not in the model to NA - the prediction will also then give NA and you can then use an alternative method to predict these values.

object will be your lm output from lm(...,data=trainData)

data will be the data frame you want to create predictions for

missingLevelsToNA<-function(object,data){

  #Obtain factor predictors in the model and their levels ------------------

  factors<-(gsub("[-^0-9]|as.factor|\\(|\\)", "",names(unlist(object$xlevels))))
  factorLevels<-unname(unlist(object$xlevels))
  modelFactors<-as.data.frame(cbind(factors,factorLevels))


  #Select column names in your data that are factor predictors in your model -----

  predictors<-names(data[names(data) %in% factors])


  #For each factor predictor in your data if the level is not in the model set the value to NA --------------

  for (i in 1:length(predictors)){
    found<-data[,predictors[i]] %in% modelFactors[modelFactors$factors==predictors[i],]$factorLevels
    if (any(!found)) data[!found,predictors[i]]<-NA
  }

  data

}
Irrational answered 14/9, 2016 at 16:29 Comment(2)
Thank you for this function. I think predict() should do this internally, and send a warning, instead of failing completely.Breakaway
This is a very nice solution!Kohl
R
6

Tidied and extended the function by MorgenBall. It is also implemented in sperrorest now.

Additional features

  • drops unused factor levels rather than just setting the missing values to NA.
  • issues a message to the user that factor levels have been dropped
  • checks for existence of factor variables in test_data and returns original data.frame if non are present
  • works not only for lm, glm and but also for glmmPQL

Note: The function shown here may change (improve) over time.

#' @title remove_missing_levels
#' @description Accounts for missing factor levels present only in test data
#' but not in train data by setting values to NA
#'
#' @import magrittr
#' @importFrom gdata unmatrix
#' @importFrom stringr str_split
#'
#' @param fit fitted model on training data
#'
#' @param test_data data to make predictions for
#'
#' @return data.frame with matching factor levels to fitted model
#'
#' @keywords internal
#'
#' @export
remove_missing_levels <- function(fit, test_data) {

  # https://mcmap.net/q/388735/-predict-lm-with-an-unknown-factor-level-in-test-data

  # drop empty factor levels in test data
  test_data %>%
    droplevels() %>%
    as.data.frame() -> test_data

  # 'fit' object structure of 'lm' and 'glmmPQL' is different so we need to
  # account for it
  if (any(class(fit) == "glmmPQL")) {
    # Obtain factor predictors in the model and their levels
    factors <- (gsub("[-^0-9]|as.factor|\\(|\\)", "",
                     names(unlist(fit$contrasts))))
    # do nothing if no factors are present
    if (length(factors) == 0) {
      return(test_data)
    }

    map(fit$contrasts, function(x) names(unmatrix(x))) %>%
      unlist() -> factor_levels
    factor_levels %>% str_split(":", simplify = TRUE) %>%
      extract(, 1) -> factor_levels

    model_factors <- as.data.frame(cbind(factors, factor_levels))
  } else {
    # Obtain factor predictors in the model and their levels
    factors <- (gsub("[-^0-9]|as.factor|\\(|\\)", "",
                     names(unlist(fit$xlevels))))
    # do nothing if no factors are present
    if (length(factors) == 0) {
      return(test_data)
    }

    factor_levels <- unname(unlist(fit$xlevels))
    model_factors <- as.data.frame(cbind(factors, factor_levels))
  }

  # Select column names in test data that are factor predictors in
  # trained model

  predictors <- names(test_data[names(test_data) %in% factors])

  # For each factor predictor in your data, if the level is not in the model,
  # set the value to NA

  for (i in 1:length(predictors)) {
    found <- test_data[, predictors[i]] %in% model_factors[
      model_factors$factors == predictors[i], ]$factor_levels
    if (any(!found)) {
      # track which variable
      var <- predictors[i]
      # set to NA
      test_data[!found, predictors[i]] <- NA
      # drop empty factor levels in test data
      test_data %>%
        droplevels() -> test_data
      # issue warning to console
      message(sprintf(paste0("Setting missing levels in '%s', only present",
                             " in test data but missing in train data,",
                             " to 'NA'."),
                      var))
    }
  }
  return(test_data)
}

We can apply this function to the example in the question as follows:

predict(model,newdata=remove_missing_levels (fit=model, test_data=foo.new))

While trying to improve this function, I came across the fact that SL learning methods like lm, glm etc. need the same levels in train & test while ML learning methods (svm, randomForest) fail if the levels are removed. These methods need all levels in train & test.

A general solution is quite hard to achieve since every fitted model has a different way of storing their factor level component (fit$xlevels for lm and fit$contrasts for glmmPQL). At least it seems to be consistent across lm related models.

Recor answered 1/6, 2017 at 20:4 Comment(3)
While you've coded up a pretty handy function, I just noticed that this code won't work for a data set with variable names ending in numbers.Chiles
sperrorest has now been subsumed by mlr. Where in mlr is this method?Ligneous
@Ligneous use fix.factor.prediction in makeLearner(), e.g. makeLearner("regr.lm", fix.factors.prediction = TRUE)Recor
L
2

Sounds like you might like random effects. Look into something like glmer (lme4 package). With a Bayesian model, you'll get effects that approach 0 when there's little information to use when estimating them. Warning, though, that you'll have to do prediction yourself, rather than using predict().

Alternatively, you can simply make dummy variables for the levels you want to include in the model, e.g. a variable 0/1 for Monday, one for Tuesday, one for Wednesday, etc. Sunday will be automatically removed from the model if it contains all 0's. But having a 1 in the Sunday column in the other data won't fail the prediction step. It will just assume that Sunday has an effect that's average the other days (which may or may not be true).

Lewendal answered 11/11, 2013 at 18:59 Comment(1)
thanks, I found this answer helpful for my question on CV: stats.stackexchange.com/questions/172696/…Charentemaritime
S
2

One of the assumptions of Linear/Logistic Regressions is to little or no multi-collinearity; so if the predictor variables are ideally independent of each other, then the model does not need to see all the possible variety of factor levels. A new factor level (D) is a new predictor, and can be set to NA without affecting the predicting ability of the remaining factors A,B,C. This is why the model should still be able to make predictions. But addition of the new level D throws off the expected schema. That's the whole issue. Setting NA fixes that.

Sayette answered 8/5, 2015 at 22:32 Comment(0)
W
1

The lme4 package will handle new levels if you set the flag allow.new.levels=TRUE when calling predict.

Example: if your day of week factor is in a variable dow and a categorical outcome b_fail, you could run

M0 <- lmer(b_fail ~ x + (1 | dow), data=df.your.data, family=binomial(link='logit')) M0.preds <- predict(M0, df.new.data, allow.new.levels=TRUE)

This is an example with a random effects logistic regression. Of course, you can perform regular regression ... or most GLM models. If you want to head further down the Bayesian path, look at Gelman & Hill's excellent book and the Stan infrastructure.

Wilcox answered 30/6, 2017 at 0:26 Comment(1)
That sounds helpful. Could you perhaps edit your answer to include runnable code? If I simply change lm to lmer, R complains that I didn't specify any random effects.Mexico
F
0

A quick-and-dirty solution for split testing, is to recode rare values as "other". Here is an implementation:

rare_to_other <- function(x, fault_factor = 1e6) {
  # dirty dealing with rare levels:
  # recode small cells as "other" before splitting to train/test,
  # assuring that lopsided split occurs with prob < 1/fault_factor
  # (N.b. not fully kosher, but useful for quick and dirty exploratory).

  if (is.factor(x) | is.character(x)) {
    min.cell.size = log(fault_factor, 2) + 1
    xfreq <- sort(table(x), dec = T)
    rare_levels <- names(which(xfreq < min.cell.size))
    if (length(rare_levels) == length(unique(x))) {
      warning("all levels are rare and recorded as other. make sure this is desirable")
    }
    if (length(rare_levels) > 0) {
      message("recoding rare levels")
      if (is.factor(x)) {
        altx <- as.character(x)
        altx[altx %in% rare_levels] <- "other"
        x <- as.factor(altx)
        return(x)
      } else {
        # is.character(x)
        x[x %in% rare_levels] <- "other"
        return(x)
      }
    } else {
      message("no rare levels encountered")
      return(x)
    }
  } else {
    message("x is neither a factor nor a character, doing nothing")
    return(x)
  }
}

For example, with data.table, the call would be something like:

dt[, (xcols) := mclapply(.SD, rare_to_other), .SDcol = xcols] # recode rare levels as other

where xcols is a any subset of colnames(dt).

Felicity answered 5/8, 2018 at 7:49 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.