predict.glmnet: Some Factors Have Only One Level in New Data

I've trained an elastic net model in R using glmnet and would like to use it to make predictions on a new data set.

But I'm having trouble producing the matrix to pass to predict(), because some of my factor variables (dummy variables indicating the presence of comorbidities) have only one level in the new data set (the comorbidities were never observed). That means I can't use

model.matrix(RESPONSE ~ ., new_data)

because it gives me the (expected)

Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) : contrasts can be applied only to factors with 2 or more levels

I'm at a loss for how to get around this issue. Is there a way in R that I can construct an appropriate matrix for use in predict() in this situation, or do I need to prepare the matrix outside of R? In either case, how might I go about doing it?

Here is a toy example that reproduces the issue I'm having:

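# Simulate training data: one continuous predictor, two binary factors, a binary outcome
x1 <- rnorm(100)
x2 <- as.factor(rbinom(100, 1, 0.6))
x3 <- as.factor(rbinom(100, 1, 0.4))
y <- rbinom(100, 1, 0.2)

toy_data <- data.frame(x1, x2, x3, y)
colnames(toy_data) <- c("Continuous", "FactorA", "FactorB", "Outcome")

# Design matrix (intercept column dropped) and response for glmnet
mat1 <- model.matrix(Outcome ~ ., toy_data)[, -1]
y1 <- toy_data$Outcome

# New data in which FactorB has only one level
new_data <- toy_data
new_data$FactorB <- as.factor(0)

#summary(new_data) # Just to verify that FactorB now only contains one level

# This call fails with the contrasts error
mat2 <- model.matrix(Outcome ~ ., new_data)[, -1]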
Goddord answered 21/8, 2018 at 15:58 Comment(3)
It is better practice to keep track of all the categorical variables in the training data set. In the new data, encode those variables the same way model.matrix does (dummy variable creation), then use the result for prediction. Doing so lets you predict even for a single record. - Tarbes
You can set the levels of your factor to match the levels in the training data before building the model matrix, like levels(new_data$FactorB) <- levels(toy_data$FactorB) - Outfoot
@Outfoot That seems to have fixed the issue! If you have time, post as an answer and I'll upvote/accept. - Goddord

You can set the levels of the factor in the new data to match its levels in the training data (toy_data in your example). A factor can list a value among its levels even when that value never occurs in the data.

You can do this with the levels argument in factor():

new_data$FactorB <- factor(new_data$FactorB, levels = levels(toy_data$FactorB))

Or by using the levels() function with assignment:

levels(new_data$FactorB) <- levels(toy_data$FactorB)
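
One caveat worth illustrating: assigning to levels() relabels the existing levels by position, whereas factor(x, levels = ...) matches by value. In the toy example both give the same result because the only observed level is "0", which is also the first training level. The sketch below uses made-up vectors (train_levels, x) to show what could happen if the only observed level were "1" instead.

```r
# Hypothetical data: training levels are "0" and "1", but only "1" is observed
train_levels <- c("0", "1")
x <- factor(c("1", "1", "1"))

# levels<- relabels by position: the single existing level "1" becomes "0"
by_position <- x
levels(by_position) <- train_levels

# factor(..., levels =) matches by value: the observations stay "1"
by_value <- factor(x, levels = train_levels)

as.character(by_position)  # "0" "0" "0"
as.character(by_value)     # "1" "1" "1"
```

For that reason the factor() form is probably the safer default when you don't know which level survived in the new data.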

Using either approach, model.matrix() works properly once the factor has more than one level:

head( model.matrix(Outcome ~ ., new_data)[,-1] )
   Continuous FactorA1 FactorB1
1 -1.91632972        0        0
2  1.11411267        0        0
3 -1.21333837        1        0
4 -0.06311276        0        0
5  1.31599915        0        0
6  0.36374591        1        0
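
If several factors are affected, the same idea can be applied to every factor column at once. The following is only a sketch: align_factor_levels() is a hypothetical helper (not part of glmnet or base R), and the small train and new_obs data frames are made up for illustration. It assumes the new data uses the same column names as the training data and carries a placeholder Outcome column so the same formula can be reused.

```r
# Hypothetical helper: re-encode each factor column of new_df using the
# levels found in the matching column of train_df (matching by value).
align_factor_levels <- function(new_df, train_df) {
  factor_cols <- names(train_df)[vapply(train_df, is.factor, logical(1))]
  for (col in intersect(factor_cols, names(new_df))) {
    new_df[[col]] <- factor(as.character(new_df[[col]]),
                            levels = levels(train_df[[col]]))
  }
  new_df
}

# Made-up training data with two binary factors
train <- data.frame(
  Continuous = c(0.5, -1.2, 0.3, 2.0),
  FactorA = factor(c("0", "1", "1", "0")),
  FactorB = factor(c("0", "1", "0", "1")),
  Outcome = c(0, 1, 0, 0)
)

# Made-up new data in which FactorB has a single observed level
new_obs <- data.frame(
  Continuous = c(1.1, -0.4),
  FactorA = factor(c("1", "0")),
  FactorB = factor(c("0", "0")),
  Outcome = c(0, 0)  # placeholder so the formula still resolves
)

train_mat <- model.matrix(Outcome ~ ., train)[, -1]
new_mat <- model.matrix(Outcome ~ ., align_factor_levels(new_obs, train))[, -1]

# The columns should line up with the matrix the model was trained on
stopifnot(identical(colnames(train_mat), colnames(new_mat)))
```

Values in the new data that never appeared in training become NA under this re-encoding, so it is worth checking for NAs before passing the matrix to predict().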
Outfoot answered 21/8, 2018 at 16:59 Comment(0)
