Some modeling functions, e.g. glmnet(), require (or simply allow) the data to be passed in as a predictor matrix and a response matrix (or vector), as opposed to using a formula. In these cases, the predict() method, e.g. predict.glmnet(), typically requires that the newdata argument provide a predictor matrix with the same features as were used to train the model.
A convenient way to create a predictor matrix when your data frame has factors (R's categorical data type) is the model.matrix() function, which automatically creates dummy features for the categorical variables:
# this is the data frame and matrix I want to use to train the model
set.seed(1)
df <- data.frame(x1 = factor(sample(LETTERS[1:5], 20, replace = TRUE)),
                 x2 = rnorm(20, 100, 5),
                 x3 = factor(sample(c("U", "L"), 20, replace = TRUE)),
                 y  = rnorm(20, 10, 2))
mm <- model.matrix(y ~ ., data = df)
But when I introduce a data frame of new observations that contains only a subset of the factor levels from the original data frame, model.matrix() (predictably) returns a matrix with different dummy features. This new matrix cannot be used in predict.glmnet() because it doesn't have the features the model is expecting:
# this is the data frame and matrix I want to predict on
set.seed(1)
df_new <- data.frame(x1 = c("B", "C"),
                     x2 = rnorm(2, 100, 5),
                     x3 = c("L", "U"))
mm_new <- model.matrix(~ ., data = df_new)
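To make the mismatch concrete, here is a self-contained snippet (the setup is repeated so it runs on its own) that compares the column names of the two matrices:

```r
# setup repeated from above so this snippet is reproducible on its own
set.seed(1)
df <- data.frame(x1 = factor(sample(LETTERS[1:5], 20, replace = TRUE)),
                 x2 = rnorm(20, 100, 5),
                 x3 = factor(sample(c("U", "L"), 20, replace = TRUE)),
                 y  = rnorm(20, 10, 2))
mm <- model.matrix(y ~ ., data = df)

set.seed(1)
df_new <- data.frame(x1 = c("B", "C"),
                     x2 = rnorm(2, 100, 5),
                     x3 = c("L", "U"))
mm_new <- model.matrix(~ ., data = df_new)

# dummy columns the model expects but mm_new doesn't provide
setdiff(colnames(mm), colnames(mm_new))
```

The setdiff() is non-empty because model.matrix() only sees the levels present in df_new, so it builds fewer dummy columns than the training matrix has.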
Is there a way to save the transformation (creating all the necessary dummy features) from a data frame to a model matrix, so that I can re-apply it to future observations? In my example above, this would ideally result in mm_new having feature names identical to those of mm, so that predict() can accept mm_new.
I want to add that I'm aware of this approach, which essentially suggests including the observations from df_new in df before calling model.matrix(). This works fine if I have all the observations to begin with and am just training and testing models. However, the new observations will only become available in the future (in a production prediction pipeline), and I want to avoid the overhead of re-loading the entire training data frame for every new prediction.
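For reference, here is a minimal sketch of the rbind-based workaround I'd like to avoid (self-contained; it rebuilds the combined data at prediction time, which is exactly the overhead in question):

```r
# setup repeated from above so this snippet is reproducible on its own
set.seed(1)
df <- data.frame(x1 = factor(sample(LETTERS[1:5], 20, replace = TRUE)),
                 x2 = rnorm(20, 100, 5),
                 x3 = factor(sample(c("U", "L"), 20, replace = TRUE)),
                 y  = rnorm(20, 10, 2))
mm <- model.matrix(y ~ ., data = df)

set.seed(1)
df_new <- data.frame(x1 = c("B", "C"),
                     x2 = rnorm(2, 100, 5),
                     x3 = c("L", "U"))

# combine old predictors with the new rows (factor levels merge here),
# build one design matrix, then keep only the new rows
df_all <- rbind(df[names(df) != "y"], df_new)
mm_all <- model.matrix(~ ., data = df_all)
mm_new_full <- mm_all[-seq_len(nrow(df)), , drop = FALSE]

colnames(mm_new_full)  # now matches colnames(mm)
```

This gives mm_new_full the same columns as mm, but only because the entire training data frame df is reloaded and re-processed for every batch of predictions.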