Some modeling functions, e.g. glmnet(), require (or simply allow) the data to be passed in as a predictor matrix and a response matrix (or vector), as opposed to using a formula. In these cases, the predict() method, e.g. predict.glmnet(), typically requires that the newdata argument provide a predictor matrix with the same features as were used to train the model.
A convenient way to create a predictor matrix when your data frame has factors (R's categorical data type) is the model.matrix() function, which automatically creates dummy features for the categorical variables:
# this is the data frame and matrix I want to use to train the model
set.seed(1)
df <- data.frame(x1 = factor(sample(LETTERS[1:5], 20, replace = TRUE)),
                 x2 = rnorm(20, 100, 5),
                 x3 = factor(sample(c("U", "L"), 20, replace = TRUE)),
                 y  = rnorm(20, 10, 2))
mm <- model.matrix(y ~ ., data = df)
But when I introduce a data frame of new observations that contains only a subset of the factor levels from the original data frame, model.matrix() (predictably) returns a matrix with different dummy features. This new matrix cannot be used in predict.glmnet() because it doesn't have the features the model is expecting:
# this is the data frame and matrix I want to predict on
set.seed(1)
df_new <- data.frame(x1 = c("B", "C"),
                     x2 = rnorm(2, 100, 5),
                     x3 = c("L", "U"))
mm_new <- model.matrix(~ ., data = df_new)
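To make the mismatch concrete, here is a self-contained snippet (the setup is repeated so it runs on its own) that compares the column names of the two matrices:

```r
# setup repeated from above so this snippet is reproducible on its own
set.seed(1)
df <- data.frame(x1 = factor(sample(LETTERS[1:5], 20, replace = TRUE)),
                 x2 = rnorm(20, 100, 5),
                 x3 = factor(sample(c("U", "L"), 20, replace = TRUE)),
                 y  = rnorm(20, 10, 2))
mm <- model.matrix(y ~ ., data = df)

set.seed(1)
df_new <- data.frame(x1 = c("B", "C"),
                     x2 = rnorm(2, 100, 5),
                     x3 = c("L", "U"))
mm_new <- model.matrix(~ ., data = df_new)

# dummy columns the model expects but mm_new doesn't provide
setdiff(colnames(mm), colnames(mm_new))
```

The setdiff() is non-empty because model.matrix() only sees the levels present in df_new, so it builds fewer dummy columns than the training matrix has.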
Is there a way to save the transformation (creating all the necessary dummy features) from a data frame to a model matrix, so that I can re-apply it to future observations? In my example above, this would ideally result in mm_new having feature names identical to those of mm, so that predict() can accept mm_new.
I want to add that I'm aware of this approach, which essentially suggests including the observations from df_new in df before calling model.matrix(). This works fine if I have all the observations to begin with and am just training and testing models. However, the new observations will only become available in the future (in a production prediction pipeline), and I want to avoid the overhead of re-loading the entire training data frame for every new prediction.
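For reference, here is a minimal sketch of the rbind-based workaround I'd like to avoid (self-contained; it rebuilds the combined data at prediction time, which is exactly the overhead in question):

```r
# setup repeated from above so this snippet is reproducible on its own
set.seed(1)
df <- data.frame(x1 = factor(sample(LETTERS[1:5], 20, replace = TRUE)),
                 x2 = rnorm(20, 100, 5),
                 x3 = factor(sample(c("U", "L"), 20, replace = TRUE)),
                 y  = rnorm(20, 10, 2))
mm <- model.matrix(y ~ ., data = df)

set.seed(1)
df_new <- data.frame(x1 = c("B", "C"),
                     x2 = rnorm(2, 100, 5),
                     x3 = c("L", "U"))

# combine old predictors with the new rows (factor levels merge here),
# build one design matrix, then keep only the new rows
df_all <- rbind(df[names(df) != "y"], df_new)
mm_all <- model.matrix(~ ., data = df_all)
mm_new_full <- mm_all[-seq_len(nrow(df)), , drop = FALSE]

colnames(mm_new_full)  # now matches colnames(mm)
```

This gives mm_new_full the same columns as mm, but only because the entire training data frame df is reloaded and re-processed for every batch of predictions.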