I want to perform penalty selection for the LASSO algorithm and predict outcomes using tidymodels
. I will use the Boston housing dataset to illustrate the problem.
library(tidymodels)
library(tidyverse)
library(mlbench)
data("BostonHousing")
dt <- BostonHousing
I first split the dataset into train/test subsets.
dt_split <- initial_split(dt)
dt_train <- training(dt_split)
dt_test <- testing(dt_split)
Defining preprocessing using the recipe
package.
rec <- recipe(medv ~ ., data = dt_train) %>%
step_center(all_predictors(), -all_nominal()) %>%
step_dummy(all_nominal()) %>%
prep()
Initialization of the model and the workflow. I use the glmnet
engine. mixture = 1
means that I choose the LASSO penalty and penalty = tune()
means that I will later use cross-validation to choose the best penalty parameter lambda
.
lasso_mod <- linear_reg(mode = "regression",
penalty = tune(),
mixture = 1) %>%
set_engine("glmnet")
wf <- workflow() %>%
add_model(lasso_mod) %>%
add_recipe(rec)
Preparing stratified 5-fold cross-validation and the penalty grid:
folds <- rsample::vfold_cv(dt_train, v = 5, strata = medv, nbreaks = 5)
my_grid <- tibble(penalty = 10^seq(-2, -1, length.out = 10))
Let us run the cross-validation:
my_res <- wf %>%
tune_grid(resamples = folds,
grid = my_grid,
control = control_grid(verbose = FALSE, save_pred = TRUE),
metrics = metric_set(rmse))
I'm now able to get the best penalty from the grid and update my workflow for this optimal penalty:
best_mod <- my_res %>% select_best("rmse")
print(best_mod)
final_wf <- finalize_workflow(wf, best_mod)
print(final_wf)
== Workflow ===================================================================================================================
Preprocessor: Recipe
Model: linear_reg()
-- Preprocessor ---------------------------------------------------------------------------------------------------------------
2 Recipe Steps
* step_center()
* step_dummy()
-- Model ----------------------------------------------------------------------------------------------------------------------
Linear Regression Model Specification (regression)
Main Arguments:
penalty = 0.0278255940220712
mixture = 1
Computational engine: glmnet
So far so good. Now I want to apply the workflow to the training data to get my final model:
final_mod <- fit(final_wf, data = dt_train) %>%
pull_workflow_fit()
Now here is the issue.
final_mod$fit
is a elnet
and glmnet
object. It contains the full regularization path on a grid of 75 values of the penalty parameter. Hence the previous penalty tuning step is pretty much useless. So the prediction step fails:
predict(final_mod, new_data = dt)
returns an error:
Error in cbind2(1, newx) %*% nbeta :
invalid class 'NA' to dup_mMatrix_as_dgeMatrix
Of course I could use glmnet::cv.glmnet
to get the best penalty and then use the method predict.cv.glmnet
but I need a generic workflow able to work with multiple machine learning models using the same interface. In the documentation of parsnip::linear_reg
there is this note regarding the glmnet engine:
For glmnet models, the full regularization path is always fit regardless of the value given to penalty. Also, there is the option to pass multiple values (or no values) to the penalty argument. When using the predict() method in these cases, the return value depends on the value of penalty. When using predict(), only a single value of the penalty can be used. When predicting on multiple penalties, the multi_predict() function can be used. It returns a tibble with a list column called .pred that contains a tibble with all of the penalty results.
However, I do not understand how I should proceed to get the predictions of the tuned LASSO model using the tidymodels
framework. The multi_predict
function throws the same error as predict
.
predict(final_mod, new_data = dt)
. – Crushing