The other answers here point out ways to re-code your categorical factors as dummies. Depending on your application, that may not be a great solution. If all you care about is prediction, it is probably fine, and the approach provided by Flo.P should be okay. LASSO will find you a useful set of variables, and you probably won't overfit.
However, if you're interested in interpreting your model or discussing which factors are important after the fact, you're in a weird spot. The default coding that model.matrix uses gives each column a very specific interpretation when taken by itself. model.matrix uses what is referred to as "dummy coding", which R calls treatment contrasts. (I remember learning it as "reference coding"; see here for a summary.) That means that if one of these dummies is included, your model now has a parameter whose interpretation is "the difference between one level of this factor and an arbitrarily chosen other level of that factor" (the reference level, which by default is the first level). And maybe none of the other dummies for that factor were selected. You may also find that if the ordering of your factor levels changes, the reference level changes with it and you end up with a different model.
There are ways to deal with this, but rather than kludge something together, I'd try the group lasso, which selects or drops all the dummies of a factor together. Building on Flo.P's code above:
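To make the point about reference levels concrete, here is a minimal base R sketch. The toy data frame d and the names f, f2, mm_default and so on are made up for illustration and are not part of the question's data. It shows that re-ordering the levels changes which dummies exist and what their coefficients mean.

```r
# Toy data (hypothetical): one three-level factor and a numeric response.
d <- data.frame(
  f = factor(c("a", "b", "c", "a", "b", "c")),
  y = c(1, 2, 4, 1, 2, 4)
)

# Default treatment (dummy) coding: the first level, "a", is the reference.
mm_default <- model.matrix(~ f, d)
colnames(mm_default)    # "(Intercept)" "fb" "fc"

# Same data with "c" as the reference level: different dummy columns.
d$f2 <- relevel(d$f, ref = "c")
mm_releveled <- model.matrix(~ f2, d)
colnames(mm_releveled)  # "(Intercept)" "f2a" "f2b"

# The coefficients are differences from whichever level is the reference.
fit_default   <- lm(y ~ f, d)   # fb = b - a = 1,   fc = c - a = 3
fit_releveled <- lm(y ~ f2, d)  # f2a = a - c = -3, f2b = b - c = -2
coef(fit_default)
coef(fit_releveled)
```

If a plain lasso kept only fc in the first coding, the model would say "c differs from a, and b is treated like a"; under the second coding there is no single column that expresses the same thing.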
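# install.packages("gglasso")  # run once if the package is not installed
library(gglasso)

set.seed(42)  # make the simulated data reproducible

# Simulate a factor with nb_lvl levels and n observations
create_factor <- function(nb_lvl, n = 100) {
  factor(sample(letters[1:nb_lvl], n, replace = TRUE))
}

df <- data.frame(var1 = create_factor(5),
                 var2 = create_factor(5),
                 var3 = create_factor(5),
                 var4 = create_factor(5),
                 var5 = rnorm(100),
                 y = rnorm(100))

y <- df$y
# Dummy-code the predictors (everything except y) and drop the intercept column
x <- model.matrix(~ ., df[, names(df) != "y"])[, -1]
# Each five-level factor yields four dummies (groups 1 to 4); var5 is group 5
groups <- c(rep(1:4, each = 4), 5)
fit <- gglasso(x = x, y = y, group = groups, lambda = 1)
fit$beta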
Since we didn't specify any relationship between our predictors (var1, var2, etc.) and y, the group lasso does a good job here: it sets all coefficients to 0 except when the minimum amount of regularization is applied. You can play around with values for lambda (the tuning parameter), or just leave the argument out and the function will pick a sequence of values for you.
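As a rough sketch of what tuning lambda can look like, the example below simulates data in which y really does depend on one factor, fits the whole lambda path, and then picks a value by cross-validation with cv.gglasso. The simulated effects and every name ending in _sim are assumptions made for illustration; the group vector is read from the "assign" attribute of the model matrix rather than typed by hand.

```r
library(gglasso)

set.seed(1)
n <- 100

# Hypothetical data: two five-level factors and one numeric predictor.
sim <- data.frame(
  var1 = factor(sample(letters[1:5], n, replace = TRUE)),
  var2 = factor(sample(letters[1:5], n, replace = TRUE)),
  var5 = rnorm(n)
)

# Assumption for illustration: y depends on var1 only.
effects_sim <- c(a = -2, b = -1, c = 0, d = 1, e = 2)
y_sim <- unname(effects_sim[as.character(sim$var1)]) + rnorm(n)

# "assign" maps each dummy column to its source variable, which is exactly
# the grouping gglasso needs. Grab it before dropping the intercept column.
mm_sim <- model.matrix(~ ., sim)
groups_sim <- attr(mm_sim, "assign")[-1]
x_sim <- mm_sim[, -1]

# No lambda given: gglasso chooses its own decreasing sequence.
path_sim <- gglasso(x = x_sim, y = y_sim, group = groups_sim, loss = "ls")

# Nonzero coefficients per group (rows) at each lambda (columns).
active_sim <- apply(path_sim$beta != 0, 2,
                    function(nz) tapply(nz, groups_sim, sum))

# Choose lambda by 5-fold cross-validation on squared error.
cv_sim <- cv.gglasso(x = x_sim, y = y_sim, group = groups_sim,
                     loss = "ls", pred.loss = "L2", nfolds = 5)
cv_sim$lambda.min
coef(cv_sim, s = "lambda.min")
```

Looking at active_sim is a quick way to see the group behaviour: a factor's dummies should enter or leave the model together as lambda changes, rather than one level at a time.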
model.matrix, which will recode your factor variables using dummy variables. You may also want to look at the group lasso – Assentation
lars(x=x_train,y=df$var5) with the example below seems to work fine. Do you have NA values in your input data? – Pristine