Using LASSO in R with categorical variables
I've got a dataset with 1000 observations and 76 variables, about twenty of which are categorical. I want to use LASSO on this entire dataset. I know that factor variables don't really work in LASSO through either lars or glmnet, but there are too many of them, and they take on too many distinct, unordered values, to reasonably recode them numerically.

Can LASSO be used in this situation? How do I do this? Creating a matrix of the predictors yields this error:

library(lars)

hdy <- as.numeric(housingData2[, 75])
hdx <- as.matrix(housingData2[, -75])
model.lasso <- lars(hdx, hdy)
# Error in one %*% x : requires numeric/complex matrix/vector arguments

I realize that other methods may be easier or more appropriate, but the challenge is actually to do this using lars or glmnet, so if it's possible, I would appreciate any ideas or feedback.

Thank you,

Nysa asked 21/10, 2017 at 17:09 Comment(5)
Create your predictor matrix using model.matrix, which will recode your factor variables as dummy variables. You may also want to look at the group lasso. — Assentation
So, using hdx <- model.matrix(~ ., data = xdata, contrasts.arg = sapply(xdata, is.factor)) I am able to make that work, but then plugging the result into lars() gives me the error "Error in if (any(nosignal)) { : missing value where TRUE/FALSE needed". I don't know where if (any(nosignal)) comes from; it's not any code I intentionally ran. I'm not entirely familiar with the inner workings of the LASSO, so sorry. — Nysa
Good so far, but we would need a reproducible example to help you further. Using lars(x = x_train, y = df$var5) with the example below seems to work fine. Do you have NA values in your input data? — Pristine
Yes, there are many NAs. When I use what Flo.P did (thank you, by the way; that makes total sense) and adapt it to my data, I get the error "Error in glmnet(x, y, weights = weights, offset = offset, lambda = lambda, : number of observations in y (1000) not equal to the number of rows of x (0)", and when I do lars(x = x_train, y = housingData2$SalePrice) I get the same TRUE/FALSE error. — Nysa
Flo.P's approach is best; for further reading see users.stat.umn.edu/~zouxx019/Papers/gglasso-paper.pdf. Just to clarify, the "groups" argument fed to gglasso refers to the groups of dummy variables, i.e., which sets of dummy variables were once a single factor. This is important because it makes little sense to have a single dummy variable included in your model if the rest of its factor's dummies aren't. — Usufruct
The other answers here point out ways to recode your categorical factors as dummies. Depending on your application, that may not be a great solution. If all you care about is prediction, it's probably fine, and the approach provided by Flo.P should be okay: the LASSO will find you a useful set of variables, and you probably won't overfit.

However, if you're interested in interpreting your model or discussing which factors are important after the fact, you're in a weird spot. The default coding that model.matrix produces has a very specific interpretation when the dummies are taken by themselves. model.matrix uses what is referred to as "dummy coding" (I remember learning it as "reference coding"; see here for a summary). That means that if one of these dummies is included, your model now has a parameter whose interpretation is "the difference between one level of this factor and an arbitrarily chosen reference level of that factor", and maybe none of the other dummies for that factor were selected. You may also find that if the ordering of your factor levels changes, you end up with a different model.
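To see what that means concretely, here is a toy sketch (the factor f is made up purely for illustration) of how model.matrix dummy-codes a factor, and how the reference level is just whichever level happens to come first:

f <- factor(c("a", "b", "c", "a"))
model.matrix(~ f)    # columns fb, fc: each dummy is "difference from reference level a"

f2 <- relevel(f, ref = "b")   # make "b" the reference level instead
model.matrix(~ f2)   # same data, different parameterization: dummies vs. level b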

There are ways to deal with this, but rather than kludge something together, I'd try the group lasso. Building on Flo.P's code above:

install.packages("gglasso")
library(gglasso)


create_factor <- function(nb_lvl, n= 100 ){
  factor(sample(letters[1:nb_lvl],n, replace = TRUE))}

df <- data.frame(var1 = create_factor(5), 
                 var2 = create_factor(5), 
                 var3 = create_factor(5), 
                 var4 = create_factor(5),
                 var5 = rnorm(100),
                 y = rnorm(100))

y <- df$y
x <- model.matrix( ~ ., dplyr::select(df, -y))[, -1]
groups <- c(rep(1:4, each = 4), 5)
fit <- gglasso(x = x, y = y, group = groups, lambda = 1)
fit$beta

Since we didn't specify any relationship between our factors (var1, var2, etc.) and y, the LASSO does a good job and sets all coefficients to 0, except when only a minimal amount of regularization is applied. You can play around with values for lambda (a tuning parameter) or just leave the option blank, and the function will pick a range for you.
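If you'd rather not fix lambda by hand, the package also provides cross-validation. A minimal sketch, assuming the simulated x, y, and groups from above (pred.loss = "L2" is the squared-error CV loss matching the least-squares model):

# Pick lambda by 5-fold cross-validation over gglasso's automatic lambda path
cv_fit <- cv.gglasso(x = x, y = y, group = groups,
                     loss = "ls", pred.loss = "L2", nfolds = 5)
cv_fit$lambda.min                                  # lambda with the smallest CV error
coef(cv_fit$gglasso.fit, s = cv_fit$lambda.min)    # coefficients at that lambda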

Denis answered 29/6, 2018 at 18:7 Comment(1)
Can you help explain why var1a, var2a, etc. are missing? This produces some strange answers. For example, I set some of the vars to have a higher mean and still get all 0 coefficients: modify y systematically (e.g., add 15 to each value where var1 == "a") and your coefficient estimates don't change from 0. This does not seem right at all. Is there a bug in this code? — Aludel
You can make dummy variables from your factors using model.matrix.

I create a data.frame. y is the target variable.

create_factor <- function(nb_lvl, n = 100) {
  factor(sample(letters[1:nb_lvl], n, replace = TRUE))
}

df <- data.frame(var1 = create_factor(5),
                 var2 = create_factor(5),
                 var3 = create_factor(5),
                 var4 = create_factor(5),
                 var5 = rnorm(100),
                 y = create_factor(2))

#   var1 var2 var3 var4        var5 y
# 1    a    c    c    b -0.58655607 b
# 2    d    a    e    a  0.52151994 a
# 3    a    b    d    a -0.04792142 b
# 4    d    a    a    d -0.41754957 b
# 5    a    d    e    e -0.29887004 a

Select all the factor variables. I use dplyr::select_if, then paste the variable names together to build a formula string like y ~ var1 + var2 + var3 + var4.

library(dplyr)
library(stringr)
library(glmnet)

# Collapse the factor column names into "var1+var2+var3+var4"
vars_name <- df %>%
  select(-y) %>%
  select_if(is.factor) %>%
  colnames() %>%
  str_c(collapse = "+")

model_string <- paste("y ~", vars_name)

Create the dummy variables with model.matrix. Don't forget as.formula to coerce the character string to a formula.

x_train <- model.matrix(as.formula(model_string), df)

Fit your model.

lasso_model <- cv.glmnet(x = x_train, y = df$y, family = "binomial",
                         alpha = 1, nfolds = 10)

The code could be simplified, but the idea is there.
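For instance, one possible shortening (a sketch reusing the simulated df above; note that y ~ . also pulls in the numeric var5, unlike the factor-only formula built earlier):

# Let the formula expand all predictors at once; [, -1] drops the intercept column
x_train2 <- model.matrix(y ~ ., df)[, -1]
lasso_model2 <- cv.glmnet(x = x_train2, y = df$y, family = "binomial",
                          alpha = 1, nfolds = 10)
coef(lasso_model2, s = "lambda.min")   # coefficients at the CV-chosen lambda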

Matteo answered 21/10, 2017 at 20:17 Comment(2)
So this all works up until the last part. When I do that, I get the error "Error in glmnet(x, y, weights = weights, offset = offset, lambda = lambda, : number of observations in y (1000) not equal to the number of rows of x (0)", which makes sense when I look at it, because x_train appears to be a matrix of num[0, 1:128]. Is that right? — Nysa
OK, so all your rows have at least one NA. You need to handle your missing values by imputing them. Maybe you have some columns with a lot of NAs that you can remove. When you have a dataset with enough complete rows, it may work with lasso_model <- cv.glmnet(x = x_train, y = na.omit(df$y), family = "binomial", alpha = 1, nfolds = 10) (I added na.omit around df$y). — Matteo
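As an aside, a pattern that keeps x and y aligned by construction is to take complete cases before building the design matrix. A sketch reusing this answer's df and model_string (for the original question it would be housingData2 and its own formula):

# Keep only rows with no NAs anywhere, so x and y end up with matching rows
df_cc <- df[complete.cases(df), ]
x_cc <- model.matrix(as.formula(model_string), df_cc)
lasso_model <- cv.glmnet(x = x_cc, y = df_cc$y, family = "binomial",
                         alpha = 1, nfolds = 10)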
