Big matrix to run glmnet()

I am having a problem running glmnet lasso with a wide data set. My data has N = 50 but p > 49000, all factors. To run glmnet I have to create a model.matrix, but I run out of memory when I call model.matrix(formula, data), where formula = class ~ .

As a worked example, I will generate a dataset:

## 50 x 49000 matrix; each row is filled with a random permutation of the letters b..h
data <- matrix(rep(0, 50 * 49000), nrow = 50)
for (i in 1:50) {
    x <- rep(letters[2:8], 7000)
    y <- sample(x = 1:49000, size = 49000)
    data[i, ] <- x[y]
}

data <- as.data.frame(data)

## Response: 20 "A", 15 "B" and 15 "C" labels in random order
x <- c(rep('A', 20), rep('B', 15), rep('C', 15))
y <- sample(x = 1:50, size = 50)
class <- x[y]
data <- cbind(data, class)

After that I tried to create a model.matrix to pass to glmnet:

  formula <- as.formula(class ~ .)
  X = model.matrix(formula, data)
  model <- cv.glmnet(X, class, standardize=FALSE, family='multinomial', alpha=1, nfolds=10)

In the last step (X = model.matrix ...) I run out of memory. What can I do?

Menides answered 10/6, 2013 at 20:29 Comment(3)
Time for more RAM. (Or restart with a minimal set of applications and data.) That's only a 24MB wide object.Clementius
Well, I have only 50 samples. I can't believe that there is no solution!Menides
I did not say there was no solution.Clementius

I asked Professor Trevor Hastie and received the following advice:

"Hello Flavio

model.matrix is killing you. You will have 49K factors, and model.matrix is trying to represent them as contrasts, which will be 6-column matrices, so 49K x 6 ≈ 300K columns. Why not make binary dummy variables (7 per factor), and simply construct this directly without using model.matrix. You could save 1/7th the space by storing this via sparseMatrix (glmnet accepts sparse matrix formats)"

I did exactly that and it worked perfectly. I think this can be useful to others.
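
A minimal sketch of that direct construction, assuming every predictor column of data is a factor sharing the same seven levels (the helper names below are illustrative, not from the email):

library(Matrix)

n   <- nrow(data)
p   <- ncol(data) - 1                 # predictor columns; the last column is the response
lev <- letters[2:8]                   # common level set: 7 dummies per factor
k   <- length(lev)

## Row index of every nonzero entry: each cell of the predictor block becomes a 1
i <- rep(seq_len(n), times = p)
## Column index: offset of the factor plus the level observed in that cell
j <- unlist(lapply(seq_len(p), function(col)
        (col - 1L) * k + as.integer(factor(data[, col], levels = lev))))
X <- sparseMatrix(i = i, j = j, x = 1, dims = c(n, p * k))

Only the n * p nonzero entries are stored, so X stays small even though it has 7p columns.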

An article, with code, that came from this problem: http://www.rmining.net/2014/02/25/genetic-data-large-matrices-glmnet/

In order to avoid broken links, I will post part of the article here:

The problem with the formula approach is that, in general, genomic data has more columns than observations. The data I worked with in that case had 40,000 columns and only 73 observations. To create a small test data set, run the following code:

## Initialize the matrix first (as in the question), then fill each row with a
## random permutation of the letters b..h
data <- matrix(rep(0, 50 * 49000), nrow = 50)
for (i in 1:50) {
    x <- rep(letters[2:8], 7000)
    y <- sample(x = 1:49000, size = 49000)
    data[i, ] <- x[y]
}

data <- as.data.frame(data)
x <- c(rep('A', 20), rep('B', 15), rep('C', 15))
y <- sample(x = 1:50, size = 50)
class <- x[y]
data <- cbind(data, class)

So, with this data set we will try to fit a model with glmnet():

formula <- as.formula(class ~ .)
X <- model.matrix(formula, data)
model <- cv.glmnet(X, class, standardize=FALSE, family='multinomial', alpha=1, nfolds=10)

If you do not have a computer with more RAM than mine, you will probably run out of memory and R will crash. The solution? My first idea was to try sparse.model.matrix(), which creates the model matrix in sparse form from the same formula. Unfortunately it did not work, because even as a sparse matrix the final object is still too big! Interestingly, this dataset occupies only 24 MB of RAM, but after model.matrix() the result is a matrix of more than 1 GB.
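
For reference, the attempted call was simply the sparse counterpart of the one above (same formula; sparse.model.matrix() comes from the Matrix package), and on this data it still produced a very large object:

library(Matrix)
X <- sparse.model.matrix(formula, data)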

The solution I found was to build the matrix by hand: encode the data frame with dummy variables, column by column, and store the result in a sparse matrix. We then use this matrix as input to the model and check that it no longer exhausts memory:

## Build the sparse design matrix by hand, starting with the dummy coding of
## the first predictor column (no intercept)
X <- sparse.model.matrix(~ data[, 1] - 1)

## Loop over the remaining predictor columns (the last column is the response
## `class`, so it is excluded)
for (i in 2:(ncol(data) - 1)) {
    if (nlevels(data[, i]) > 1) {
        ## More than one level: apply dummy coding
        coluna <- sparse.model.matrix(~ data[, i] - 1)
    } else {
        ## Single level: convert the factor to numeric
        coluna <- as.numeric(as.factor(data[, i]))
    }
    X <- cBind(X, coluna)
}

NOTE: Because we are working with a sparse matrix, the Matrix package is required. Also note that the columns are combined with cBind() instead of cbind(). (In current versions of the Matrix package, cBind() is deprecated and plain cbind() dispatches correctly on sparse matrices.)

The matrix generated this way was much smaller: less than 70 MB when I tested it. Fortunately glmnet() supports sparse matrices, so you can run the model:

mod.lasso <- cv.glmnet(X, class, standardize=FALSE, family='multinomial', alpha=1, nfolds=10)
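
A quick sanity check, sketched under the assumption that the objects above are still in the workspace (object.size() values will vary by machine; the predict() call uses the standard cv.glmnet interface):

print(object.size(X), units = "Mb")     # footprint of the sparse design matrix
pred <- predict(mod.lasso, newx = X, s = "lambda.min", type = "class")
table(pred, class)                      # in-sample confusion table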

So you can create models with this type of data without blowing up memory and without using R packages for large datasets such as bigmemory and ff.

Menides answered 17/6, 2013 at 15:53 Comment(5)
you could also try Matrix::sparse.model.matrix or MatrixModels::modelMatrix(*,sparse=TRUE)Yeld
That doesn't work!!! Try it yourself with the example. The object created with sparse.model.matrix is much bigger. I tried that before posting this question.Menides
@FlavioBarros - maybe add a sample of the code that you used? I think that would be helpful.Lipolysis
is this the best answer? Seems not convenient. To avoid broken links, it would be ideal to add the code to the answer.Respondent
@MartínBel, the problem is fixed. Thanks for the suggestion.Menides

To whoever might be interested: I have developed an R package called biglasso that fits lasso-type models with big data. It works with a memory-mapped (big) design matrix based on the bigmemory package, and works seamlessly for data-larger-than-RAM cases. Moreover, it is more computation- and memory-efficient than glmnet, thanks to newly proposed feature screening rules and a better implementation. Please check the GitHub page for details, and feel free to provide any suggestions or comments.
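
A hedged sketch of a typical call (argument names as I recall them from the package documentation; check ?biglasso for your installed version, and note that the design matrix must be a bigmemory big.matrix, which is dense):

library(bigmemory)
library(biglasso)

## X_dense is assumed to be an ordinary numeric predictor matrix and y a
## binary 0/1 response; both names are illustrative, not from this answer.
X.bm  <- as.big.matrix(X_dense)
cvfit <- cv.biglasso(X.bm, y, family = "binomial", nfolds = 10)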

Guipure answered 21/12, 2016 at 5:27 Comment(4)
This thread comes up when you google r glmnet faster. So, good place for this info, even if not beneficial to OP.Privilege
Thanks for making this. It's great.Igniter
but biglasso does not support the AUC evaluation metric.Confirm
@Confirm biglasso supports AUC now, since release 1.4-1 github.com/YaohuiZeng/biglasso/blob/master/…Igniter
