How to resolve integer overflow errors in R estimation

I'm trying to estimate a model using speedglm in R. The dataset is large (~69.88 million rows and 38 columns). Multiplying the number of rows by the number of columns gives ~2.7 billion, which exceeds R's integer limit of 2,147,483,647. I can't provide the data, but the following examples recreate the issue.
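For reference, a quick check of the sizes involved (the row count is approximate):

.Machine$integer.max   # 2147483647, the largest value an R integer can hold
69.88e6 * 38           # ~2.66e9 as a double, well beyond the integer limit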

library(speedglm)

# large example that works 
require(biglm)
n <- 500000
k <- 500
y <- rgamma(n, 1.5, 1)
x <- round(matrix(rnorm(n*k), n, k), digits = 3)
colnames(x) <- paste("s", 1:k, sep = "")
da <- data.frame(y, x)
fo <- as.formula(paste("y~", paste(paste("s", 1:k, sep = ""), collapse = "+")))   
working.example <- speedglm(fo, data = da, family = Gamma(log))

# repeat with large enough size to break 
k <- 5000       # 10 times larger than above
x <- round(matrix(rnorm(n*k), n, k), digits = 3)
colnames(x) <- paste("s", 1:k, sep = "")
da <- data.frame(y, x)
fo <- as.formula(paste("y~", paste(paste("s", 1:k, sep = ""), collapse = "+")))   
failed.example <- speedglm(fo, data = da, family = Gamma(log))

# attempting to resolve error with chunksize
attempted.fixed.example <- speedglm(fo, data = da, family = Gamma(log), chunksize = 10^6)

This produces the following error and integer-overflow warning:

Error in if (!replace && is.null(prob) && n > 1e+07 && size <= n/2) .Internal(sample2(n,  :  
  missing value where TRUE/FALSE needed
In addition: Warning message:
In nrow(X) * ncol(X) : NAs produced by integer overflow 

I understand the warning, but I do not understand the error. They seem to be related, since they appear together after every attempt.

Removing columns allows the estimation to complete, and it does not seem to matter which columns are removed: dropping either interacted or non-interacted variables lets the estimation finish. The chunksize option was added after the error first appeared, but it has not helped.

My questions are: (1) What causes the first error? (2) Is there a way to estimate models where the number of rows times the number of columns exceeds the integer limit? (3) Is there a better na.action to use in this case?

Thanks,

JP.

Running: R version 3.3.3 (2017-03-06)

Actual code below:

dft_var <- c("cltvV0", "cltvV60", "cltvV120", "VCFLBRQ", "ageV0", 
             "ageV1", "ageV8", "ageV80", "FICOV300", "FICOV650", 
             "FICOV900", "SingleHouse", "Apt", "Mobile", "Duplex", 
             "Row", "Modular", "Rural", "FirstTimeBuyer", 
             "FirstTimeBuyerMissing", "brwtotinMissing", "IncomeRatio", 
             "VintageBefore2001", "NFLD", "yoy.fcpwti:province_n") 
logit1 <- speedglm(formula = paste("DefaultFlag ~ ", 
                                   paste(dft_var, collapse = "+"), 
                                   sep = ""), 
                   family = binomial(logit), 
                   na.action = na.exclude, 
                   data = default.data,
                   chunksize = 1*10^7)
Pentangular answered 6/6, 2017 at 19:56

Update:

Based on my investigation below, @James figured out that the problem can be avoided by supplying a non-NULL value for the sparse parameter in the call to speedglm, since that prevents the internal call to is.sparse.

Using the example above, the following should now work:

speedglm(fo, data = da, family = Gamma(log), sparse = FALSE)

My original answer:

Both the warning and the error come from the same line in the function is.sparse in the package speedglm.

The line is:

sample(X,round((nrow(X)*ncol(X)*camp),digits=0),replace=FALSE)

The warning happens because of the use of nrow(X)*ncol(X) on a large matrix. The nrow and ncol functions return integers, and their product can overflow. Here is an illustration:

nr = 1000000L
nc = 1000000L
nr*nc
# [1] NA
# Warning message:
# In nr * nc : NAs produced by integer overflow
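
As suggested in the comments below, the overflow itself is easy to avoid, e.g. by doing the multiplication in double precision or by using length(X) instead; this is only a sketch of a possible fix, not something the released speedglm does:

as.numeric(nr) * nc       # 1e+12: coercing one operand to double avoids the overflow
length(matrix(0, 3, 5))   # length() returns the element count without multiplying dimensions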

The error happens because the NA produced by that overflow is then passed to sample as the size argument; with that many elements to sample from, sample hits an if condition it cannot evaluate with size = NA. Here is an illustration:

sample(matrix(1,3000,1000000), NA, replace=FALSE)
# Error in if (useHash) .Internal(sample2(n, size)) else .Internal(sample(n,  : 
# missing value where TRUE/FALSE needed
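
The same error can be reproduced without allocating a multi-gigabyte matrix. Judging by the condition in the original traceback, the failing branch is only reached when there are more than 1e7 items to sample from, so a vector just over that size with size = NA is enough (a minimal sketch):

sample(seq_len(2e7), NA, replace = FALSE)
# fails with the same error: missing value where TRUE/FALSE needed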
Leaf answered 6/6, 2017 at 20:25
Thanks for the response @Andrey. I think I understand the issue, but I'm unsure how to resolve it. Are you saying that if I bypass the sample command in speedglm it should work? – Pentangular
I think it's time to bring this to the attention of the authors of the package. They should try to work around R's limitations and make it work for large data like yours. – Leaf
On the other hand, the fix may be as easy as replacing nrow(X)*ncol(X) with length(X). I do not know whether you would run into other issues once this one is solved. – Leaf
I just emailed the package maintainer pointing him to this thread. I submitted a response below that solves the issue. Not sure if it's proper etiquette to do so, so if you'd like I can edit your response with the answer and accept it. – Pentangular
I've updated my answer to include your findings. I would appreciate it if you accept it. I think it's worth keeping your answer as well, as it confirms your contribution to solving the problem. – Leaf

Thanks to @Andrey's guidance I was able to solve the problem. The issue was the sample call inside the is.sparse check. To bypass it, I set sparse = FALSE in the speedglm call (this should work with sparse = TRUE as well, though I haven't tried it). This works because speedglm calls is.sparse via speedglm.wfit in the following way:

if (is.null(sparse))
    sparse <- is.sparse(x = x, sparsellim, camp)

So explicitly setting sparse skips the call to is.sparse.

Using the example above, the following should now work:

speedglm(fo, data = da, family = Gamma(log), sparse = FALSE)
Pentangular answered 7/6, 2017 at 13:24
