Large Matrices in R: long vectors not supported yet

4

22

I am running 64-bit R 3.1 in a 64-bit Ubuntu environment with 400GB of RAM, and I am encountering a strange limitation when dealing with large matrices.

I have a numeric matrix called A, that is 4000 rows by 950,000 columns. When I try to access any element in it, I receive the following error:

Error: long vectors not supported yet: subset.c:733

Although my matrix was read in via scan, you can reproduce the error with the following code:

test <- matrix(1,4000,900000) #no error
test[1,1] #error

My Googling reveals this was a common error message prior to R 3.0, where 2^31-1 elements was the limit on vector length. However, that limit should not apply here, given my R version and environment.

Should I not be using the native matrix type for this kind of matrix?

Putandtake answered 20/6, 2014 at 21:10 Comment(9)
"There is some support for matrices and arrays with each dimension less than 2^31 but total number of elements more than that." Note the word "some" and the word "yet" in the error message. - Squally
That's an interesting error. Curious that test[1] works, as well as test[,1][1]. Even test[1:2,1:2] works, but not the original test[1,1]. - Lithium
Take a look at the ff and bigmemory packages. - Arabian
@AndreyShabalin Looking at the line in question, it appears that that case is using LENGTH(x), whereas the block just above it is using XLENGTH(x). As mentioned... it's a work in progress. - Goulash
@AndreyShabalin ...and here is the section in the headers that sets out the difference between LENGTH and XLENGTH. - Goulash
@Goulash I was trying to make sense of that too. Notice too that the index scalars are instantiated as R_len_t (standard vectors) and R_xlen_t (long support). - Conti
@Goulash and from line 62 above it for the values they may take. - Conti
@joran, I understand that it is a work in progress. My point actually was that the large matrix is still pretty functional (except for the issue in the question). - Lithium
This error does not occur in R 3.4.3 on Linux. - Churinga
22

A matrix is just an atomic vector with a dimension attribute, which lets R access it as a matrix. Your matrix is a vector of length 4000*900000, which is 3.6e+9 elements (the largest integer value is approx. 2.147e+9). Subsetting with a single index is supported for long atomic vectors (i.e. accessing elements beyond the 2.147e+9 limit), so just treat your matrix as a long vector.
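As a quick check of the sizes involved, here is a minimal sketch that compares the element count with R's largest integer. It assumes the 4000 x 900000 dimensions from the reproduction in the question; the variable names are only illustrative. Nothing large is allocated.

```r
# Dimensions from the reproduction in the question
n_row <- 4000
n_col <- 900000

# Plain numeric literals are doubles, so this product does not overflow
n_elements <- n_row * n_col

n_elements                          # 3.6e+09
.Machine$integer.max                # 2147483647
n_elements > .Machine$integer.max   # TRUE, so the matrix is a long vector
```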

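R stores matrices in column-major order, so the element at [row, col] sits at linear index (col - 1) * nrow + row. To retrieve, say, the value at test[ 2701 , 850000 ] from the 4000-row matrix, we could access it via:

i <- ( 850000 - 1 ) * 4000 + 2701
test[i]
#[1] 1

Note that this really is long vector subsetting, because an index of this magnitude does not fit in a 32-bit integer. Integer arithmetic on numbers of this size overflows:

2701L * 850000L
#[1] NA
#Warning message:
#In 2701L * 850000L : NAs produced by integer overflow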
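
A minimal sketch of the column-major index arithmetic, wrapped in a helper. The name linear_index is hypothetical (it is not a base R function), and the sanity check runs on a small matrix rather than the full-size one, so it needs very little memory.

```r
# Hypothetical helper (not part of base R): linear index of x[row, col]
# in a column-major matrix with `nrow` rows. as.numeric() keeps the
# arithmetic in double precision, so the product cannot overflow the
# 32-bit integer range even if the arguments are passed as integers.
linear_index <- function(row, col, nrow) {
  (as.numeric(col) - 1) * nrow + row
}

# Sanity check on a small matrix
z <- matrix(1:9, 3, 3)
stopifnot(z[linear_index(2, 3, nrow(z))] == z[2, 3])

# Index of element [2701, 850000] in a 4000-row matrix
i <- linear_index(2701, 850000, 4000)
i > .Machine$integer.max   # TRUE: beyond the 32-bit integer range
```

With the large matrix from the question, test[i] would then return the same element as test[2701, 850000] is meant to.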
Superficial answered 20/6, 2014 at 21:56 Comment(6)
Thanks for the great answer. Could you please help me understand your second statement? Why would 2701L * 850000L produce NA when 2701*850000 does not? I would have thought that by specifying L, it would store it as a long integer and make it capable of handling such a large number. - Putandtake
Because L explicitly specifies an integer type. class(2701) is "numeric" (similarly for 850000). (I think) R doesn't have native long integers available for the end user (see ?integer). (I don't know/remember why L is the integer code; maybe look in the R language manual?) - Quince
Thank you for your comment @BenBolker. For anyone else's benefit who may be reading, numeric and double are the same. So when talking about a "long vector", that really just means "a vector that is long," not a vector that is indexed by a long integer, because a long integer does not exist in R. So, when Simon wrote that 2701L*850000L results in NA, it is because we are forcing the use of the integer type, which has the limit of 2.147e9. Without the L, we are using numeric (which is double and has a much larger range). So the L has nothing to do with the long int of C :) - Putandtake
@Putandtake The long int type of C was traditionally 32 bits when it was introduced (and an int type was 16 bits). R has been around for a while, so I disagree with you and theorise (and Prof. Ripley agrees) that it is shorthand for long int. In fact I wrote a question and answer about this! - Conti
That would make a great deal of sense. Thanks for the history, Simon. - Putandtake
Careful with the index formula: (row - 1) * ncol + row is wrong. For example, with z = matrix(1:9, 3, 3), z[2, 3] is 8, but z[ (2 - 1) * 3 + 2 ] gives 5. R stores matrices column by column, so the correct linear index is (col - 1) * nrow + row: z[ (3 - 1) * 3 + 2 ] gives 8. - Belldame
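
To illustrate the integer versus double distinction discussed in these comments, here is a minimal sketch. The variable names are only illustrative, and the overflow warning is suppressed so the snippet runs quietly.

```r
typeof(2701L)   # "integer": 32-bit signed, maximum .Machine$integer.max
typeof(2701)    # "double"

# Integer multiplication overflows to NA (with a warning, suppressed here)
int_product <- suppressWarnings(2701L * 850000L)
is.na(int_product)   # TRUE

# Double multiplication represents the same product exactly
dbl_product <- 2701 * 850000
dbl_product > .Machine$integer.max   # TRUE
```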
3

An alternative shorthand is to extract the row first and then pick the column (which is now the i-th element of the resulting vector). For example:

test <- matrix(1,4000,900000) #no error 
test[1,1] #error
test[1, ][1] # no error

Of course, this adds some overhead, as the whole row is copied first, but it's more straightforward to read. It also works when extracting the column first and then the row.
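
A minimal sketch comparing the two orders of extraction, using a small stand-in matrix (the name m and its dimensions are only illustrative) so that it runs without much memory:

```r
# Small stand-in matrix; the one in the question has 4000 x 900000 entries
m <- matrix(seq_len(12), nrow = 3, ncol = 4)

row_first <- m[2, ][3]   # copies a whole row (ncol elements), then picks one
col_first <- m[, 3][2]   # copies a whole column (nrow elements), then picks one

# Both routes return the same element as direct two-index subsetting
stopifnot(row_first == m[2, 3], col_first == m[2, 3])
```

For a matrix shaped like the one in the question, a column holds 4000 elements while a row holds 900000, so extracting the column first copies far less data.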

Belldame answered 14/6, 2016 at 23:54 Comment(0)
0

TL;DR - try removing the cache=TRUE argument from the chunk header (inside the curly braces).

I had this error for a data frame with 1,720,238 observations and 302 variables, which is lower than the threshold @Simon has mentioned (1,720,238*302 = 5.2e+8 < 2.147e+9).

@subhash's answer was the hint that led me to try removing the cache argument entirely, which fixed the error for me.

Sperling answered 21/6, 2023 at 9:1 Comment(0)
-3

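library(knitr)

# Set default chunk options; cache.lazy = FALSE loads cached objects directly instead of lazy-loading them
knitr::opts_chunk$set(cache = TRUE, warning = FALSE, message = FALSE, cache.lazy = FALSE)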

Stereogram answered 23/9, 2020 at 7:26 Comment(3)
What? What kind of answer is this? - Emeldaemelen
This actually helped me, because my long vector problem stemmed from the Rmd caching mechanism; see bookdown.org/yihui/rmarkdown-cookbook/cache-lazy.html. But I agree, the answer is so out of context! - Culpa
For more info, the logic behind this answer is probably from here: https://mcmap.net/q/267828/-quot-long-vectors-not-supported-yet-quot-error-in-rmd-but-not-in-r-script - Sperling
