This is really a question about memory and data storage types. All of my discussion will be for 100,000 data elements so that everything doesn't bog down.
Let's examine a vector of length 100,000 vs. a list containing 100,000 separate elements.
object.size(rep(1L, 1E5))
#400048 bytes
object.size(replicate(1E5, 1, simplify = F))
#6400048 bytes
We went from 0.4 MB to 6.4 MB just by having the data stored differently!! When applying this to your function Map(veryfing_function, ...)
and only 1E5 elements:
dt <- data.table(letters = replicate(1e5, sample(letters[1:5], 3, TRUE), simplify = FALSE),
numbers = replicate(1e5, sample(letters[6:10], 3, TRUE), simplify = FALSE))
tic()
result2 <- Map(veryfing_function, dt[['letters']], dt[['numbers']])
toc()
# 11.93 sec elapsed
object.size(result2)
# 109,769,872 bytes
#example return:
[[1000]]
[[1000]]$`1`
[1] "cg" "bg" "cg"
[[1000]]$`2`
[1] "ch" "bh" "ch"
[[1000]]$`3`
[1] "ch" "bh" "ch"
We could do a simple modification to your function to return unnamed lists instead of splitting and we save a little bit of memory as the split()
appears to give named lists and I don't think we need the name:
verifying_function2 <- function(vec1, vec2) {
vector <- outer(vec1, vec2, paste0) #not as.vector
lapply(seq_len(ncol(vector)), function(i) vector[, i]) #no need to split, just return a list
}
tic()
result2_mod <- Map(verifying_function2, dt[['letters']], dt[['numbers']])
toc()
# 2.86 sec elapsed
object.size(result2_mod)
# 73,769,872 bytes
#example_output
[[1000]]
[[1000]][[1]]
[1] "cg" "bg" "cg"
[[1000]][[2]]
[1] "ch" "bh" "ch"
[[1000]][[3]]
[1] "ch" "bh" "ch"
The next step is why return a list of list at all. I am using lapply()
in the modified function just get to your output. Loosing the lapply()
would instead a list of matrices which I think would be as helpful:
tic()
result2_mod2 <- Map(function(x,y) outer(x, y, paste0), dt[['letters']], dt[['numbers']])
toc()
# 1.66 sec elapsed
object.size(result2_mod2)
# 68,570,336 bytes
#example output:
[[1000]]
[,1] [,2] [,3]
[1,] "cg" "ch" "ch"
[2,] "bg" "bh" "bh"
[3,] "cg" "ch" "ch"
The last logical step is to just return a matrix. Note this whole time we've been fighting against simplification with mapply(..., simplify = F)
which is equivalent to Map()
.
tic()
result2_mod3 <- mapply(function(x,y) outer(x, y, paste0), dt[['letters']], dt[['numbers']])
toc()
# 1.3 sec elapsed
object.size(result2_mod3)
# 7,201,616 bytes
If you want some dimensionality, you can convert the large matrix into a 3D array:
tic()
result2_mod3_arr <- array(as.vector(result2_mod3), dim = c(3,3,1E5))
toc()
# 0.02 sec elapsed
result2_mod3_arr[,,1000]
[,1] [,2] [,3]
[1,] "cg" "ch" "ch"
[2,] "bg" "bh" "bh"
[3,] "cg" "ch" "ch"
object.size(result2_mod3_arr)
# 7,201,624 bytes
I also looked at @marbel's answer - it is faster and takes up only slightly more memory. My approach would likely benefit by converting the initial dt
list to something else sooner.
tic()
dt1 = as.data.table(do.call(rbind, dt[['letters']]))
dt2 = as.data.table(do.call(rbind, dt[['numbers']]))
res = data.table()
combs = expand.grid(names(dt1), names(dt2), stringsAsFactors=FALSE)
set(res, j=paste0(combs[,1], combs[,2]), value=paste0( dt1[, get(combs[,1])], dt2[, get(combs[,2])] ) )
toc()
# 0.14 sec elapsed
object.size(res)
# 7,215,384 bytes
tl;dr - convert your object to a matrix or data.frame to make it easier on your memory. It also makes sense that the data.table
versions of your function takes longer - there's likely more overhead than just directly applying mapply()
.
set(res, j = paste0(combs[, 1], combs[ ,2])...
I also think thetic()
should start before thedo.call(rbind, vec1))
. But this is a brilliant approach. Mapply = 116s, future_map2 = 81s, @Slippery = 2.5s, losing the loop = 1.4s. – Orang