How to do faster list-column operations inside data.table

Asked 12/7, 2019 at 23:58 Answered 13/7, 2019 at 15:5

Due to memory (and speed) issues, I was hoping to do some computations inside a data.table instead of doing them outside it.

The following code has 100.000 rows, but I'm working with 40 million rows.

library(tictoc)
library(data.table) # version 1.11.8
library(purrr)
library(furrr)
plan(multiprocess)

veryfing_function <- function(vec1, vec2){
  vector <- as.vector(outer(vec1, vec2, paste0))
  split(vector, ceiling(seq_along(vector)/length(vec1)))
}


dt <- data.table(letters = replicate(1e6, sample(letters[1:5], 3, TRUE), simplify = FALSE),
                 numbers = replicate(1e6, sample(letters[6:10], 3, TRUE), simplify = FALSE))



tic()
result1 <- future_map2(dt$letters, dt$numbers, veryfing_function)
toc()


tic()
result2 <- mapply(veryfing_function, dt$letters, dt$numbers, SIMPLIFY = FALSE)
toc()



tic()
dt[, result := future_map2(letters, numbers, veryfing_function)]
toc()


tic()
dt[, result2 := mapply(veryfing_function, letters, numbers, SIMPLIFY = FALSE)]
toc()

The output is the same for all variants and as expected. The benchmarks were:

26 secs 72 secs 38 secs 105 secs , so I saw no advantage in using the functions inside the data.table or using mapply.

My major concern is memory, which is not resolved with the future_map2 solution.

I´m using Windows right now, so I was hoping to find a solution for speed other than mclapply, maybe some data.table trick I´m not seeing (keying is not supported for lists)

Brunella answered 12/7, 2019 at 23:58 Comment(0)

This is really a question about memory and data storage types. All of my discussion will be for 100,000 data elements so that everything doesn't bog down.

Let's examine a vector of length 100,000 vs. a list containing 100,000 separate elements.

object.size(rep(1L, 1E5))
#400048 bytes
object.size(replicate(1E5, 1, simplify = F))
#6400048 bytes

We went from 0.4 MB to 6.4 MB just by having the data stored differently!! When applying this to your function Map(veryfing_function, ...) and only 1E5 elements:

dt <- data.table(letters = replicate(1e5, sample(letters[1:5], 3, TRUE), simplify = FALSE),
                 numbers = replicate(1e5, sample(letters[6:10], 3, TRUE), simplify = FALSE))

tic()
result2 <- Map(veryfing_function, dt[['letters']], dt[['numbers']])
toc()
# 11.93 sec elapsed
object.size(result2)
# 109,769,872 bytes
#example return:
[[1000]]
[[1000]]$`1`
[1] "cg" "bg" "cg"

[[1000]]$`2`
[1] "ch" "bh" "ch"

[[1000]]$`3`
[1] "ch" "bh" "ch"

We could do a simple modification to your function to return unnamed lists instead of splitting and we save a little bit of memory as the split() appears to give named lists and I don't think we need the name:

verifying_function2 <- function(vec1, vec2) {
  vector <- outer(vec1, vec2, paste0) #not as.vector
  lapply(seq_len(ncol(vector)), function(i) vector[, i]) #no need to split, just return a list
}

tic()
result2_mod <- Map(verifying_function2, dt[['letters']], dt[['numbers']])
toc()
# 2.86 sec elapsed
object.size(result2_mod)
# 73,769,872 bytes

#example_output
[[1000]]
[[1000]][[1]]
[1] "cg" "bg" "cg"

[[1000]][[2]]
[1] "ch" "bh" "ch"

[[1000]][[3]]
[1] "ch" "bh" "ch"

The next step is why return a list of list at all. I am using lapply() in the modified function just get to your output. Loosing the lapply() would instead a list of matrices which I think would be as helpful:

tic()
result2_mod2 <- Map(function(x,y) outer(x, y, paste0), dt[['letters']], dt[['numbers']])
toc()
# 1.66 sec elapsed
object.size(result2_mod2)
# 68,570,336 bytes

#example output:
[[1000]]
     [,1] [,2] [,3]
[1,] "cg" "ch" "ch"
[2,] "bg" "bh" "bh"
[3,] "cg" "ch" "ch"

The last logical step is to just return a matrix. Note this whole time we've been fighting against simplification with mapply(..., simplify = F) which is equivalent to Map().

tic()
result2_mod3 <- mapply(function(x,y) outer(x, y, paste0), dt[['letters']], dt[['numbers']])
toc()
# 1.3 sec elapsed
object.size(result2_mod3)
# 7,201,616 bytes

If you want some dimensionality, you can convert the large matrix into a 3D array:

tic()
result2_mod3_arr <- array(as.vector(result2_mod3), dim = c(3,3,1E5))
toc()
# 0.02 sec elapsed
result2_mod3_arr[,,1000]
     [,1] [,2] [,3]
[1,] "cg" "ch" "ch"
[2,] "bg" "bh" "bh"
[3,] "cg" "ch" "ch"
object.size(result2_mod3_arr)
# 7,201,624 bytes

I also looked at @marbel's answer - it is faster and takes up only slightly more memory. My approach would likely benefit by converting the initial dt list to something else sooner.

tic()
dt1 = as.data.table(do.call(rbind, dt[['letters']]))
dt2 = as.data.table(do.call(rbind, dt[['numbers']]))

res = data.table()

combs = expand.grid(names(dt1), names(dt2), stringsAsFactors=FALSE)

set(res, j=paste0(combs[,1], combs[,2]), value=paste0( dt1[, get(combs[,1])], dt2[, get(combs[,2])] ) )
toc()
# 0.14 sec elapsed
object.size(res)
# 7,215,384 bytes

tl;dr - convert your object to a matrix or data.frame to make it easier on your memory. It also makes sense that the data.table versions of your function takes longer - there's likely more overhead than just directly applying mapply().

Orang answered 13/7, 2019 at 15:5 Comment(0)

This is a different approach to the problem but I believe it can be useful.

The output is different so I'm not sure without more info if it will serve your concrete problem but here it goes, hope it helps!

The timing is 1.165 seconds versus 87 secs of mapply.

vec1 = replicate(1e6, sample(letters[1:5], 3, TRUE), simplify = FALSE)
vec2 = replicate(1e6, sample(letters[6:10], 3, TRUE), simplify = FALSE)

dt <- data.table(v1 = vec1, v2 = vec2)
dt1 = as.data.table(do.call(rbind, vec1))
dt2 = as.data.table(do.call(rbind, vec2))
res = data.table()

tic()
cols1 = names(dt1)
cols2 = names(dt2)
combs = expand.grid(cols1, cols2, stringsAsFactors=FALSE)

for(i in 1:nrow(combs)){
  vars = combs[i, ]
  set(res, j=paste0(vars[,1], vars[,2]), value=paste0( dt1[, get(vars[,1])], dt2[, get(vars[,2])] ) )
}
toc()

Slippery answered 13/7, 2019 at 3:32 Comment(1)

You can also get rid of the loop at the end with set(res, j = paste0(combs[, 1], combs[ ,2])... I also think the tic() should start before the do.call(rbind, vec1)). But this is a brilliant approach. Mapply = 116s, future_map2 = 81s, @Slippery = 2.5s, losing the loop = 1.4s. – Orang 13/7, 2019 at 12:58

Recommended topics

Hot tags