What is the fastest way of creating an identifier for multi-row groups with data.table in R?

I have a data.table that associates each id with a set of values:

library(data.table)

dt <- data.table(
  id = rep(c("a", "b", "c"), each = 2),
  value1 = c(1, 1, 1, 2, 1, 1),
  value2 = c(0, 3, 0, 3, 0, 3)
)
dt
#>    id value1 value2
#> 1:  a      1      0
#> 2:  a      1      3
#> 3:  b      1      0
#> 4:  b      2      3
#> 5:  c      1      0
#> 6:  c      1      3

As you can see, the ids a and c both identify the same set of values. So I want to create a "pattern id" that identifies the set of values shared by the ids a and c (note: an id might identify more than two rows; I limited it to two rows here for simplicity).

I did manage to come up with a solution using nested data.tables and match():

dt <- dt[, .(data = list(.SD)), by = id]

unique_groups <- unique(dt$data)
dt[, pattern_id := match(data, unique_groups)]
dt[, data := NULL]

dt
#>    id pattern_id
#> 1:  a          1
#> 2:  b          2
#> 3:  c          1

It does the trick, but it is not as fast as I'd like. The match() documentation is pretty clear about its efficiency with lists:

Matching for lists is potentially very slow and best avoided except in simple cases.

As you can see, I don't need the actual pattern data in my final result, only a table that associates the ids with the pattern ids. I feel like nesting the data, using it to match, and then dropping it afterwards is a bit wasteful, but I'm not sure there's a better way. I was thinking of something that transforms each group into a string, or, even better, something that avoids the nesting altogether, but I couldn't come up with anything better than what I have now.
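For illustration, the string idea mentioned above can be sketched as follows (a minimal sketch, not necessarily the fastest form): collapse each id's rows into one key string, then number the distinct keys. Matching on character vectors avoids the slow list-matching path.

```r
library(data.table)

dt <- data.table(
  id = rep(c("a", "b", "c"), each = 2),
  value1 = c(1, 1, 1, 2, 1, 1),
  value2 = c(0, 3, 0, 3, 0, 3)
)

# one key string per id: columns joined with "|", rows joined with ";"
keys <- dt[, .(key = paste(value1, value2, sep = "|", collapse = ";")), by = id]
keys[, pattern_id := match(key, unique(key))]  # match() on character is fast
keys[, key := NULL]
# pattern_id: a -> 1, b -> 2, c -> 1
```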

I have created a bigger dataset to play around with and test different solutions:

set.seed(0)
size <- 1000000
dt <- data.table(
  id = rep(1:(size / 2), each = 2),
  value1 = sample(1:10, size, replace = TRUE),
  value2 = sample(1:10, size, replace = TRUE)
)
Kevon answered 22/3, 2022 at 21:8 Comment(6)
You can get some improvement with fastmatch::fmatch (Brisco)
Does each id always come in twos? Two a, two b, two c, etc.? (Healing)
In your toy example, if rows 5 and 6 were switched, would you still consider (a) and (c) equivalent? I did in my solution, but I see that your approach does not consider them equivalent. (Witha)
@Healing Not necessarily. I should have added that to the description, will do now. Thanks! (Kevon)
@Witha Good catch. Switching the rows would make them not equivalent. That's not something I had thought about before you mentioned it, so thanks for noticing that it might be a constraint. (Kevon)
I moved the benchmarking bit into a wiki answer below. (Healing)

Updated (to remove join):

This one replicates your approach (i.e., it requires that the order matches as well as the values):

unique(
  dt[, pattern:=.(paste0(c(value1,value2), collapse=",")), by=id][,.(id,pattern)]
)[,grp:=.GRP, by=pattern][,pattern:=NULL]

       id   grp
   <char> <int>
1:      a     1
2:      b     2
3:      c     1

Prior solution:

library(magrittr)  # for the %>% pipe

dt[dt[, .(paste0(sort(c(value1, value2)), collapse=",")), by=id] %>% 
     .[, pattern:=.GRP, by=V1] %>% 
     .[, V1:=NULL], on=.(id)]

Output:

       id value1 value2 pattern
   <char>  <num>  <num>   <int>
1:      a      1      0       1
2:      a      1      3       1
3:      b      1      0       2
4:      b      2      3       2
5:      c      1      0       1
6:      c      1      3       1
Witha answered 22/3, 2022 at 21:32 Comment(1)
Thanks for joining in. I really like using paste() to create a string from the set of values and then identifying the groups based on this string. I have added a benchmark to the question that looks into your solution, and it seems to be the best. I adapted it a bit to not rely on unique(), but I tested my adaptation and your solution and they performed basically the same. (Kevon)

We can try the code below:

dt[
    ,
    q := toString(unlist(.SD)), id
][
    ,
    pattern_id := .GRP, q
][
    ,
    q := NULL
][]

or

dt[
    ,
    q := toString(unlist(.SD)),
    id
][
    ,
    pattern_id := as.integer(factor(match(q, q)))
][
    ,
    q := NULL
][]

which gives

   id value1 value2 pattern_id
1:  a      1      0          1
2:  a      1      3          1
3:  b      1      0          2
4:  b      2      3          2
5:  c      1      0          1
6:  c      1      3          1
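To see what intermediate key this builds, note that unlist(.SD) flattens a group column by column (all of value1, then all of value2), so for id a the key comes from c(1, 1, 0, 3). A small sketch using the question's data:

```r
library(data.table)

dt <- data.table(
  id = rep(c("a", "b", "c"), each = 2),
  value1 = c(1, 1, 1, 2, 1, 1),
  value2 = c(0, 3, 0, 3, 0, 3)
)

# inspect the per-group key that toString(unlist(.SD)) produces
dt[, .(key = toString(unlist(.SD))), by = id]
# keys: a -> "1, 1, 0, 3", b -> "1, 2, 0, 3", c -> "1, 1, 0, 3"
```

Because the key preserves row order within each column, swapped rows produce a different key, matching the OP's stated constraint.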
Menarche answered 22/3, 2022 at 22:54 Comment(1)
Great suggestion! I didn't know about .GRP, it's great! Your solution outperformed mine, but it's a bit slower than another one posted here, so I'm marking the other as the answer. Thanks anyway! (Kevon)

Here are some benchmarks of the solutions that don't rely on each id necessarily identifying two rows; the results are below.

library(data.table)

set.seed(0)
size <- 500000
dt <- data.table(
  id = rep(1:(size / 2), each = 2),
  value1 = sample(1:10, size, replace = TRUE),
  value2 = sample(1:10, size, replace = TRUE)
)

my_solution <- function(x) {
  x <- x[, .(data = list(.SD)), by = id]

  unique_groups <- unique(x$data)
  x[, pattern_id := match(data, unique_groups)]
  x[, data := NULL]
  x[]
}

langtang_solution <- function(x) {
  x <- x[, .(data = paste0(value1, "|", value2, collapse = ";")), by = id]
  x[, pattern_id := .GRP, by = data]
  x[, data := NULL]
  x[]
}

thomasiscoding_solution <- function(x) {
  x <- x[, .(data = toString(unlist(.SD))), by = id]
  x[, pattern_id := .GRP, by = data]
  x[, data := NULL]
  x[]
}

identical(my_solution(dt), langtang_solution(dt))
#> [1] TRUE
identical(my_solution(dt), thomasiscoding_solution(dt))
#> [1] TRUE

microbenchmark::microbenchmark(
  my_solution(dt),
  langtang_solution(dt),
  thomasiscoding_solution(dt),
  times = 50L
)
#> Unit: seconds
#>                         expr      min       lq     mean   median       uq
#>              my_solution(dt) 3.174106 3.566495 3.818829 3.793850 4.015176
#>        langtang_solution(dt) 1.369860 1.467013 1.596558 1.529327 1.649607
#>  thomasiscoding_solution(dt) 3.014511 3.154224 3.280713 3.256732 3.370015
#>       max neval
#>  4.525275    50
#>  2.279064    50
#>  3.681657    50

This was very enriching. I didn't know about .GRP, which in my tests performs very similarly to match(), though a (very small) bit better. The best answer seems to be using paste() to convert each group into a string and then finding the groups based on that string.
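As a minimal illustration of .GRP (with made-up data): inside j, .GRP is the integer counter of the current group, assigned in order of first appearance, which is exactly the shape a pattern id needs.

```r
library(data.table)

x <- data.table(key = c("p", "p", "q", "p", "r"))
x[, grp := .GRP, by = key]  # groups numbered in order of first appearance
x$grp
# p rows get 1, q gets 2, r gets 3 -> c(1, 1, 2, 1, 3)
```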

Healing answered 22/3, 2022 at 21:9 Comment(0)

With toString, as suggested by the data.table error message you get when using a list column in by:

Column or expression 1 of 'by' is type 'list' which is not currently supported.
As a workaround, consider converting the column to a supported type, e.g. by=sapply(list_col, toString)

dt <- dt[, .(data = list(.SD)), by = id]
dt[, pattern_id :=.GRP, by = sapply(data, toString)]
dt[,unlist(data,recursive=F),by=.(id,pattern_id)]

       id pattern_id value1 value2
   <char>      <int>  <num>  <num>
1:      a          1      1      0
2:      a          1      1      3
3:      b          2      1      0
4:      b          2      2      3
5:      c          1      1      0
6:      c          1      1      3

However, this is slower than match().

Ava answered 22/3, 2022 at 21:33 Comment(0)

Assuming each id is repeated exactly twice, "reshape": convert the two rows per id into one row with four columns. Then get the group ID using .GRP, grouping by all columns except id:

res <- dt[, c(.SD[ 1 ], .SD[ 2 ]), by = id]
setnames(res, make.unique(colnames(res)))
res[, pattern_id := .GRP, by = res[, -1] ][, .(id, pattern_id)]
#             id pattern_id
#      1:      1          1
#      2:      2          2
#      3:      3          3
#      4:      4          4
#      5:      5          5
#    ---                  
# 499996: 499996       1010
# 499997: 499997       3175
# 499998: 499998       3996
# 499999: 499999       3653
# 500000: 500000       4217

Using the bigger dataset takes about half a second.


Edit: another version using dcast, but it is about 8x slower:

res <- dcast(dt, id ~ value1 + value2, length)
res[, pattern_id :=.GRP, by = res[, -1] ][, .(id, pattern_id)]
Healing answered 22/3, 2022 at 22:50 Comment(1)
Hi zx8754, unfortunately the ids don't always come in twos, as you asked in the original question. Each id can identify a set of values with an arbitrary number of rows. It would be a great solution otherwise! (Kevon)

How about reshaping wider and using paste0()?

library(dplyr)
library(tidyr)

dt <- dt %>% group_by(id) %>%
  mutate(inst = row_number(id)) %>% 
  pivot_wider(values_from = c(value1, value2),
              names_from = inst) %>% 
  mutate(pattern_id = paste0(value1_1, value1_2, value2_1, value2_2))
Underpass answered 22/3, 2022 at 21:24 Comment(1)
Hi Andrea. The problem (which I have since added to the question) is that each id is not limited to two rows; in my case an id can identify a set of values with an arbitrary number of rows. (Kevon)