I have a dataframe that identifies a set of values with an id:
library(data.table)
dt <- data.table(
id = rep(c("a", "b", "c"), each = 2),
value1 = c(1, 1, 1, 2, 1, 1),
value2 = c(0, 3, 0, 3, 0, 3)
)
dt
#> id value1 value2
#> 1: a 1 0
#> 2: a 1 3
#> 3: b 1 0
#> 4: b 2 3
#> 5: c 1 0
#> 6: c 1 3
As you can see, the ids a
and c
identify both the same set of values. So I want to create a "pattern id", that identifies the set of values associated with the ids a
and c
(obs: an id might identify more than two rows, I just limited them to two rows here for the sake of simplicity).
I did manage to come up with a solution using nested data.tables and match()
:
dt <- dt[, .(data = list(.SD)), by = id]
unique_groups <- unique(dt$data)
dt[, pattern_id := match(data, unique_groups)]
dt[, data := NULL]
dt
#> id pattern_id
#> 1: a 1
#> 2: b 2
#> 3: c 1
It does the trick, but it is not as fast as I'd like it to be. match()
documentation is pretty clear regarding its efficiency with lists:
Matching for lists is potentially very slow and best avoided except in simple cases.
As you can see, I don't need the actual pattern data in my final result, only a table that associates the ids to the pattern ids. I feel like nesting the data, using it to match and then dropping it afterwards is a bit wasteful, but not sure if there's a better way. I was thinking in something that transform each dataframe into a string, or, even better, something that avoided the nesting altogether, but I couldn't come up with anything better than what I have now.
I have created a bigger dataset to play around with and test different solutions:
set.seed(0)
size <- 1000000
dt <- data.table(
id = rep(1:(size / 2), each = 2),
value1 = sample(1:10, size, replace = TRUE),
value2 = sample(1:10, size, replace = TRUE)
)
fastmatch::fmatch
– Brisco