R data.table column names not working within a function
G

3

10

I am trying to use a data.table within a function, and I am trying to understand why my code is failing. I have a data.table as follows:

DT <- data.table(my_name=c("A","B","C","D","E","F"),my_id=c(2,2,3,3,4,4))
> DT
   my_name my_id
1:       A     2
2:       B     2
3:       C     3
4:       D     3
5:       E     4
6:       F     4

I am trying to create all pairs of "my_name" with different values of "my_id", which for DT would be:

Var1 Var2    
A    C
A    D
A    E
A    F
B    C
B    D
B    E
B    F
C    E
C    F
D    E
D    F

I have a function to return all pairs of "my_name" for a given pair of values of "my_id" which works as expected.

get_pairs <- function(id1,id2,tdt) {
    return(expand.grid(tdt[my_id==id1,my_name],tdt[my_id==id2,my_name]))
}
> get_pairs(2,3,DT)
Var1 Var2
1    A    C
2    B    C
3    A    D
4    B    D

Now, I want to execute this function for all pairs of ids, which I try to do by finding all pairs of ids and then using mapply with the get_pairs function.

> combn(unique(DT$my_id),2)
     [,1] [,2] [,3]
[1,]    2    2    3
[2,]    3    4    4
tid1 <- combn(unique(DT$my_id),2)[1,]
tid2 <- combn(unique(DT$my_id),2)[2,]
mapply(get_pairs, tid1, tid2, DT)
Error in expand.grid(tdt[my_id == id1, my_name], tdt[my_id == id2, my_name]) : 
  object 'my_id' not found

Again, if I try to do the same thing without an mapply, it works.

get_pairs(tid1[1],tid2[1],DT)
Var1 Var2
1    A    C
2    B    C
3    A    D
4    B    D

Why does this function fail only when used within an mapply? I think this has something to do with the scope of data.table names, but I'm not sure.

Alternatively, is there a different/more efficient way to accomplish this task? I have a large data.table with a third id "sample" and I need to get all of these pairs for each sample (e.g. operating on DT[sample=="sample_id",] ). I am new to the data.table package, and I may not be using it in the most efficient way.

Giesecke answered 25/6, 2015 at 13:34 Comment(4)
Sorry, I'm not sure about why the mapply is not working and so didn't mention it in my answer.Poetry
for mapply, it works if you put DT directly into the function and not as parameter (although it doesn't solve the "why is it not working" part...)Liberalize
Does each id always have exactly two names?Poetry
Each id may have one or more names, and either ids or names may be duplicated. Additionally, name-id pairs are not guaranteed to be unique.Giesecke
N
3

Why does this function fail only when used within an mapply? I think this has something to do with the scope of data.table names, but I'm not sure.

The failure has nothing to do with scoping in this case. mapply vectorizes over its arguments: on each call it takes one element from each argument and passes it to the function. A data.table is a list whose elements are its columns, so the first call receives the column my_name (a character vector) instead of the complete data.table, and my_id cannot be found inside it.
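The column-by-column behaviour can be seen without data.table at all. The following is a minimal sketch that uses a plain data.frame as a stand-in for DT (data.frames are lists of columns too); the names df and seen are made up for the illustration.

```r
# Stand-in for DT: a plain data.frame, which is also a list of columns.
df <- data.frame(my_name = c("A", "B"), my_id = c(2, 3),
                 stringsAsFactors = FALSE)

is.list(df)   # TRUE
length(df)    # 2: one element per column

# Record the class of whatever arrives as the second argument on each call.
seen <- mapply(function(i, x) class(x), 1:2, df)
seen          # "character" "numeric": the columns, never the whole data.frame
```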

If you want to pass the complete data.table to mapply, you should use the MoreArgs parameter. Then your function will work:

res <- mapply(get_pairs, tid1, tid2, MoreArgs = list(tdt=DT), SIMPLIFY = FALSE)
do.call("rbind", res)
  Var1 Var2
1     A    C
2     B    C
3     A    D
4     B    D
5     A    E
6     B    E
7     A    F
8     B    F
9     C    E
10    D    E
11    C    F
12    D    F
Nethermost answered 25/6, 2015 at 14:46 Comment(2)
Ah ok, that makes sense. Is this because a data.table is also a list()?Giesecke
@Giesecke Yeap, data.tables, data.frames, tbl_dfs are lists with some additional properties.Nethermost
P
4

Enumerate all possible pairs

u_name    <- unique(DT$my_name)
all_pairs <- CJ(u_name,u_name)[V1 < V2]

Enumerate observed pairs

obs_pairs <- unique(
  DT[,{un <- unique(my_name); CJ(un,un)[V1 < V2]}, by=my_id][, !"my_id"]
)

Take the difference

all_pairs[!J(obs_pairs)]

CJ is like expand.grid except that it creates a data.table with all of its columns as its key. A data.table X must be keyed for a join X[J(Y)] or a not-join X[!J(Y)] (like the last line) to work. The J is optional, but makes it more obvious that we're doing a join.
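As a small illustration of the cross join and the not-join, here is a sketch on three made-up names. It assumes the data.table package is installed, names the CJ columns explicitly so that V1 and V2 exist regardless of how CJ auto-names its inputs, and uses the on= argument for the anti-join rather than relying on keys, which is a substitution for the keyed J() form described above.

```r
library(data.table)

x <- c("A", "B", "C")

# Cross join of x with itself, keeping each unordered pair once.
all_pairs <- CJ(V1 = x, V2 = x)[V1 < V2]

# Hypothetical pair that shares an id and should therefore be excluded.
observed <- data.table(V1 = "A", V2 = "B")

# Not-join: rows of all_pairs that have no match in observed.
remaining <- all_pairs[!observed, on = c("V1", "V2")]
remaining
```

Here remaining holds the pairs (A, C) and (B, C).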


Simplifications. @CathG pointed out that there is a cleaner way of constructing obs_pairs if you always have two sorted "names" for each "id" (as in the example data): use as.list(un) in place of CJ(un,un)[V1 < V2].

Poetry answered 25/6, 2015 at 14:26 Comment(2)
Sorry, I did not mention that there may be duplicates in "my_name" but your solution works if there are no duplicates. This is much more elegant than my approach though. Clearly I need to learn to use joins more.Giesecke
@Giesecke I've edited for that case now (if I understand it correctly).Poetry
S
4

The function debugonce() is extremely useful in these scenarios.

debugonce(mapply)
mapply(get_pairs, tid1, tid2, DT)

# Hit enter twice
# from within BROWSER
debugonce(FUN)
# Hit enter twice
# you'll be inside your function, and then type tdt
tdt
# [1] "A" "B" "C" "D" "E" "F"
Q # (to quit debugging mode)

That is not the data.table: the function received only the my_name column. mapply() takes one element from each input argument per call and passes those to your function. Here you've provided a data.table, which is also a list, so instead of passing the entire data.table it passes each element of the list (each column) in turn.

So, you can get around this by doing:

mapply(get_pairs, tid1, tid2, list(DT))

But mapply() simplifies the result by default, and therefore you'd get a matrix back. You'll have to use SIMPLIFY = FALSE.

mapply(get_pairs, tid1, tid2, list(DT), SIMPLIFY = FALSE)
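The effect of the default simplification can be checked in base R. This is a sketch only; f is a hypothetical helper that returns a two-column data.frame, as get_pairs does.

```r
# Hypothetical helper: returns a two-column data.frame, like get_pairs.
f <- function(a, b) expand.grid(a, b)

# Default SIMPLIFY = TRUE: the data.frames are flattened into a list-matrix.
simplified <- mapply(f, 1:3, 4:6)
is.matrix(simplified)       # TRUE
dim(simplified)             # 2 rows (Var1, Var2) by 3 calls

# SIMPLIFY = FALSE: one intact data.frame per call.
kept <- mapply(f, 1:3, 4:6, SIMPLIFY = FALSE)
is.data.frame(kept[[1]])    # TRUE
```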

Or simply use Map:

Map(get_pairs, tid1, tid2, list(DT))

Use rbindlist() to bind the results.
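Putting the pieces together, a sketch of the whole pipeline might look like this. It assumes the data.table package is installed and reuses DT and get_pairs from the question; ids, res and pairs are names chosen for the example.

```r
library(data.table)

DT <- data.table(my_name = c("A", "B", "C", "D", "E", "F"),
                 my_id   = c(2, 2, 3, 3, 4, 4))

get_pairs <- function(id1, id2, tdt) {
  expand.grid(tdt[my_id == id1, my_name], tdt[my_id == id2, my_name],
              stringsAsFactors = FALSE)
}

# Each column of ids is one pair of distinct ids.
ids <- combn(unique(DT$my_id), 2)

# list(DT) has length one, so the whole table is recycled to every call.
res <- Map(get_pairs, ids[1, ], ids[2, ], list(DT))

# Stack the per-pair data.frames into a single data.table.
pairs <- rbindlist(res)
nrow(pairs)   # 12
```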

HTH

Signal answered 25/6, 2015 at 14:50 Comment(0)