Multiple joins/merges with data.tables
I have two data.tables, DT and L:

> DT = data.table(x=rep(c("a","b","c"),each=3), y=c(1,3,6), v=1:9,key="x")
> L=data.table(yv=c(1L:8L,12L),lu=c(letters[8:1],letters[12]),key="yv")

> DT
   x y v
1: a 1 1
2: a 3 2
3: a 6 3
4: b 1 4
5: b 3 5
6: b 6 6
7: c 1 7
8: c 3 8
9: c 6 9

> L
   yv lu
1:  1  h
2:  2  g
3:  3  f
4:  4  e
5:  5  d
6:  6  c
7:  7  b
8:  8  a
9: 12  l

I would like to independently look up the corresponding value of lu from L for column y and for column v in DT. The following syntax provides the correct result, but is cumbersome to generate and then understand at a glance later:

> L[setkey(L[setkey(DT,y)],v)][,list(x,y=yv.1,v=yv,lu.1=lu.1,lu.2=lu)]
   x y v lu.1 lu.2
1: a 1 1    h    h
2: a 2 3    g    f
3: a 3 6    f    c
4: b 4 1    e    h
5: b 5 3    d    f
6: b 6 6    c    c
7: c 7 1    b    h
8: c 8 3    a    f
9: c 9 6   NA    c

(Edit: original post had L[setkey(L[setkey(DT,y)],v)][,list(x,y=yv,v=yv.1,lu.1=lu,lu.2=lu.1)] above, which incorrectly mixed up the y and v columns and looked up values.)

In SQL this would be simple/straightforward:

SELECT DT.*, L1.lu AS lu1, L2.lu AS lu2
FROM DT
LEFT JOIN L AS L1 ON DT.y = L1.yv
LEFT JOIN L AS L2 ON DT.v = L2.yv

Is there a more elegant way to use data.table to perform multiple joins? Note that I'm joining one table to another table twice in this example, but I am also interested in joining one table to multiple different tables.

Hanshansard asked 2/1, 2013 at 17:23

Great question. One trick is that i doesn't have to be keyed. Only x must be keyed.

There might be better ways. How about this:

> cbind( L[DT[,list(y)]], L[DT[,list(v)]], DT )
   yv lu yv lu x y v
1:  1  h  1  h a 1 1
2:  3  f  2  g a 3 2
3:  6  c  3  f a 6 3
4:  1  h  4  e b 1 4
5:  3  f  5  d b 3 5
6:  6  c  6  c b 6 6
7:  1  h  7  b c 1 7
8:  3  f  8  a c 3 8
9:  6  c  9 NA c 6 9

or, to illustrate, this is the same:

> cbind( L[J(DT$y)], L[J(DT$v)], DT )
   yv lu yv lu x y v
1:  1  h  1  h a 1 1
2:  3  f  2  g a 3 2
3:  6  c  3  f a 6 3
4:  1  h  4  e b 1 4
5:  3  f  5  d b 3 5
6:  6  c  6  c b 6 6
7:  1  h  7  b c 1 7
8:  3  f  8  a c 3 8
9:  6  c  9 NA c 6 9
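Newer data.table versions (1.9.6+) also support ad-hoc joins via the on= argument, so neither table needs a key at all. A sketch of the same two lookups written as update joins, assuming a current data.table:

```r
library(data.table)

DT <- data.table(x = rep(c("a","b","c"), each = 3), y = c(1,3,6), v = 1:9)
L  <- data.table(yv = c(1L:8L, 12L), lu = c(letters[8:1], letters[12]))

# Update joins: add each looked-up lu column to DT by reference.
# Rows of DT with no match in L get NA.
DT[L, lu.1 := i.lu, on = c(y = "yv")]  # look up lu for column y
DT[L, lu.2 := i.lu, on = c(v = "yv")]  # look up lu for column v
```

Each join is read as "for each row of DT, find the matching row of L and copy its lu over", which keeps DT's row order intact.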

merge could also be used, if the following feature request were implemented:

FR#2033 Add by.x and by.y to merge.data.table
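That feature request has since been implemented: current versions of data.table accept by.x and by.y in merge.data.table, so the SQL above translates fairly directly. A sketch, assuming a recent data.table (note that merge() sorts the result by the join column):

```r
library(data.table)

DT <- data.table(x = rep(c("a","b","c"), each = 3), y = c(1,3,6), v = 1:9)
L  <- data.table(yv = c(1L:8L, 12L), lu = c(letters[8:1], letters[12]))

# Two left joins, as in the SQL version; the duplicated lu column
# from the second join is disambiguated by suffixes.
res <- merge(DT, L, by.x = "y", by.y = "yv", all.x = TRUE)
res <- merge(res, L, by.x = "v", by.y = "yv", all.x = TRUE,
             suffixes = c(".1", ".2"))
```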

Gertie answered 2/1, 2013 at 17:51
Thanks for the cbind tip and the quick response. This method is less nested and easier to read than my original try. To clean up the resulting column names, I'll use something like cbind(DT, L[J(DT$y)][,list(lu.1=lu)], L[J(DT$v)])[,list(x,y,v,lu.1,lu.2=lu)]. My background in SQL has conditioned me to never rely on the row sort order, so combining the columns feels wrong, but obviously it works in R. I need to be "thinking in R" more. Huge fan of your work, BTW. – Hanshansard
Would implementing FR#2033 allow multiple tables to be merged? As I understand it, merge is used for two data frames, not many. I see some options like merge_all and merge_recurse in reshape, but I can't find many examples (especially with several sets of foreign keys), so I'm not sure how to use them or whether they would even work in this example. – Hanshansard
Generally in R and in data.table we don't use as many tables. Factors are similar to foreign keys in SQL, and these days, since character is cached by R automatically in its global cache, character is really efficient. Big, flat, denormalised tables are generally good. data.table only groups the columns that j needs, so there is no disadvantage to many columns like there is in (row-stored) SQL. – Gertie
Actually, thinking about it, X[Y[Z]] or X[Y][Z] are ways to join multiple tables. – Gertie
Whenever I say "no disadvantage", I don't mean it generally. I mean it within the context of that comment. – Gertie
But yes, you can rely on row order in R; that's guaranteed, unlike what you're used to in SQL. But the cbind method probably falls over if you have multiple matches or no matches, so you might need to embellish it by setting mult and nomatch appropriately to ensure a 1-1 join. That's where X[Y[Z]] might be better. – Gertie
That might be a bit quick to accept. I know I'm the main author, but others often have better answers than I do :). Accepting takes it off the radar. Also, don't forget you can always use SQL as-is in sqldf. – Gertie
I think my original try was using the nested X[Y[Z]] format, but it looks messy since I was using the setkeys to take care of the factors/foreign keys. As for denormalizing, I have 30M+ rows, so I thought it would be more manageable to keep in memory if I normalized it, but maybe there's no advantage. Yeah, I could use sqldf, but I think it is slower, right? (Plus, I like learning something new if it is potentially better.) OK, if you say so, I'll un-accept your answer :). – Hanshansard
Oh yes, your original without setkey is that, isn't it. If you replace the setkey with j=list() as I showed, you'll get fewer columns in the final result, though. To solve that, you can place more columns in j=list(...) than are used in the join, and rename them at the same time; then they should carry through. Admittedly this area is trickier than SQL, but then you have more control to join in different orders, with guaranteed row order at each stage (no need to self-join and order by). – Gertie
BTW, 30M rows is fairly small; it depends on the number of columns, of course. It's GB that count. A 10GB data.table on a 64GB RAM machine is standard. People use data.table up to the 2-billion-row limit (although that limit has been removed in R 3). – Gertie
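The X[Y[Z]] chaining mentioned in the comments generalises to any number of tables. A minimal sketch with hypothetical tables A, B and C sharing an id key (the names and data here are invented for illustration):

```r
library(data.table)

# Three hypothetical tables, each keyed on id
A <- data.table(id = 1:3, a = c("x","y","z"), key = "id")
B <- data.table(id = 2:4, b = 10:12,          key = "id")
C <- data.table(id = 1:4, cc = c(TRUE, FALSE, TRUE, FALSE), key = "id")

# B[C] looks up B's row for each row of C (NA where unmatched);
# wrapping with A[...] then looks up A for each of those rows.
res <- A[B[C]]
```

Each stage preserves the row order of the inner (i) table, so the final result has one row per row of C.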

© 2022 - 2024 — McMap. All rights reserved.