Update subset of data.table based on join
Asked Answered
G

3

22

I have two data tables, DT1 and DT2:

set.seed(1)
DT1<-data.table(id1=rep(1:3,2),id2=sample(letters,6), v1=rnorm(6), key="id2")
DT1
##    id1 id2         v1
## 1:   2   e  0.7383247
## 2:   1   g  1.5952808
## 3:   2   j  0.3295078
## 4:   3   n -0.8204684
## 5:   3   s  0.5757814
## 6:   1   u  0.4874291

DT2<-data.table(id2=c("n","u"), v1=0, key="id2")
DT2
##    id2 v1
## 1:   n  0
## 2:   u  0

I would like to update DT1 based on a join with DT2, but only for a subset of DT1. For example, for DT1[id1==3], I would expect the value of v1 in row 4 to be updated as in the following result:

DT1
##    id1 id2         v1
## 1:   2   e  0.7383247
## 2:   1   g  1.5952808
## 3:   2   j  0.3295078
## 4:   3   n          0
## 5:   3   s  0.5757814
## 6:   1   u  0.4874291

I know how to update a table (using the := assignment operator), how to join the tables (DT1[DT2]), and how to subset a table (DT1[id1==3]). However I'm not sure how to do all three at once.

EDIT: Note that the original example only attempts to update one column, but my actual data requires updating many columns. Consider the additional scenarios in DT1b and DT2b:

set.seed(2)
DT1b<-DT1[,v2:=rnorm(6)] # Copy DT1 and add a new column
setkey(DT1b,id2)
DT1b
##    id1 id2         v1          v2
## 1:   2   e  0.7383247 -0.89691455
## 2:   1   g  1.5952808  0.18484918
## 3:   2   j  0.3295078  1.58784533
## 4:   3   n -0.8204684 -1.13037567
## 5:   3   s  0.5757814 -0.08025176
## 6:   1   u  0.4874291  0.13242028

DT2b<-rbindlist(list(DT2,data.table(id2="e",v1=0))) # Copy DT2 and add a new row
DT2b[,v2:=-1] # Add a new column to DT2b
setkey(DT2b,id2)
DT2b
##    id2 v1 v2
## 1:   e  0 -1
## 2:   n  0 -1
## 3:   u  0 -1

Based on the helpful answers from @nmel and @BlueMagister, I came up with this solution for the updated scenario:

DT1b[DT2b[DT1b[id1 %in% c(1,2)],nomatch=0],c("v1","v2"):=list(i.v1,i.v2)]
DT1b
##    id1 id2         v1          v2
## 1:   2   e  0.0000000 -1.00000000
## 2:   1   g  1.5952808  0.18484918
## 3:   2   j  0.3295078  1.58784533
## 4:   3   n -0.8204684 -1.13037567
## 5:   3   s  0.5757814 -0.08025176
## 6:   1   u  0.0000000 -1.00000000
Gaea answered 6/2, 2013 at 3:20 Comment(0)
S
18

The easiest way I can think of is to key by id1 as well. eg

setkey(DT1, id2,id1)
DT2[, id1 := 3]
setkey(DT2, id2, id1)

# use i.v1 to reference v1 from the i component
DT1[DT2, v1 := i.v1 ]


DT1
   id1 id2        v1
1:   2   e 0.7383247
2:   1   g 1.5952808
3:   2   j 0.3295078
4:   3   n 0.0000000
5:   3   s 0.5757814
6:   1   u 0.4874291
Siple answered 6/2, 2013 at 3:41 Comment(1)
Thanks for your answer and other comments, @mnel. This approach works for the original example and can be easily extended for the case when multiple columns need to be updated rather than just one. But if the subset is more complicated (e.g. id1==3 | id1 ==2; see the updated question), it seems like adding the id1 key to DT2 gets more complicated.Gaea
S
10

This is similar to mnel's solution but uses ifelse instead of a second key.

DT1[DT2, v1  := ifelse(id1==3, i.v1, v1),nomatch=0]
Sincere answered 6/2, 2013 at 4:26 Comment(5)
Thanks for your answer, @BlueMagister. This definitely works for the example. How would you modify your answer if the tables required updating many columns, rather than just one? Extending it to two columns, I think the middle part of your solution would be something like c("v1","v2"):=list(ifelse(id1==3,i.v1,v1),ifelse(id1==3,i.v2,v2)).Gaea
@dnlbrky, the problem with the multiple ifelse statements is that you have to evaluate id1==3 and ifelse numerous times. If you subset as part of the keying (a binary search) it will be far more efficient.Siple
Yes, @mnel, I should have explicitly stated that my attempt at extending the solution from @BlueMagister was not efficient, and that I was fishing for something better than my attempt. I like that this approach from @BlueMagister is flexible in terms of adding more conditions to the subset (for example, id1==3 | id==2). I've updated my question to include the scenario of multiple columns and multiple subset conditions.Gaea
I found it very helpful for left-joining on subset of left data.table. Data.table offers faster version of ifelse(), namely fifelse().Letsou
nomatch is redundant here (because update joins are automatically left-outer)Bantling
R
2

I have been thinking of this question these days. Following is my solution.

DT1[DT2, names(DT2):= DT2, on= 'id']

Or, if you don't want to add new variables to DT1 when there are variables private to DT2:

common.var <- intersect(names(DT1), names(DT2))
DT1[DT2, c(common.var) := DT2[, common.var, with= FALSE] ,on= 'id']
Rockies answered 16/11, 2020 at 15:21 Comment(1)
This isn't a solution to the question asked, which is about restricting the update join to a subset of DT1Bantling

© 2022 - 2024 — McMap. All rights reserved.