How to merge two dataframes using multiple columns as key?
Asked Answered
D

2

5

Say I have the following dataframes:

DF1 <- data.frame("A" = rep(c("A","B"), 18),
                  "B" = rep(c("C","D","E"), 12),
                  "NUM"= rep(rnorm(36,10,1)),
                  "TEST" = rep(NA,36))

DF2 <- data.frame("A" = rep("A",6),
                  "B" = rep(c("C","D"),6),
                  "VAL" = rep(c(1,3),3))

*Note: Each unique combination of variables A and B in DF2 should have a unique VAL.

For each row, I would like to replace the NA in TEST with the corresponding value of VAL in DF1 if the values in columns A and A match and the values in columns B and B match for that row. Otherwise, I'd leave TEST as NA. How would I do this without looping through each combination using match?

Ideally, an answer would scale to two data frames with many columns to match upon.

Domingadomingo answered 23/3, 2015 at 14:22 Comment(6)
I think in your DF2 there are multiple rows that could match up to DF1 1st and 2nd columns. So, I am not sure which VAL should be taken. Also, based on the example, the dimensions of the two datasets are not the same.Hailee
Sorry, each unique combination in DF2 should have a unique VAL. I need to figure out a way to encode that in the example.Domingadomingo
In that case, there would be only two rows for DF2 i.e. DF2 <- data.frame(A=rep('A',2), B=c('C', 'D'), VAL=rnorm(2)) and merge will get the result merge(DF1, DF2, all=TRUE)Hailee
@goldisfine: That doesn't clarify things much. We need to know which VAL should be chosen for a given combination of A and B since there are multiple.Hydrocellulose
Now there is a unique value for each combination in DF2 so I think that there is only one possible option for VAL for each row in DF1. I guess DF2 could be reduced to only the unique combinations of the predictors.Domingadomingo
In Python pandas, you could just simply set a compound key using variables A,B. Then this is a simple left-join/merge.Scleroprotein
C
6

As Akrun mentioned in comments, your lookup table (DF2) needs to be reduced to just its unique A/B combinations. For your current dataframe, this isn't a problem, but you will need additional rules if there are multiple possible values for the same combination. From there, the solution is easy:

DF2.u <- unique(DF2)
DF3 <- merge(DF1, DF2.u, all = T)

Note that this will produce a new dataframe with an empty TEST column (all values NA), and a VAL column assigned from DF2. To do exactly what you wanted (replace TEST with VAL where possible), here is some slightly clunkier code:

DF1$TEST <- merge(DF1, DF2.u, all = T)$VAL

EDIT: in response to your question, you can boil down DF2 if necessary quite simple:

DF2$C <- c(1:12) #now unique() won't work
DF2.u <- unique(DF2[1:3])

 A B VAL
1 A C   1
2 A D   3
Corkwood answered 23/3, 2015 at 14:53 Comment(3)
In my actual example, I have columns in DF2 which are numeric, and so if I passed the entire DF then I would have a unique DF that was too large. Can I subset DF2 to rows which are only unique in the factor variables?Domingadomingo
Of course, the actual indices you take will vary based on what your extra columns are and how they are arrangedCorkwood
In Python pandas, you could just simply set a compound key using variables A,B. Then this is a simple left-join/merge.Scleroprotein
I
11
# this is your DF1    
DF1 <- data.frame("A" = rep(c("A","B"), 18),
                      "B" = rep(c("C","D","E"), 12),
                      "NUM"= rep(rnorm(36,10,1)),
                      "TEST" = rep(NA,36))

#this is a DF2 i created, with unique A, B, VAL
DF2 <- data.frame("A" = rep(c("A","B"),3),
                  "B" = rep(c("C","D","E"),2),
                  "VAL" = rep(1:6))

# and this is the answer of what i assume you want      
tmp <- merge(DF1,DF2, by=c("A","B"), all.x=TRUE, all.y=FALSE)
DF1[4] <- tmp[5]
Indefatigable answered 23/3, 2015 at 14:52 Comment(0)
C
6

As Akrun mentioned in comments, your lookup table (DF2) needs to be reduced to just its unique A/B combinations. For your current dataframe, this isn't a problem, but you will need additional rules if there are multiple possible values for the same combination. From there, the solution is easy:

DF2.u <- unique(DF2)
DF3 <- merge(DF1, DF2.u, all = T)

Note that this will produce a new dataframe with an empty TEST column (all values NA), and a VAL column assigned from DF2. To do exactly what you wanted (replace TEST with VAL where possible), here is some slightly clunkier code:

DF1$TEST <- merge(DF1, DF2.u, all = T)$VAL

EDIT: in response to your question, you can boil down DF2 if necessary quite simple:

DF2$C <- c(1:12) #now unique() won't work
DF2.u <- unique(DF2[1:3])

 A B VAL
1 A C   1
2 A D   3
Corkwood answered 23/3, 2015 at 14:53 Comment(3)
In my actual example, I have columns in DF2 which are numeric, and so if I passed the entire DF then I would have a unique DF that was too large. Can I subset DF2 to rows which are only unique in the factor variables?Domingadomingo
Of course, the actual indices you take will vary based on what your extra columns are and how they are arrangedCorkwood
In Python pandas, you could just simply set a compound key using variables A,B. Then this is a simple left-join/merge.Scleroprotein

© 2022 - 2024 — McMap. All rights reserved.