Is it possible to do a full join in dplyr and keep all the columns used in the join?
Asked Answered
N

4

5

I have two tables that I want to do a full join using dplyr, but I don't want it to drop any of the columns. Per the documentation and my own experience it is only keeping the join column for the left hand side. This is a problem when you have a row with a record for the right hand side since the join value is gone.

For example, suppose I have the two tables a and b,

customerId | revenue               customerId | state
-----------|---------              -----------|-------
    1      | 2000                       1     |  CA
    2      | 3000                       3     |  GA
    4      | 4000                       4     |  NY

doing something like full_join(a, b, by="customerId") will produce

customerId | revenue | state
-----------|---------|-------
    1      |   2000  |  CA
    2      |   3000  | <NA>
   <NA>    |   <NA>  |  GA
    4      |   4000  |  NY

so there is no way to tell which customer that third row is from. The ideal output would be

customerId.a | customerId.b | revenue | state
-------------|--------------|---------|-------
      1      |      1       |   2000  |  CA
      2      |     <NA>     |   3000  | <NA>
    <NA>     |      3       |   <NA>  |  GA
      4      |      4       |   4000  |  NY

note that this is just a toy example. I'm actually using sparklyr so this is all being run in Spark. Thus, merge won't work here for me. Is there a way to do what I'm looking for in dplyr?

EDIT: As someone pointed out this actually is working as desired in dplyr itself locally. However, I do see this problem using sparklyr (which uses dplyr). Here is the code to see that:

library(sparklyr)
sc <- spark_connect("local[4]")
d1 <- data_frame(customerId = c("1","2","4"), revenue=c(2000,3000,4000))
d2 <- data_frame(customerId = c("1","3","4"), state=c("CA", "GA", "NY"))
d1_tbl <- copy_to(sc, d1)
d2_tbl <- copy_to(sc, d2)
full_join(d1_tbl, d2_tbl, by=c("customerId"))
Never answered 5/5, 2017 at 15:52 Comment(1)
Filed an issue here: github.com/rstudio/sparklyr/issues/663. Thanks!Genovese
C
5

I can't reproduce your problem. All the IDs should be (and are) included in the full join.

library(data_frame)
d1 <- data_frame(
  customerId = c(1, 2, 4),
  revenue = c(2000, 3000, 4000)
)
d2 <- data_frame(
  customerId = c(1, 3, 4),
  state = c("CA", "GA", "NY")
)

full_join(d1, d2, by = "customerId")
## # A tibble: 4 × 3
##   customerId revenue state
##        <dbl>   <dbl> <chr>
## 1          1    2000    CA
## 2          2    3000  <NA>
## 3          4    4000    NY
## 4          3      NA    GA

Update: I can reproduce the problem using sparklyr. It's weird behavior, so you might want to file an issue. (Unclear whether the problem is with sparklyr or dplyr or DBI or Spark SQL though.)

Using explain(), you can see the SQL that was generated.

full_join(d1_tbl, d2_tbl, by=c("customerId")) %>% explain()

You can try running a custom SQL query to get what you want, though it's a little messier.

library(DBI)
qry <- "SELECT 
    d1.customerID AS customerID1, 
    d2.customerID AS customerID2, 
    d1.revenue, 
    d2.state 
  FROM d1 
  FULL JOIN d2 
    ON d1.customerId = d2.customerId"
dbGetQuery(sc, qry)  
##   customerID1 customerID2 revenue state
## 1           1           1    2000    CA
## 2           2        <NA>    3000  <NA>
## 3        <NA>           3     NaN    GA
## 4           4           4    4000    NY
Celestina answered 5/5, 2017 at 15:57 Comment(1)
Thank you for pointing that out. I should have tried it myself first. Very interesting. This is definitely not the behavior I'm seeing using sparklyr. I'll create a code example to demonstrate and update my question.Never
O
10

You can create separate identical customerId for both data frames before join:

full_join(
    mutate(a, customerId.a = customerId), 
    mutate(b, customerId.b = customerId), 
    by="customerId"
) %>% select(-customerId)

#  revenue customerId.a state customerId.b
#1    2000            1    CA            1
#2    3000            2  <NA>           NA
#3    4000            4    NY            4
#4      NA           NA    GA            3
Opponent answered 5/5, 2017 at 16:0 Comment(2)
I like this idea and may give it a try as a work around. However as Richie points out this does work inside dplyr itself. The behavior is different though using sparklyr.Never
This should be the accepted answer - the question asks about keeping all of the columns from both tables, which you would need if you wanted to check which rows didn't join, not just one column that contains all of the values of the join key. This answer is by far the simplest way to get to that.Bogie
C
5

I can't reproduce your problem. All the IDs should be (and are) included in the full join.

library(data_frame)
d1 <- data_frame(
  customerId = c(1, 2, 4),
  revenue = c(2000, 3000, 4000)
)
d2 <- data_frame(
  customerId = c(1, 3, 4),
  state = c("CA", "GA", "NY")
)

full_join(d1, d2, by = "customerId")
## # A tibble: 4 × 3
##   customerId revenue state
##        <dbl>   <dbl> <chr>
## 1          1    2000    CA
## 2          2    3000  <NA>
## 3          4    4000    NY
## 4          3      NA    GA

Update: I can reproduce the problem using sparklyr. It's weird behavior, so you might want to file an issue. (Unclear whether the problem is with sparklyr or dplyr or DBI or Spark SQL though.)

Using explain(), you can see the SQL that was generated.

full_join(d1_tbl, d2_tbl, by=c("customerId")) %>% explain()

You can try running a custom SQL query to get what you want, though it's a little messier.

library(DBI)
qry <- "SELECT 
    d1.customerID AS customerID1, 
    d2.customerID AS customerID2, 
    d1.revenue, 
    d2.state 
  FROM d1 
  FULL JOIN d2 
    ON d1.customerId = d2.customerId"
dbGetQuery(sc, qry)  
##   customerID1 customerID2 revenue state
## 1           1           1    2000    CA
## 2           2        <NA>    3000  <NA>
## 3        <NA>           3     NaN    GA
## 4           4           4    4000    NY
Celestina answered 5/5, 2017 at 15:57 Comment(1)
Thank you for pointing that out. I should have tried it myself first. Very interesting. This is definitely not the behavior I'm seeing using sparklyr. I'll create a code example to demonstrate and update my question.Never
G
2

This has been fixed

> full_join(d1_tbl, d2_tbl, by="customerId")
# Source:   lazy query [?? x 3]
# Database: spark_connection
  customerId revenue state
       <chr>   <dbl> <chr>
1          1    2000    CA
2          3     NaN    GA
3          2    3000  <NA>
4          4    4000    NY
Genovese answered 16/5, 2017 at 18:5 Comment(0)
S
0

Just use the parameter keep in dplyr::full_join().

full_join(a, b, by="customerId",keep = TRUE)
Spiritualist answered 1/10, 2021 at 23:32 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.