Is there a _merge indicator available after a merge?
Asked Answered
F

3

14

Is there a way to get the equivalent of a _merge indicator variable after a merge in dplyr?

Something similar to Pandas' indicator = True option that essentially tells you how the merge went (how many matches from each dataset, etc).

Here is an example in Pandas

import pandas as pd

df1 = pd.DataFrame({'key1' : ['a','b','c'], 'v1' : [1,2,3]})
df2 = pd.DataFrame({'key1' : ['a','b','d'], 'v2' : [4,5,6]})

match = df1.merge(df2, how = 'left', indicator = True)

Here, after a left join between df1 and df2, you want to immediately know how many rows in df1 found a match in df2 and how many of them did not

match
Out[53]: 
  key1  v1   v2     _merge
0    a   1  4.0       both
1    b   2  5.0       both
2    c   3  NaN  left_only

and I can tabulate this merge variable:

match._merge.value_counts()
Out[52]: 
both          2
left_only     1
right_only    0
Name: _merge, dtype: int64

I don't see any option available after a, say, left join in dplyr

key1 = c('a','b','c')
v1 = c(1,2,3)
key2 = c('a','b','d')
v2 = c(4,5,6)
df1 = data.frame(key1,v1)
df2 = data.frame(key2,v2)

> left_join(df1,df2, by = c('key1' = 'key2'))
  key1 v1 v2
1    a  1  4
2    b  2  5
3    c  3 NA

Am I missing something here? Thanks!

Fredric answered 18/10, 2016 at 14:15 Comment(2)
I guess you can create an indicator variable in 'x' and 'y' i.e. x$merge <- 1; y$merge; left_join(x,y, by = "key") if I understand the problem.Dunnock
with a left join you wouldn't expect to see any right onlySchuss
D
3

We create the 'merge' column based on inner_join, anti_join and then bind the rows with bind_rows

d1 <- inner_join(df1, df2, by = c('key1' = 'key2')) %>%
                    mutate(merge = "both")  
bind_rows(d1, anti_join(df1, df2, by = c('key1' = 'key2')) %>% 
             mutate(merge = 'left_only'))
Dunnock answered 18/10, 2016 at 14:54 Comment(2)
thanks that great but I believe this should be included as an option. thanks!Chewning
@Noobie You can raise a feature request in dplyr to include that in the next releaseDunnock
M
8

Stata similarly creates a new variable _merge when doing any type of merge or join. I too find it helpful to have as an option in order to diagnose a merge quickly after performing it.

For the last few months I've been using basic functions I wrote that simply embellish the dplyr joins. There are probably more efficient ways of doing this, but here is an example of one that embellishes full_join. If you set the option .merge = T you'll get a variable, called .merge that is similar to _merge in Stata or Pandas. (This also just prints off a diagnostic message about how many cases matched and didn't match each time you use it.) I know you already have an answer to the question, but if you want a function you can use repeatedly that works identically to full_join in dplyr, here is a start. You obviously need dplyr loaded to make this work...

full_join_track <- function(x, y, by = NULL, suffix = c(".x", ".y"),
                        .merge = FALSE, ...){

# Checking to make sure used variable names are not already in use
if(".x_tracker" %in% names(x)){
    message("Warning: variable .x_tracker in left data was dropped")
}
if(".y_tracker" %in% names(y)){
    message("Warning: variable .y_tracker in right data was dropped")
}
if(.merge & (".merge" %in% names(x) | ".merge" %in% names(y))){
    stop("Variable .merge already exists; change name before proceeding")
}

# Adding simple merge tracker variables to data frames
x[, ".x_tracker"] <- 1
y[, ".y_tracker"] <- 1

# Doing full join
joined <- full_join(x, y, by = by, suffix = suffix,  ...)

# Calculating merge diagnoses 
matched <- joined %>%
    filter(!is.na(.x_tracker) & !is.na(.y_tracker)) %>%
    NROW()
unmatched_x <- joined %>%
    filter(!is.na(.x_tracker) & is.na(.y_tracker)) %>%
    NROW()
unmatched_y <- joined %>%
    filter(is.na(.x_tracker) & !is.na(.y_tracker)) %>%
    NROW()

# Print merge diagnoses
message(
    unmatched_x, " Rows ONLY from left data frame", "\n",
    unmatched_y, " Rows ONLY from right data frame", "\n",
    matched, " Rows matched"
)

# Create .merge variable if specified
if(.merge){
    joined <- joined %>%
        mutate(.merge = 
                   case_when(
                       !is.na(.$.x_tracker) & is.na(.$.y_tracker) ~ "left_only",
                       is.na(.$.x_tracker) & !is.na(.$.y_tracker) ~ "right_only",
                       TRUE ~ "matched"
                       )
               )
}

# Dropping tracker variables and returning data frame
joined <- joined %>%
    select(-.x_tracker, -.y_tracker)
return(joined)
}

As an example:

data1 <- data.frame(x = 1:10, y = rnorm(10))
data2 <- data.frame(x = 4:20, z = rnorm(17))
full_join_track(data1, data2, .merge = T)
Mascot answered 18/10, 2016 at 19:52 Comment(3)
This is pretty great. Thanks!Preeminence
@mmann1123. I actually put this function and a couple others into a package called reviewr that you can access on my github page. It has some more functionality than the function above (it's called full_join_qc) there and the package contains other functions that print off diagnostics that I find helpful. For example, a function called filter_qc prints the number of rows dropped when performing a filter; mutate_qc prints the number of NAs created when making a new variable.Mascot
If you download the whole package you can access the vignette I wrote that shows how to use these functions. And each function has help documentation. In short, they all work identically to their dplyr counterparts but print more diagnostic information and in some cases create new variables for you like the equivalent of Stata's _merge. To download the package, use devtools::install_github("adamMaier/reviewr", build_vignettes = T). Once you do that you can add te package to your library like any others: library(reviewr)Mascot
D
3

We create the 'merge' column based on inner_join, anti_join and then bind the rows with bind_rows

d1 <- inner_join(df1, df2, by = c('key1' = 'key2')) %>%
                    mutate(merge = "both")  
bind_rows(d1, anti_join(df1, df2, by = c('key1' = 'key2')) %>% 
             mutate(merge = 'left_only'))
Dunnock answered 18/10, 2016 at 14:54 Comment(2)
thanks that great but I believe this should be included as an option. thanks!Chewning
@Noobie You can raise a feature request in dplyr to include that in the next releaseDunnock
C
0

I'm pretty stunned that what the OP is asking for isn't already an option. [sad face] I'm new to R, so I like this really simple solution. Here's an example:

# create 2 data frames to merge  
set.seed(12345)
dfx <- data.frame(x  = sample(1:20, size=10))
dfy <- data.frame(x  = sample(1:20, size=10))

df<-merge(x=dfx %>% mutate(ix=1),
          y=dfy %>% mutate(iy=1),
          on='x',
          all=TRUE) %>% 
  # now construct a merge indicator: 
  mutate(mer= case_when(ix==1&iy==1~'both',ix==1~'only_x',iy==1~'only_y')) 

df %>% count(mer)

#this produces a tabulation that looks like this: 
     mer n
1   both 4
2 only_x 6
3 only_y 6

Just remember that you might want to drop the columns that got generated (ix, iy, and mer)

Cheyennecheyne answered 9/7 at 15:25 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.