left_join two data frames and overwrite
Asked Answered
D

4

17

I'd like to merge two data frames where df2 overwrites any values that are NA or present in df1. Merge data frames and overwrite values provides a data.table option, but I'd like to know if there is a way to do this with dplyr. I've tried all of the _join options but none seem to do this. Is there a way to do this with dplyr?

Here is an example:

df1 <- data.frame(y = c("A", "B", "C", "D"), x1 = c(1,2,NA, 4)) 
df2 <- data.frame(y = c("A", "B", "C"), x1 = c(5, 6, 7))

Desired output:

  y x1
1 A  5
2 B  6
3 C  7
4 D  4
Duly answered 25/2, 2016 at 20:0 Comment(2)
left_join(df1, df2, by="y") %>% transmute(y, x1 = ifelse(is.na(x1.y), x1.x, x1.y))? Perhaps this could be added to the other answer and this one closed as dupe? (The other is not a data.table specific question)Xylene
I think the previous comment might drop any columns that are not y or x1 from the final output, which is most likely undesirable. So here's a slight change so it only updates the x1 values and leaves df1 otherwise intact: left_join(df1, df2, by="y") %>% mutate(x1 = ifelse(is.na(x1.y), x1.x, x1.y)) %>% select(!c(x1.x, x1.y))Maloriemalory
E
12

I think what you want is to keep the values of df2 and only add the ones in df1 that are not present in df2 which is what anti_join does:

"anti_join return all rows from x where there are not matching values in y, keeping just columns from x."

My solution:

df3 <- anti_join(df1, df2, by = "y") %>% bind_rows(df2)

Warning messages:
1: In anti_join_impl(x, y, by$x, by$y) :
  joining factors with different levels, coercing to character vector
2: In rbind_all(x, .id) : Unequal factor levels: coercing to character

> df3
Source: local data frame [4 x 2]

      y    x1
  (chr) (dbl)
1     D     4
2     A     5
3     B     6
4     C     7

this line gives the desired output (in a different order) but, you should pay attention to the warning message, when working with your dataset be sure to read y as a character variable.

Ethiop answered 26/2, 2016 at 3:34 Comment(1)
Thanks! Finally! My background is not in computer science and it was oddly hard for me to find this answer. What should I have been looking for? R merge dataframes with priority or R complete dataframe from other dataframe if values not present or R merge without overwrite or R add rows if not present etc... were all unsuccessful...Tollman
S
3

This is the idiom I now use, as, in addition, it handles keeping columns that are not part of the update table. I use some different names than from the OP, but the flavor is similar.

The one thing I do is create a variable for the keys used in the join, as I use that in a few spots. But otherwise, it does what is desired.

In itself it doesn't handle the action of, for example, "update this row if a value is NA", but you should exercise that condition when creating the join table.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

.keys <- c("key1", "key2")

.base_table <- tribble(
    ~key1, ~key2, ~val1, ~val2,
    "A", "a", 0, 0,
    "A", "b", 0, 1,
    "B", "a", 1, 0,
    "B", "b", 1, 1)

.join_table <- tribble(
    ~key1, ~key2, ~val2,
    "A", "b", 100,
    "B", "a", 111)

# This works
df_result <- .base_table %>%
    # Pull off rows from base table that match the join table
    semi_join(.join_table, .keys) %>%
    # Drop cols from base table that are in join table, except for the key columns
    select(-matches(setdiff(names(.join_table), .keys))) %>%
    # Left join on the join table columns
    left_join(.join_table, .keys) %>%
    # Remove the matching rows from the base table, and bind on the newly joined result from above.
    bind_rows(.base_table %>% anti_join(.join_table, .keys))

df_result %>%
    print()
#> # A tibble: 4 x 4
#>   key1  key2   val1  val2
#>   <chr> <chr> <dbl> <dbl>
#> 1 A     b         0   100
#> 2 B     a         1   111
#> 3 A     a         0     0
#> 4 B     b         1     1

Created on 2019-12-12 by the reprex package (v0.3.0)

Solingen answered 12/12, 2019 at 19:31 Comment(0)
S
1

The {dplyr} package now has built-in tools to do this and the like... See here: https://dplyr.tidyverse.org/reference/rows.html

library(dplyr)

df1 <- data.frame(y = c("A", "B", "C", "D"), x1 = c(1,2,NA, 4)) 
df2 <- data.frame(y = c("A", "B", "C"), x1 = c(5, 6, 7))

rows_update(df1, df2, "y")
#>   y x1
#> 1 A  5
#> 2 B  6
#> 3 C  7
#> 4 D  4

Created on 2024-06-10 with reprex v2.1.0

Solingen answered 10/6, 2024 at 19:48 Comment(0)
D
0

Here is my modification of the previous answer for a slightly different situation.

Suppose I have a dataset (iris) where I wish to overwrite some values in some columns ("Sepal.Length", "Sepal.Width") and keep the rest, while using another dataset as a source of data, and use one variable (Species) to determine which values to overwrite.

# have a look
unique(iris$Sepal.Length[iris$Species == "setosa"])

# create table with values and conditions to overwrite
iris_fix <- tribble(
  ~Sepal.Length, ~Sepal.Width, ~Species,
  40,            20,           "setosa",
  50,            30,           "virginica")
# note that I am keeping "versicolor" untouched

# variables to overwrite
keys <- c("Sepal.Length", "Sepal.Width")

# subset the original by condition
df_result <- iris %>%
    semi_join(iris_fix, by = "Species") 

# remove columns with values that need to be replaced
# keep the rest
df_result <- df_result %>%
    select(-keys) 
    
# add new values to the subset 
df_result <- df_result %>%
    left_join(iris_fix, by = "Species") 

# now remove subset we want to replace from the original
# and add our intermediate table with updated values
df_result <- df_result %>%
   bind_rows(iris %>% anti_join(iris_fix, by = "Species"))

#check
unique(df_result$Sepal.Length[df_result$Species == "setosa"])
Dyann answered 11/3, 2024 at 2:6 Comment(0)

© 2022 - 2025 — McMap. All rights reserved.