How to remove duplicate columns after dplyr join?
Asked Answered
F

7

9

Consider two dataframes, df1 and df2.
df1 has columns id, a, b.
df2 has columns id, a, c.

I want to perform a left join such that the combined dataframe has columns id, a, b, c.

combined <- df1 %>% left_join(df2, by="id")

But in the combined dataframe, the columns are id, a.x, b, a.y, c.

I can include "a" in the join key (i.e: left_join(df1, df2, by=c("id", "a"))) but there are too many of columns like a. I want to join only by the primary key id and drop all the duplicated columns in df2.

Fusiform answered 6/5, 2020 at 5:21 Comment(0)
B
23

I like to do things in as few steps as possible. I think this would reduce the numbers of steps:

combine<-df1%>%
  left_join(df2, by="id", suffix=c("",".y")%>%
  select(-ends_with(".y"))

The minus sign in the select command means that you want to select everything EXCEPT those variables. If you want to delete ALL of the duplicated columns (no column a at all), you could do this:

combine<-df1%>%
  left_join(df2, by="id", suffix=c(".x",".y")%>%
  select(-ends_with(".x"),-ends_with(".y"))
Beare answered 6/10, 2021 at 20:31 Comment(1)
This is the best solution, so it should be the accepted answer. Between, you forgot a closing parenthesis before the last pipe symbol.Lichter
F
5

First we perform the join by id

combined <- df1 %>% left_join(df2, by="id")

Then we rename those with .x and drop those with .y

combined <- combined %>% 
  rename_at(
    vars(ends_with(".x")),
    ~str_replace(., "\\..$","")
  ) %>% 
  select_at(
    vars(-ends_with(".y"))
  )
Fusiform answered 6/5, 2020 at 5:21 Comment(1)
You can skip the rename of the .x variables by using left_join(df2, by = "id", suffix = c("", ".y")Hinkle
F
4

The more generic approach would be to drop the columns before left join otherwise your combined dataset can be very large initially:

df1<- data.frame(id= seq(1:0), a=rnorm(1:10,0.2),b=rpois(10,0.2))
df2<- data.frame(id= seq(1:0), a=rnorm(1:10,0.2),c=rnorm(10,0.2))

varList<- names(df2)[!(names(df2) %in% names(df1))] # get non common names
varList<- c(varList,"id") # appending key parameter

combined <- df1 %>% left_join((df2 %>% select(varList)), by="id")

combined dataset will not have any .x or .y

Fredkin answered 6/5, 2020 at 5:41 Comment(1)
combined <- df1 %>% left_join((df2 %>% select(all_of(varList)), by="id") considering: tidyselect.r-lib.org/reference/faq-external-vector.htmlMarable
H
1

I think this is the simplest way to achieve what you're trying to do

df <- left_join(df1, df2, by = "id", suffix = c("", ".annoying_duplicate_column")) %>%
  select(-ends_with(".annoying_duplicate_column"))

(Combining the answer by @Ernest Han with the very helpful comment by @David T above)

Head answered 1/8, 2021 at 16:3 Comment(0)
F
1

A simply solution could be:

df <- dplyr::inner_join(
   df1,
   dplyr::select(df2, -any_of(names(df1)), id),
   by = "id"
)
  • names(df1) will create a vector of all the df1 names.

  • any_of() is a tidyselect helper that will select any columns that are contained in df2. (See https://tidyselect.r-lib.org/reference/all_of.html).

  • The "-" before any_of() means remove that column(s).

    In other words, it will remove the columns that are already present in df2.

  • Add the primary key after the any_of() to keep the "id" variable in df2.

Fighter answered 28/3, 2023 at 9:28 Comment(0)
C
0

By "there are too many of columns like a" do you mean you want to find all the columns which are common to both sources? In that case, why not literally use the intersection (the default behaviour)?

## two data.frames, only id = 3, a = 4 matches
(df1 <- data.frame(id = 1:3, a = 2:4, b = 3:5))
#>   id a b
#> 1  1 2 3
#> 2  2 3 4
#> 3  3 4 5
(df2 <- data.frame(id = 3:2, a = 4:5, c = 1:2))
#>   id a c
#> 1  3 4 1
#> 2  2 5 2

## this produces a.x and a.y                  
dplyr::left_join(df1, df2, by = "id")
#>   id a.x b a.y  c
#> 1  1   2 3  NA NA
#> 2  2   3 4   5  2
#> 3  3   4 5   4  1

## which columns are common?
intersect(names(df1), names(df2))
#> [1] "id" "a"

## this produces id, a, b, c
dplyr::left_join(df1, df2, by = intersect(names(df1), names(df2)))
#>   id a b  c
#> 1  1 2 3 NA
#> 2  2 3 4 NA
#> 3  3 4 5  1

## this is, however, the default behaviour for left_join
## i.e. use all columns which are present in both
dplyr::left_join(df1, df2)
#> Joining, by = c("id", "a")
#>   id a b  c
#> 1  1 2 3 NA
#> 2  2 3 4 NA
#> 3  3 4 5  1

Created on 2020-05-06 by the reprex package (v0.3.0)

Connatural answered 6/5, 2020 at 5:40 Comment(2)
The poster doesn't want to add at the common variable level, he just wants "id" as key for left join.Fredkin
This solution works fine (though less efficient) if those columns actually match. But they may not, in which case dropping them is needed.Straightedge
R
0

Great question!

I find removing the duplicate columns and or renaming after the join to be super cumbersome.

Here is a simple solution, similar to solution from @Andrés González Demori's solution.

from the first dataset(df1); select those columns not in second dataset(df2) and leave the key column to join on; leave second dataset(df2) columns as they are

library(tidyverse)

merge(
x = df1 %>% 
select(df1 %>% select_if(!names(.) %in% (names(df2))) %>% names, "keycolumn")  ,
y = df2,
by = "keycolumn"
)

This solution works if you use other functions (inner_join, left_join or merge as used in this example).

Hope this helps!

Rozamond answered 7/5 at 17:37 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.