Join a dataframe with multiple dataframes stored in a list with different column

Asked 7/2, 2023 at 16:25 Answered 7/2, 2023 at 17:53

I have a dataframe (df1) and a list of dataframes (test) like below; I want to join df1 with each of the datafraems in test and populate a new column (X), while keeping all the other records intact.

read.table(text = "Fruits      A    B     C     D
                   Apple       10   1.3   NA    NA
                   Orange      0.2  NA    0.21  NA
                   Grape       NA   0.06  51    0.7
                   Grape       NA   0.06  51    0.7
                   Grape       1    0.06  51    0.7
                   Grape       NA   0.06  NA    0.8
                   Berry       11   20    0.3   0.04
                   Apple       NA   1.1   0.5   NA   
                   Apple       NA   1.2   0.5   NA
                   Apple       NA   1.3   0.1   NA
                   Berry       NA   NA    0.3   0.04
                   Berry       1    NA    0.9   0.01
                   Apple       1    1.3   0.5   NA
                   Apple       1    1.3   0.5   NA", 
            stringsAsFactors = FALSE, header = TRUE) -> df1

list(data.frame(Fruits = c("Apple"), A = 10, X = "oh"),
     data.frame(Fruits = c("Berry"), A = 11, B = 20, X = "duh")) -> test

Here's the expected output:

   Fruits    A     B     C    D    X
1   Apple 10.0  1.30    NA   NA   oh
2  Orange  0.2    NA  0.21   NA   NA   
3   Grape   NA  0.06 51.00 0.70   NA
4   Grape   NA  0.06 51.00 0.70   NA
5   Grape  1.0  0.06 51.00 0.70   NA
6   Grape   NA  0.06    NA 0.80   NA
7   Berry 11.0 20.00  0.30 0.04   duh
8   Apple   NA  1.10  0.50   NA   NA
9   Apple   NA  1.20  0.50   NA   NA
10  Apple   NA  1.30  0.10   NA   NA
11  Berry   NA    NA  0.30 0.04   NA
12  Berry  1.0    NA  0.90 0.01   NA
13  Apple  1.0  1.30  0.50   NA   NA
14  Apple  1.0  1.30  0.50   NA   NA

Simply looping through the dataframes within test does not work since it creates a dataframe for each ..._join and also creates duplicated rows for the second iteration. Maybe we can use a conditional mutate.

purrr::map(test, ~full_join(df1, .x))

It is more than likely I am missing something simple, but I don't want to join the outputs of full_join afterwards since my actual df1 has over 1M rows.

Valid answered 7/2, 2023 at 16:25 Comment(0)

I would use purrr::reduce() instead of map(). But this raises the issue that after the first iteration, X appears in both dataframes and is treated as a key. One workaround would be to give all the X columns unique names, then coalesce after joining.

library(dplyr)
library(purrr)

test2 <- imap(test, ~ rename(.x, "X{.y}" := X))

test2 %>% 
  reduce(full_join, .init = df1) %>% 
  mutate(X = coalesce(X1, X2), .keep = "unused")

   Fruits    A     B     C    D    X
1   Apple 10.0  1.30    NA   NA   oh
2  Orange  0.2    NA  0.21   NA <NA>
3   Grape   NA  0.06 51.00 0.70 <NA>
4   Grape   NA  0.06 51.00 0.70 <NA>
5   Grape  1.0  0.06 51.00 0.70 <NA>
6   Grape   NA  0.06    NA 0.80 <NA>
7   Berry 11.0 20.00  0.30 0.04  duh
8   Apple   NA  1.10  0.50   NA <NA>
9   Apple   NA  1.20  0.50   NA <NA>
10  Apple   NA  1.30  0.10   NA <NA>
11  Berry   NA    NA  0.30 0.04 <NA>
12  Berry  1.0    NA  0.90 0.01 <NA>
13  Apple  1.0  1.30  0.50   NA <NA>
14  Apple  1.0  1.30  0.50   NA <NA>

If test has a lot of elements, it'll be annoying to list out coalesce(X1, X2, X3, ..., Xn). In that case, you can use this alternative:

test2 %>% 
  reduce(full_join, .init = df1) %>% 
  mutate(X = coalesce(!!!syms(paste0("X", seq_along(test2)))), .keep = "unused")

Impassable answered 7/2, 2023 at 16:35 Comment(0)

We can use

library(powerjoin)
library(dplyr)
test %>% 
   reduce(power_full_join, .init = df1, conflict = coalesce_xy)

Or as @moodymudskipper mentioned in the comments, power_full_join does join recursively

power_full_join(df1, test, conflict = coalesce_xy)

Irritate answered 7/2, 2023 at 17:53 Comment(0)

It seems like this may be a good use of the new rows_update() method. You can iterate over the list of rows to update inside a reduce to apply them sequentially. For example

purrr::reduce(test, function(data, match) {
  rows_update(data, match, setdiff(names(match), "X"))
}, .init=data.frame(df1, X=NA_character_))

This adds a column named X which is all NA at first, and then at each iteration it updates the value of X using whatever overlapping columns there are that are not named "X".

Georas answered 7/2, 2023 at 16:43 Comment(3)

Error in `rows_check_key_df()`: ! `x` key values are not unique. – Valid 7/2, 2023 at 17:12

@Valid Did you get that error with the sample data provided? It worked for me. Tested with dplyr_1.1.0 – Georas 7/2, 2023 at 17:49

Nah, it works for the sample since there are only one row per each match. I already upvoted your answer, but I went with the other answer since it provides a solution for more complicated cases. Thank you. – Valid 7/2, 2023 at 17:51

library(dplyr)
library(purrr)

test2 <- imap(test, ~ rename(.x, "X{.y}" := X))

test2 %>% 
  reduce(full_join, .init = df1) %>% 
  mutate(X = coalesce(X1, X2), .keep = "unused")

   Fruits    A     B     C    D    X
1   Apple 10.0  1.30    NA   NA   oh
2  Orange  0.2    NA  0.21   NA <NA>
3   Grape   NA  0.06 51.00 0.70 <NA>
4   Grape   NA  0.06 51.00 0.70 <NA>
5   Grape  1.0  0.06 51.00 0.70 <NA>
6   Grape   NA  0.06    NA 0.80 <NA>
7   Berry 11.0 20.00  0.30 0.04  duh
8   Apple   NA  1.10  0.50   NA <NA>
9   Apple   NA  1.20  0.50   NA <NA>
10  Apple   NA  1.30  0.10   NA <NA>
11  Berry   NA    NA  0.30 0.04 <NA>
12  Berry  1.0    NA  0.90 0.01 <NA>
13  Apple  1.0  1.30  0.50   NA <NA>
14  Apple  1.0  1.30  0.50   NA <NA>

If test has a lot of elements, it'll be annoying to list out coalesce(X1, X2, X3, ..., Xn). In that case, you can use this alternative:

test2 %>% 
  reduce(full_join, .init = df1) %>% 
  mutate(X = coalesce(!!!syms(paste0("X", seq_along(test2)))), .keep = "unused")

Impassable answered 7/2, 2023 at 16:35 Comment(0)

Recommended topics

Hot tags