Simultaneously merge multiple data.frames in a list

9

343

I have a list of many data.frames that I want to merge. The issue here is that each data.frame differs in terms of the number of rows and columns, but they all share the key variables (which I've called "var1" and "var2" in the code below). If the data.frames were identical in terms of columns, I could merely rbind, for which plyr's rbind.fill would do the job, but that's not the case with these data.

Because the merge command only works on 2 data.frames, I turned to the Internet for ideas. I got this one from here, which worked perfectly in R 2.7.2, which is what I had at the time:

merge.rec <- function(.list, ...){
    if(length(.list)==1) return(.list[[1]])
    Recall(c(list(merge(.list[[1]], .list[[2]], ...)), .list[-(1:2)]), ...)
}

And I would call the function like so:

df <- merge.rec(my.list, by.x = c("var1", "var2"), 
                by.y = c("var1", "var2"), all = T, suffixes=c("", ""))

But in any R version after 2.7.2, including 2.11 and 2.12, this code fails with the following error:

Error in match.names(clabs, names(xi)) : 
  names do not match previous names

(Incidentally, I see other references to this error elsewhere with no resolution).

Is there any way to solve this?

Cherenkov answered 11/11, 2011 at 8:16 Comment(0)
321

Another question asked specifically how to perform multiple left joins using dplyr in R. The question was marked as a duplicate of this one, so I answer here, using the three sample data frames below:

x <- data.frame(i = c("a","b","c"), j = 1:3, stringsAsFactors=FALSE)
y <- data.frame(i = c("b","c","d"), k = 4:6, stringsAsFactors=FALSE)
z <- data.frame(i = c("c","d","a"), l = 7:9, stringsAsFactors=FALSE)

The answer is divided into three sections representing three different ways to perform the merge. You probably want to use the purrr way if you are already using the tidyverse packages. For comparison purposes, you'll also find a base R version below using the same sample dataset.


1) Join them with reduce from the purrr package:

The purrr package provides a reduce function which has a concise syntax:

library(tidyverse)
list(x, y, z) %>% reduce(left_join, by = "i")
# A tibble: 3 x 4
#   i         j     k     l
#   <chr> <int> <int> <int>
# 1 a         1    NA     9
# 2 b         2     4    NA
# 3 c         3     5     7

You can also perform other joins, such as a full_join or inner_join:

list(x, y, z) %>% reduce(full_join, by = "i")
# A tibble: 4 x 4
#   i         j     k     l
#   <chr> <int> <int> <int>
# 1 a         1    NA     9
# 2 b         2     4    NA
# 3 c         3     5     7
# 4 d        NA     6     8

list(x, y, z) %>% reduce(inner_join, by = "i")
# A tibble: 1 x 4
#   i         j     k     l
#   <chr> <int> <int> <int>
# 1 c         3     5     7

2) dplyr::left_join() with base R Reduce():

list(x,y,z) %>%
    Reduce(function(dtf1,dtf2) left_join(dtf1,dtf2,by="i"), .)

#   i j  k  l
# 1 a 1 NA  9
# 2 b 2  4 NA
# 3 c 3  5  7

3) Base R merge() with base R Reduce():

And for comparison purposes, here is a base R version of the left join based on Charles's answer.

Reduce(function(dtf1, dtf2) merge(dtf1, dtf2, by = "i", all.x = TRUE),
       list(x, y, z))
#   i j  k  l
# 1 a 1 NA  9
# 2 b 2  4 NA
# 3 c 3  5  7
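
The same base R pattern extends to the original question's case of two key columns and a full outer join. The following is a minimal sketch, assuming hypothetical data frames df1, df2 and df3 that share the keys var1 and var2; the column names p, q and r are made up for illustration.

```r
# Hypothetical data: three data frames that share the keys var1 and var2
# but differ in their remaining columns and in their number of rows.
df1 <- data.frame(var1 = c(1, 1, 2), var2 = c("a", "b", "a"), p = 1:3,
                  stringsAsFactors = FALSE)
df2 <- data.frame(var1 = c(1, 2, 2), var2 = c("a", "a", "b"), q = 4:6,
                  stringsAsFactors = FALSE)
df3 <- data.frame(var1 = c(1, 3), var2 = c("b", "a"), r = 7:8,
                  stringsAsFactors = FALSE)

keys <- c("var1", "var2")

# all = TRUE keeps every key combination found in any of the data frames.
merged <- Reduce(function(a, b) merge(a, b, by = keys, all = TRUE),
                 list(df1, df2, df3))
print(merged)
```

Passing the keys through by (rather than by.x and by.y) is enough here because the key columns have the same names in every data frame.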
Goins answered 21/12, 2015 at 10:22 Comment(6)
The full_join variant works perfectly, and looks a lot less scary than the accepted answer. Not much of a speed difference, though. - Cherenkov
@Axeman is right, but you might be able to avoid (visibly) returning a list of data frames at all by using map_dfr() or map_dfc(). - Volley
I thought I could join a number of DFs based on a pattern using ls(pattern = "DF_name_contains_this"), but no. I used noquote(paste()), but I was still producing a character vector instead of a list of DFs. I ended up typing the names, which is obnoxious. - Afrit
Another question provides a python implementation: list of pandas data frames dfs = [df1, df2, df3] then reduce(pandas.merge, dfs). - Goins
How can you add a suffix to avoid the automatic appending of ".y" or ".x"? - Lancet
@jgarces see the suffix section in help(left_join). - Goins
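
On the comment about ls(pattern = ...): ls() returns only the names as a character vector, and base R's mget() can turn those names into a named list of the objects themselves. A minimal sketch, assuming two hypothetical data frames whose names start with "DF_":

```r
# Hypothetical data frames whose names share the prefix "DF_".
DF_a <- data.frame(i = c("a", "b"), j = 1:2, stringsAsFactors = FALSE)
DF_b <- data.frame(i = c("b", "c"), k = 3:4, stringsAsFactors = FALSE)

# Look the names up in the environment where the data frames were created.
env <- environment()
df_names <- ls(envir = env, pattern = "^DF_")  # character vector of names
df_list <- mget(df_names, envir = env)         # named list of data frames

# The list can then be merged as in the answer above.
merged <- Reduce(function(a, b) merge(a, b, by = "i", all = TRUE), df_list)
print(merged)
```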
243

Reduce makes this fairly easy:

merged.data.frame = Reduce(function(...) merge(..., all=T), list.of.data.frames)

Here's a full example using some mock data:

set.seed(1)
list.of.data.frames = list(data.frame(x=1:10, a=1:10), data.frame(x=5:14, b=11:20), data.frame(x=sample(20, 10), y=runif(10)))
merged.data.frame = Reduce(function(...) merge(..., all=T), list.of.data.frames)
tail(merged.data.frame)
#    x  a  b         y
#12 12 NA 18        NA
#13 13 NA 19        NA
#14 14 NA 20 0.4976992
#15 15 NA NA 0.7176185
#16 16 NA NA 0.3841037
#17 19 NA NA 0.3800352

And here's an example using these data to replicate my.list:

merged.data.frame = Reduce(function(...) merge(..., by=match.by, all=T), my.list)
merged.data.frame[, 1:12]

#  matchname party st district chamber senate1993 name.x v2.x v3.x v4.x senate1994 name.y
#1   ALGIERE   200 RI      026       S         NA   <NA>   NA   NA   NA         NA   <NA>
#2     ALVES   100 RI      019       S         NA   <NA>   NA   NA   NA         NA   <NA>
#3    BADEAU   100 RI      032       S         NA   <NA>   NA   NA   NA         NA   <NA>

Note: It looks like this is arguably a bug in merge. The problem is there is no check that adding the suffixes (to handle overlapping non-matching names) actually makes them unique. At a certain point it uses [.data.frame which does make.unique the names, causing the rbind to fail.

# first merge will end up with 'name.x' & 'name.y'
merge(my.list[[1]], my.list[[2]], by=match.by, all=T)
# [1] matchname    party        st           district     chamber      senate1993   name.x      
# [8] votes.year.x senate1994   name.y       votes.year.y
#<0 rows> (or 0-length row.names)
# as there is no clash, we retain 'name.x' & 'name.y' and get 'name' again
merge(merge(my.list[[1]], my.list[[2]], by=match.by, all=T), my.list[[3]], by=match.by, all=T)
# [1] matchname    party        st           district     chamber      senate1993   name.x      
# [8] votes.year.x senate1994   name.y       votes.year.y senate1995   name         votes.year  
#<0 rows> (or 0-length row.names)
# the next merge will fail as 'name' will get renamed to a pre-existing field.

The easiest fix is to not leave the renaming of duplicate fields (of which there are many here) up to merge. E.g.:

my.list2 = Map(function(x, i) setNames(x, ifelse(names(x) %in% match.by,
      names(x), sprintf('%s.%d', names(x), i))), my.list, seq_along(my.list))

The merge/Reduce will then work fine.
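
To see the renaming step in isolation, here is a minimal sketch on made-up data. The list my.list and the key "id" below are hypothetical stand-ins for the asker's real data, which is not reproduced here.

```r
# Hypothetical mock data: three data frames that share the key "id" and a
# clashing non-key column "name".
match.by <- "id"
my.list <- list(
  data.frame(id = 1:3, name = c("a", "b", "c"), stringsAsFactors = FALSE),
  data.frame(id = 2:4, name = c("B", "C", "D"), stringsAsFactors = FALSE),
  data.frame(id = 3:5, name = c("x", "y", "z"), stringsAsFactors = FALSE)
)

# Suffix every non-key column with the position of its data frame in the
# list, so merge never has to invent suffixes itself.
my.list2 <- Map(function(x, i) {
  setNames(x, ifelse(names(x) %in% match.by, names(x),
                     sprintf("%s.%d", names(x), i)))
}, my.list, seq_along(my.list))

merged <- Reduce(function(a, b) merge(a, b, by = match.by, all = TRUE),
                 my.list2)
print(names(merged))
```

Because each non-key column carries its own numeric suffix before merging, the column names stay unique no matter how many data frames are in the list.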

Asmara answered 11/11, 2011 at 17:12 Comment(11)
Thanks! I saw this solution also on the link from Ramnath. Looks easy enough. But I get the following error: "Error in match.names(clabs, names(xi)) : names do not match previous names". The variables I'm matching on are all present in all the data frames in the list, so I'm not catching what this error is telling me. - Cherenkov
I tested this solution on R 2.7.2 and I get the same match.names error. So there's some more fundamental problem with this solution and my data. I used the code: Reduce(function(x, y) merge(x, y, all=T,by.x=match.by, by.y=match.by), my.list, accumulate=F) - Cherenkov
Strange, I added the code that I tested it with, which runs fine. I guess there is some field renaming occurring based on the merge args you're using? The merged result must still have the relevant keys in order to be merged with the subsequent data frame. - Asmara
I suspect something is happening with empty data frames. I tried out some examples like this: empty <- data.frame(x=numeric(0),a=numeric(0); L3 <- c(empty,empty,list.of.data.frames,empty,empty,empty) and got some weird stuff happening that I haven't figured out yet. - Danella
@Asmara You're onto something. Your code above runs fine for me. And when I adapt it to mine, it runs fine too -- except that it does a merge ignoring the key variables I want. When I try to add key variables rather than leave them out, I get a new error "Error in is.null(x) : 'x' is missing". The code line is "test.reduce <- Reduce(function(...) merge(by=match.by, all=T), my.list)" where match.by is the vector of key variable names I want to merge by. - Cherenkov
@BenBolker No, it can't be empty data frames; your code isn't right. It should be L3 <- list(empty,empty, data.frame(x=1:10, a=1:10), data.frame(x=5:14, b=11:20), data.frame(x=sample(20, 10), y=runif(10)),empty,empty,empty) and then m3 = Reduce(function(...) merge(..., all=T), L3) works just fine. - Cherenkov
@Asmara Sorry, that last code line is wrong. When properly rewritten as test.reduce <- Reduce(function(...) merge(..., by=match.by, all=T), my.list) I get the same old "match.names" error. - Cherenkov
Still not able to replicate the problem - see updated answer. Can you provide better sample data? Maybe just save(my.list, file='my.list.RData') and upload? - Asmara
@Asmara You've gone above and beyond -- thanks. I uploaded replication code and real data that is accessed via url to show you the problem. Thanks for showing me pastebin. - Cherenkov
The reason it worked without match.by is that it wasn't doing field renaming on name, but rather including that in the key. - Asmara
@Asmara Wow; this did it! Both "Recall" and "Reduce" solutions work fine now in 2.12. Thank you. I've never really run into a bug in core R code before... I do wonder why "Recall" worked in 2.7.2 but not now. - Cherenkov
60

You can do it using merge_all in the reshape package. You can pass parameters to merge using the ... argument:

reshape::merge_all(list_of_dataframes, ...)

Here is an excellent resource on different methods to merge data frames.

Spragens answered 11/11, 2011 at 15:24 Comment(8)
looks like I just replicated merge_recurse =) good to know this function already exists. - Alkalify
yes. whenever i have an idea, i always check if @hadley has already done it, and most of the times he has :-) - Spragens
I'm a little confused; should I do merge_all or merge_recurse? In any case, when I try to add in my additional arguments to either, I get the error "formal argument "all" matched by multiple actual arguments". - Cherenkov
@bshor. it would be useful to post a few lines of your original data frames, so that your error is reproducible. you can easily do it using dput. - Spragens
I think I dropped this from reshape2. Reduce + merge is just as simple. - Gymnosophist
@Spragens Yikes. My list has 19 data frames, each about 48-50 rows and 600 columns! dput puts tons of data on screen. What's the best way to summarize? - Cherenkov
@Spragens I updated the original post with my attempt to use a real example from my code, but shortening up the data frames for exposition. - Cherenkov
@Ramnath, link is dead, is there a mirror? - Claimant
8

We can use {powerjoin}.

Borrowing sample data from accepted answer:

x <- data.frame(i = c("a","b","c"), j = 1:3, stringsAsFactors=FALSE)
y <- data.frame(i = c("b","c","d"), k = 4:6, stringsAsFactors=FALSE)
z <- data.frame(i = c("c","d","a"), l = 7:9, stringsAsFactors=FALSE)

library(powerjoin)
power_full_join(list(x,y,z), by = "i")
#>   i  j  k  l
#> 1 a  1 NA  9
#> 2 b  2  4 NA
#> 3 c  3  5  7
#> 4 d NA  6  8

power_left_join(list(x,y,z), by = "i")
#>   i j  k  l
#> 1 a 1 NA  9
#> 2 b 2  4 NA
#> 3 c 3  5  7

You might also start with a data frame and join a list of data frames to it, for the same result:

power_full_join(x, list(y,z), by = "i")
#>   i  j  k  l
#> 1 a  1 NA  9
#> 2 b  2  4 NA
#> 3 c  3  5  7
#> 4 d NA  6  8
Brogan answered 3/3, 2019 at 12:44 Comment(0)
6

You can use recursion to do this. I haven't verified the following, but it should give you the right idea:

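MergeListOfDf = function( data , ... )
{
    # Base case: a single data frame needs no merging
    if ( length( data ) == 1 ) 
    {
        return( data[[ 1 ]] )
    }
    # Merge the last data frame into the merged result of all the others,
    # so the list is processed from left to right
    n = length( data )
    return( merge( MergeListOfDf( data[ -n ] , ... ) , data[[ n ]] , ... ) )
}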
Alkalify answered 11/11, 2011 at 15:13 Comment(0)
5

I will reuse the data example from @PaulRougieux, built here with tibble() because data_frame() is deprecated:

x <- tibble::tibble(i = c("a","b","c"), j = 1:3)
y <- tibble::tibble(i = c("b","c","d"), k = 4:6)
z <- tibble::tibble(i = c("c","d","a"), l = 7:9)

Here's a short and sweet solution using purrr and tidyr

library(tidyverse)

 list(x, y, z) %>% 
  map_df(gather, key=key, value=value, -i) %>% 
  spread(key, value)
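
In current tidyr, gather() and spread() are superseded by pivot_longer() and pivot_wider(). The following is a sketch of the same reshape-and-stack idea with those functions, not the answer's original code; it assumes the dplyr, tidyr and purrr packages are installed, and the result name wide is made up for illustration.

```r
library(dplyr)
library(tidyr)
library(purrr)

x <- tibble(i = c("a", "b", "c"), j = 1:3)
y <- tibble(i = c("b", "c", "d"), k = 4:6)
z <- tibble(i = c("c", "d", "a"), l = 7:9)

wide <- list(x, y, z) %>%
  # Reshape each data frame to long form: one row per (i, column) pair.
  map(function(d) pivot_longer(d, cols = -i,
                               names_to = "key", values_to = "value")) %>%
  # Stack the long data frames, then spread the columns back out.
  bind_rows() %>%
  pivot_wider(names_from = key, values_from = value)

print(wide)
```

Every value of i found in any input gets a row, so the result corresponds to a full join. This approach relies on all value columns having a compatible type, since they share one column in the long form.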
Slumber answered 28/7, 2017 at 10:59 Comment(0)
1

I had a list of data frames with no common ID column, and many of them had missing data (NULL values). The data frames were produced using the table function. Reduce, merge, rbind, rbind.fill, and the like did not get me what I wanted: an understandable merged data frame, regardless of the missing data and the lack of a common ID column.

Therefore, I wrote the following function. Maybe it can help someone.

##########################################################
####             Dependencies                        #####
##########################################################

# Depends on Base R only

##########################################################
####             Example DF                          #####
##########################################################

# Example df
ex_df           <- cbind(c( seq(1, 10, 1), rep("NA", 0), seq(1,10, 1) ), 
                         c( seq(1, 7, 1),  rep("NA", 3), seq(1, 12, 1) ), 
                         c( seq(1, 3, 1),  rep("NA", 7), seq(1, 5, 1), rep("NA", 5) ))

# Making colnames and rownames
colnames(ex_df) <- 1:dim(ex_df)[2]
rownames(ex_df) <- 1:dim(ex_df)[1]

# Making an unequal list of dfs, 
# without a common id column
list_of_df      <- apply(ex_df=="NA", 2, ( table) )

Here is the function:

##########################################################
####             The function                        #####
##########################################################


# The function to rbind it
rbind_null_df_lists <- function ( list_of_dfs ) {
  length_df     <- do.call(rbind, (lapply( list_of_dfs, function(x) length(x))))
  max_no        <- max(length_df[,1])
  name_df       <- names(length_df[length_df== max_no,][1])
  names_list    <- names(list_of_dfs[ name_df][[1]])

  df_dfs <- list()
  for (i in 1:max_no ) {

    df_dfs[[i]]            <- do.call(rbind, lapply(1:length(list_of_dfs), function(x) list_of_dfs[[x]][i]))

  }

  df_cbind               <- do.call( cbind, df_dfs )
  rownames( df_cbind )   <- rownames (length_df)
  colnames( df_cbind )   <- names_list

  df_cbind

}

Running the example

##########################################################
####             Running the example                 #####
##########################################################

rbind_null_df_lists ( list_of_df )
Hole answered 17/10, 2018 at 12:32 Comment(0)
1

Here is a generic wrapper that converts a binary function into a multi-parameter function. The benefit of this solution is that it is very generic and can be applied to any binary function. You define it once and can then apply it anywhere.

To demonstrate the idea, I implement it with simple recursion. It could of course be implemented more elegantly by taking advantage of R's good support for the functional paradigm.

fold_left <- function(f) {
    return(function(...) {
        args <- list(...)
        return(function(...) {
            iter <- function(result, rest) {
                if (length(rest) == 0) {
                    return(result)
                } else {
                    return(iter(f(result, rest[[1]], ...), rest[-1]))
                }
            }
            return(iter(args[[1]], args[-1]))
        })
    })
}

Then you can simply wrap any binary function with it and call the result with positional parameters (usually data.frames) in the first parentheses and named parameters in the second parentheses (such as by = or suffix =). If there are no named parameters, leave the second parentheses empty.

merge_all <- fold_left(merge)
merge_all(df1, df2, df3, df4, df5)(by.x = c("var1", "var2"), by.y = c("var1", "var2"))

left_join_all <- fold_left(left_join)
left_join_all(df1, df2, df3, df4, df5)(c("var1", "var2"))
left_join_all(df1, df2, df3, df4, df5)()
Indium answered 13/5, 2020 at 13:56 Comment(0)
0

When you have a list of data frames that share an ID column, but some IDs are missing from some of the data frames, you can use this version of Reduce with merge to join them while keeping every ID:

Reduce(function(x, y) merge(x=x, y=y, by="V1", all.x=TRUE, all.y=TRUE), list_of_dfs)
Hole answered 12/9, 2019 at 13:0 Comment(0)
