Merging a lot of data.frames [duplicate]
Asked Answered
U

1

88

Possible Duplicate:
Merge multiple data frames in a list simultaneously

example data.frames:

 df1 = data.frame(id=c('1','73','2','10','43'),v1=c(1,2,3,4,5)) <br>
 df2 = data.frame(id=c('7','23','57','2','62','96'),v2=c(1,2,3,4,5,6)) <br>
 df3 = data.frame(id=c('23','62'),v3=c(1,2)) <br>

Note: id is unique for each data.frame. I want the resulting matrix to look like

1      1 NA NA 
2      3  4 NA 
7      NA 1 NA 
10     4 NA NA 
23     NA 2  1 
43     5 NA NA 
57     NA 3 NA 
62     NA 5  2 
73     2 NA NA 
96     NA 6 NA

In this case, I only show 3 datasets, I actually have at least 22 of them so at the end I want a matrix of nx(22+1) where n is the number of ids for all 22 datasets.

Given 2 datasets, I need to get their ids in the first column and 2nd and 3rd columns are filled with the values, if there is no value exists, then input NA instead.

Unconventionality answered 31/12, 2012 at 2:31 Comment(2)
This is not a solution but in addition to what is stated by @Matthew Plourde. You can build list of data.frames: df_list <- lapply(paste0("df",1:22), as.name).Pomp
Even though this thread may be duplicate of another, but both questions and answers are presented in a more readable way.Carmel
S
203

Put them into a list and use merge with Reduce

Reduce(function(x, y) merge(x, y, all=TRUE), list(df1, df2, df3))
#    id v1 v2 v3
# 1   1  1 NA NA
# 2  10  4 NA NA
# 3   2  3  4 NA
# 4  43  5 NA NA
# 5  73  2 NA NA
# 6  23 NA  2  1
# 7  57 NA  3 NA
# 8  62 NA  5  2
# 9   7 NA  1 NA
# 10 96 NA  6 NA

You can also use this more concise version:

Reduce(function(...) merge(..., all=TRUE), list(df1, df2, df3))
Smallscale answered 31/12, 2012 at 2:40 Comment(17)
+1 for Reduce. For this simple example, this is equivalent to merge(merge(df1, df2, by='id', all=T), df3, by='id', all=T). Clearly a loop could be used, iterating through the data frames -- but that's exactly what Reduce does.Crenate
I'm thankful Reduce is in the language, but I really wish it were more like the *apply functions, letting you give it additional arguments for the functional supplied. I hate that I have to embed a function definition just to use merge with all=TRUE.Smallscale
merge_recurse and merge_all from the (older) reshape package are a decent guide for how to build something that does this for you in a more convenient form.Benniebenning
Can I ask how to modify the function if we would have to account for different ids across those data frames?Crispate
I think it would be easiest to standardize the id column names. The hack to handle this with Reduce would get kind of obscure.Smallscale
I tried this method and it is really slow with 500 lists each with 125 rows. Are there any other fast methodsOhm
@PollaA.Fattah take a look at the join functions in the dplyr packageSmallscale
@MatthewPlourde Thank you for your reply it turns out my problem was simpler than what I thought of and I used rbind eventually.Ohm
It was fast (immediate) with 8 data frames with ~3000 rows eachMetamathematics
I am using merge and reduce, but geeting the following error: Error: cannot allocate vector of size 2.5 Gb, please helpAlphitomancy
@gauravkumar Welcome to the world of big data. You'll want to check out the CRAN task view on high-performance computing (cran.r-project.org/web/views/HighPerformanceComputing.html), especially the section called "Large memory and out-of-memory data"Celia
How to handle if one of tables is null? It affects the results but should be automatically excludedMicrophyte
@Microphyte just filter firstSmallscale
I faced a warning when by = 0 saying the "Row.name" is duplicated. It turned out that the merge function does something stupid when trying to merge based on row.names. The solution I found is to move the row.names into a column and use that to merge. Hopefully this is useful to those who are desperately looking for an answer about the warning and misbehave of merge.Adley
@MehradMahmoudian how did you do that?Behest
@Behest basically just create a new column for each dataframe/matrix and fill it with the rownames (e.g df$rownames <- row.names(df)) and then while merging use by = "rownames".Adley
I wonder how to use the suffixes argument from merge.data.table when multiple dataframes have columns with the same name, not counting the matching column.Diphenylhydantoin

© 2022 - 2024 — McMap. All rights reserved.