Merging a lot of data.frames [duplicate]


Asked 31/12, 2012 at 2:31 Answered 31/12, 2012 at 2:40

Possible Duplicate:
Merge multiple data frames in a list simultaneously

example data.frames:

df1 = data.frame(id=c('1','73','2','10','43'),v1=c(1,2,3,4,5))
df2 = data.frame(id=c('7','23','57','2','62','96'),v2=c(1,2,3,4,5,6))
df3 = data.frame(id=c('23','62'),v3=c(1,2))

Note: id is unique for each data.frame. I want the resulting matrix to look like

1      1 NA NA 
2      3  4 NA 
7      NA 1 NA 
10     4 NA NA 
23     NA 2  1 
43     5 NA NA 
57     NA 3 NA 
62     NA 5  2 
73     2 NA NA 
96     NA 6 NA

Here I show only 3 datasets, but I actually have at least 22 of them, so in the end I want an n x (22+1) matrix, where n is the number of distinct ids across all 22 datasets.

Given 2 datasets, the ids should go in the first column and the values in the 2nd and 3rd columns; where no value exists, fill in NA instead.

Unconventionality answered 31/12, 2012 at 2:31 Comment(2)

This is not a solution, but an addition to what @Matthew Plourde stated. You can build a list of data.frames by name: df_list <- mget(paste0("df", 1:22)). – Pomp 31/12, 2012 at 9:2
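A minimal sketch of collecting data frames by name, using the three example data frames from the question in place of all 22. It relies on `mget()`, which returns the objects themselves (whereas `as.name()` would return only symbols); the variable name `df_list` is just illustrative.

```r
# Example data frames from the question
df1 <- data.frame(id = c("1", "73", "2", "10", "43"), v1 = c(1, 2, 3, 4, 5))
df2 <- data.frame(id = c("7", "23", "57", "2", "62", "96"), v2 = c(1, 2, 3, 4, 5, 6))
df3 <- data.frame(id = c("23", "62"), v3 = c(1, 2))

# mget() looks up each name and returns a named list of the objects
df_list <- mget(paste0("df", 1:3))

# The list can then be passed straight to Reduce()
merged <- Reduce(function(x, y) merge(x, y, by = "id", all = TRUE), df_list)
```

With 22 data frames named df1 to df22, the same call would use `1:22`.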

Even though this thread may be a duplicate of another, both the question and the answers here are presented in a more readable way. – Carmel 15/2, 2016 at 9:54

203

Put them into a list and use merge with Reduce

Reduce(function(x, y) merge(x, y, all=TRUE), list(df1, df2, df3))
#    id v1 v2 v3
# 1   1  1 NA NA
# 2  10  4 NA NA
# 3   2  3  4 NA
# 4  43  5 NA NA
# 5  73  2 NA NA
# 6  23 NA  2  1
# 7  57 NA  3 NA
# 8  62 NA  5  2
# 9   7 NA  1 NA
# 10 96 NA  6 NA

You can also use this more concise version:

Reduce(function(...) merge(..., all=TRUE), list(df1, df2, df3))
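The question asks for rows ordered by numeric id, while `merge` sorts the key as text (or by factor level). A possible follow-up step, sketched here on the question's example data, is to reorder the merged result after converting the ids to numbers. This assumes every id can be parsed as a number.

```r
# Example data frames from the question
df1 <- data.frame(id = c("1", "73", "2", "10", "43"), v1 = c(1, 2, 3, 4, 5))
df2 <- data.frame(id = c("7", "23", "57", "2", "62", "96"), v2 = c(1, 2, 3, 4, 5, 6))
df3 <- data.frame(id = c("23", "62"), v3 = c(1, 2))

# Full outer join of all data frames on "id"
merged <- Reduce(function(x, y) merge(x, y, by = "id", all = TRUE),
                 list(df1, df2, df3))

# ids may be character or factor depending on the R version,
# so go through as.character() before converting to numeric
merged <- merged[order(as.numeric(as.character(merged$id))), ]
rownames(merged) <- NULL  # reset row labels after reordering

merged
```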

Smallscale answered 31/12, 2012 at 2:40 Comment(17)

+1 for Reduce. For this simple example, this is equivalent to merge(merge(df1, df2, by='id', all=T), df3, by='id', all=T). Clearly a loop could be used, iterating through the data frames -- but that's exactly what Reduce does. – Crenate 31/12, 2012 at 2:40

I'm thankful Reduce is in the language, but I really wish it were more like the *apply functions, letting you give it additional arguments for the functional supplied. I hate that I have to embed a function definition just to use merge with all=TRUE. – Smallscale 31/12, 2012 at 2:48

merge_recurse and merge_all from the (older) reshape package are a decent guide for how to build something that does this for you in a more convenient form. – Benniebenning 31/12, 2012 at 2:52

Can I ask how to modify the function if we would have to account for different ids across those data frames? – Crispate 6/7, 2015 at 15:45

I think it would be easiest to standardize the id column names. The hack to handle this with Reduce would get kind of obscure. – Smallscale 6/7, 2015 at 15:48
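One way to standardize the id column names before merging might look like the sketch below. The data frames `a`, `b`, `c3` and the vector `key_cols` are hypothetical; the idea is simply to rename each key column to "id" so that one `by` value works for every merge.

```r
# Hypothetical data frames whose key columns have different names
a  <- data.frame(id  = c("1", "2"), v1 = c(10, 20))
b  <- data.frame(ID  = c("2", "3"), v2 = c(30, 40))
c3 <- data.frame(key = c("1", "3"), v3 = c(50, 60))

dfs <- list(a, b, c3)
key_cols <- c("id", "ID", "key")  # assumed: key column of each data frame, in order

# Rename each data frame's key column to "id"
dfs <- Map(function(df, key) {
  names(df)[names(df) == key] <- "id"
  df
}, dfs, key_cols)

# Now the usual Reduce/merge pattern applies
merged <- Reduce(function(x, y) merge(x, y, by = "id", all = TRUE), dfs)
```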

I tried this method and it is really slow with 500 lists, each with 125 rows. Are there any faster methods? – Ohm 7/9, 2015 at 13:19

@PollaA.Fattah take a look at the join functions in the dplyr package – Smallscale 7/9, 2015 at 13:21

@MatthewPlourde Thank you for your reply. It turns out my problem was simpler than I thought, and I used rbind in the end. – Ohm 7/9, 2015 at 14:2

It was fast (immediate) with 8 data frames with ~3000 rows each – Metamathematics 16/2, 2016 at 23:29

I am using merge and Reduce, but getting the following error: Error: cannot allocate vector of size 2.5 Gb. Please help. – Alphitomancy 2/1, 2017 at 4:31

@gauravkumar Welcome to the world of big data. You'll want to check out the CRAN task view on high-performance computing (cran.r-project.org/web/views/HighPerformanceComputing.html), especially the section called "Large memory and out-of-memory data" – Celia 3/11, 2017 at 23:8

How should this be handled if one of the tables is null? It affects the results, but it should be excluded automatically. – Microphyte 26/1, 2018 at 23:35

@Microphyte just filter first – Smallscale 28/1, 2018 at 23:4
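Filtering first could be sketched as follows, assuming "null" means either a `NULL` entry or a data frame with no rows. The list `dfs` is a hypothetical input that mixes the question's data frames with such entries.

```r
# Example data frames from the question
df1 <- data.frame(id = c("1", "73", "2", "10", "43"), v1 = c(1, 2, 3, 4, 5))
df2 <- data.frame(id = c("7", "23", "57", "2", "62", "96"), v2 = c(1, 2, 3, 4, 5, 6))
df3 <- data.frame(id = c("23", "62"), v3 = c(1, 2))

# Hypothetical input containing a NULL and an empty data frame
dfs <- list(df1, NULL, df2, data.frame(), df3)

# Keep only real data frames that have at least one row
keep <- Filter(function(d) is.data.frame(d) && nrow(d) > 0, dfs)

merged <- Reduce(function(x, y) merge(x, y, by = "id", all = TRUE), keep)
```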

I got a warning with by = 0 saying that "Row.names" is duplicated. It turned out that merge misbehaves when it is applied repeatedly on row.names. The solution I found is to move the row.names into a column and merge on that column. Hopefully this is useful to anyone looking for an answer about this warning and the unexpected behaviour of merge. – Adley 12/6, 2020 at 22:12

@MehradMahmoudian how did you do that? – Behest 29/11, 2021 at 19:47

@Behest basically just create a new column in each dataframe/matrix and fill it with the rownames (e.g. df$rownames <- row.names(df)), and then use by = "rownames" when merging. – Adley 1/12, 2021 at 10:20
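The approach described in this comment might look like the sketch below. The data frames `m1`, `m2`, `m3` are hypothetical inputs keyed by row names; the column name "rownames" follows the comment.

```r
# Hypothetical data frames identified by row names rather than an id column
m1 <- data.frame(v1 = c(1, 2), row.names = c("a", "b"))
m2 <- data.frame(v2 = c(3, 4), row.names = c("b", "c"))
m3 <- data.frame(v3 = c(5, 6), row.names = c("a", "c"))

# Copy the row names into an ordinary column in every data frame
with_key <- lapply(list(m1, m2, m3), function(df) {
  df <- as.data.frame(df)       # also accepts matrices
  df$rownames <- row.names(df)
  df
})

# Merge on that column instead of by = 0
merged <- Reduce(function(x, y) merge(x, y, by = "rownames", all = TRUE),
                 with_key)
```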

I wonder how to use the suffixes argument from merge.data.table when multiple dataframes have columns with the same name, not counting the matching column. – Diphenylhydantoin 9/2, 2022 at 17:3
