Combine a list of data frames into one data frame by row

I have code that at one place ends up with a list of data frames which I really want to convert to a single big data frame.

I got some pointers from an earlier question which was trying to do something similar but more complex.

Here's an example of what I am starting with (this is grossly simplified for illustration):

listOfDataFrames <- vector(mode = "list", length = 100)

for (i in 1:100) {
    listOfDataFrames[[i]] <- data.frame(a=sample(letters, 500, rep=T),
                             b=rnorm(500), c=rnorm(500))
}

I am currently using this:

  df <- do.call("rbind", listOfDataFrames)
Shoshana answered 17/5, 2010 at 17:38 Comment(4)
Also see this question: #2209758 Antecedency
The do.call("rbind", list) idiom is what I have used before as well. Why do you need the initial unlist?Dewdrop
can someone explain to me the difference between do.call("rbind", list) and rbind(list) - why are the outputs not the same?Mistrot
@Mistrot Because do.call() does not pass the arguments one by one, but uses a list to hold the arguments of the function. See https://www.stat.berkeley.edu/~s133/Docall.html Biform

Use bind_rows() from the dplyr package:

bind_rows(list_of_dataframes, .id = "column_label")
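A minimal sketch of what .id gives you (assuming dplyr is installed, and using a hypothetical named list for illustration): when the list elements are named, those names fill the identifier column; an unnamed list gets positional indices instead.

library(dplyr)

named_list <- list(run1 = data.frame(x = 1:2),
                   run2 = data.frame(x = 3:4))

bind_rows(named_list, .id = "column_label")
#   column_label x
# 1         run1 1
# 2         run1 2
# 3         run2 3
# 4         run2 4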
Digitalis answered 27/2, 2018 at 20:5 Comment(4)
Nice solution. .id = "column_label" adds an identifier column based on the list element names.Adz
since it's 2018 and dplyr is both fast and a solid tool to use, I've changed this to the accepted answer. The years, they fly by!Shoshana
That was exactly what I needed!!Assistance
Strange, but it does not work properly with a list of tibblesPsychiatrist

One other option is to use a plyr function:

df <- ldply(listOfDataFrames, data.frame)

This is a little slower than the original:

> system.time({ df <- do.call("rbind", listOfDataFrames) })
   user  system elapsed 
   0.25    0.00    0.25 
> system.time({ df2 <- ldply(listOfDataFrames, data.frame) })
   user  system elapsed 
   0.30    0.00    0.29
> identical(df, df2)
[1] TRUE

My guess is that using do.call("rbind", ...) is going to be the fastest approach that you will find unless you can do something like (a) use matrices instead of data frames and (b) preallocate the final matrix and assign to it rather than growing it.
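For what it's worth, here is a rough sketch of idea (b), assuming every data frame in the list has the same number of rows and the same numeric columns b and c (the character column a is ignored, so this is illustrative rather than a drop-in replacement):

n_per <- nrow(listOfDataFrames[[1]])
# preallocate the full result, then fill blocks of rows in place
out <- matrix(NA_real_, nrow = n_per * length(listOfDataFrames), ncol = 2,
              dimnames = list(NULL, c("b", "c")))
for (i in seq_along(listOfDataFrames)) {
  rows <- (i - 1) * n_per + seq_len(n_per)
  out[rows, ] <- as.matrix(listOfDataFrames[[i]][, c("b", "c")])
}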

Edit 1:

Based on Hadley's comment, here's the latest version of rbind.fill from CRAN:

> system.time({ df3 <- rbind.fill(listOfDataFrames) })
   user  system elapsed 
   0.24    0.00    0.23 
> identical(df, df3)
[1] TRUE

This is easier than rbind, and marginally faster (these timings hold up over multiple runs). And as far as I understand it, the version of plyr on github is even faster than this.

Antecedency answered 17/5, 2010 at 17:54 Comment(6)
rbind.fill in the latest version of plyr is considerably faster than do.call and rbindTallowy
Interesting. For me rbind.fill was the fastest. Oddly enough, do.call / rbind did not return identical TRUE, even though I could not find a difference. The other two were equal but plyr was slower.Partitive
I() could replace data.frame in your ldply callOccultation
there's also melt.list in reshape(2)Occultation
do.call(function(...) rbind(..., make.row.names=F), df) is useful if you don't want the automatically-generated unique rownames.Emmery
bind_rows() is fastest according to rmd's answer and I think it is the most straightforward. It also has the feature of adding an id columnOrganometallic

For the purpose of completeness, I thought the answers to this question required an update. "My guess is that using do.call("rbind", ...) is going to be the fastest approach that you will find..." That was probably true in May 2010 and for some time after, but around September 2011 a new function, rbindlist, was introduced in the data.table package (version 1.8.2) with a remark that "This does the same as do.call("rbind", l), but much faster". How much faster?

library(rbenchmark)
benchmark(
  do.call = do.call("rbind", listOfDataFrames),
  plyr_rbind.fill = plyr::rbind.fill(listOfDataFrames), 
  plyr_ldply = plyr::ldply(listOfDataFrames, data.frame),
  data.table_rbindlist = as.data.frame(data.table::rbindlist(listOfDataFrames)),
  replications = 100, order = "relative", 
  columns=c('test','replications', 'elapsed','relative')
  ) 

                  test replications elapsed relative
4 data.table_rbindlist          100    0.11    1.000
1              do.call          100    9.39   85.364
2      plyr_rbind.fill          100   12.08  109.818
3           plyr_ldply          100   15.14  137.636
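A side note rather than part of the benchmark above (a hedged sketch): rbindlist also has use.names and fill arguments, so it can stack data frames whose columns do not line up, padding the missing columns with NA.

library(data.table)
l <- list(data.frame(a = 1:2),
          data.frame(a = 3:4, b = c("x", "y")))
rbindlist(l, use.names = TRUE, fill = TRUE)  # rows from the first frame get b = NA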
Countryandwestern answered 28/8, 2013 at 13:49 Comment(5)
Thank you so much for this -- I was pulling my hair out because my data sets were getting too big for ldplying a bunch of long, molten data frames. Anyways, I got an incredible speedup by using your rbindlist suggestion.Etna
And one more for completeness: dplyr::rbind_all(listOfDataFrames) will do the trick as well.Saxena
is there an equivalent to rbindlist but that appends the data frames by column? Something like a cbindlist?Jessjessa
@Jessjessa There is a recent feature request: add function cbindlistPendergrass
I was also pulling my hair out because do.call() had been running on a list of data frames for 18 hours, and still hadn't finished, thank you!!!Castillo

[benchmark plot: microbenchmark autoplot of the approaches timed below]

Code:

library(microbenchmark)

dflist <- vector(length=100,mode="list")
for(i in 1:100)
{
  dflist[[i]] <- data.frame(a=runif(n=260),b=runif(n=260),
                            c=rep(LETTERS,10),d=rep(LETTERS,10))
}


mb <- microbenchmark(
plyr::rbind.fill(dflist),
dplyr::bind_rows(dflist),
data.table::rbindlist(dflist),
plyr::ldply(dflist,data.frame),
do.call("rbind",dflist),
times=1000)

ggplot2::autoplot(mb)

Session:

R version 3.3.0 (2016-05-03)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

> packageVersion("plyr")
[1] ‘1.8.4’
> packageVersion("dplyr")
[1] ‘0.5.0’
> packageVersion("data.table")
[1] ‘1.9.6’

UPDATE: Rerun 31-Jan-2018. Ran on the same computer. New versions of packages. Added seed for seed lovers.

[benchmark plot]

set.seed(21)
library(microbenchmark)

dflist <- vector(length=100,mode="list")
for(i in 1:100)
{
  dflist[[i]] <- data.frame(a=runif(n=260),b=runif(n=260),
                            c=rep(LETTERS,10),d=rep(LETTERS,10))
}


mb <- microbenchmark(
  plyr::rbind.fill(dflist),
  dplyr::bind_rows(dflist),
  data.table::rbindlist(dflist),
  plyr::ldply(dflist,data.frame),
  do.call("rbind",dflist),
  times=1000)

ggplot2::autoplot(mb)+ggplot2::theme_bw()


R version 3.4.0 (2017-04-21)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

> packageVersion("plyr")
[1] ‘1.8.4’
> packageVersion("dplyr")
[1] ‘0.7.2’
> packageVersion("data.table")
[1] ‘1.10.4’

UPDATE: Rerun 06-Aug-2019.

[benchmark plot]

set.seed(21)
library(microbenchmark)

dflist <- vector(length=100,mode="list")
for(i in 1:100)
{
  dflist[[i]] <- data.frame(a=runif(n=260),b=runif(n=260),
                            c=rep(LETTERS,10),d=rep(LETTERS,10))
}


mb <- microbenchmark(
  plyr::rbind.fill(dflist),
  dplyr::bind_rows(dflist),
  data.table::rbindlist(dflist),
  plyr::ldply(dflist,data.frame),
  do.call("rbind",dflist),
  purrr::map_df(dflist,dplyr::bind_rows),
  times=1000)

ggplot2::autoplot(mb)+ggplot2::theme_bw()

R version 3.6.0 (2019-04-26)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.2 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.2.20.so

packageVersion("plyr")
packageVersion("dplyr")
packageVersion("data.table")
packageVersion("purrr")

>> packageVersion("plyr")
[1] ‘1.8.4’
>> packageVersion("dplyr")
[1] ‘0.8.3’
>> packageVersion("data.table")
[1] ‘1.12.2’
>> packageVersion("purrr")
[1] ‘0.3.2’

UPDATE: Rerun 18-Nov-2021.

[benchmark plot]

set.seed(21)
library(microbenchmark)

dflist <- vector(length=100,mode="list")
for(i in 1:100)
{
  dflist[[i]] <- data.frame(a=runif(n=260),b=runif(n=260),
                            c=rep(LETTERS,10),d=rep(LETTERS,10))
}


mb <- microbenchmark(
  plyr::rbind.fill(dflist),
  dplyr::bind_rows(dflist),
  data.table::rbindlist(dflist),
  plyr::ldply(dflist,data.frame),
  do.call("rbind",dflist),
  Reduce("rbind",dflist),
  purrr::map_df(dflist,dplyr::bind_rows),
  times=1000)

ggplot2::autoplot(mb)+ggplot2::theme_bw()

R version 4.1.2 (2021-11-01)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19043)

> packageVersion("plyr")
[1] ‘1.8.6’
> packageVersion("dplyr")
[1] ‘1.0.7’
> packageVersion("data.table")
[1] ‘1.14.2’
> packageVersion("purrr")
[1] ‘0.3.4’

UPDATE: Rerun 14-Dec-2023. Updated version of R and packages, but same machine. Also added Mael's answer with collapse package.

[benchmark plot]

set.seed(21)
library(microbenchmark)

dflist <- vector(length=100,mode="list")
for(i in 1:100) dflist[[i]] <- data.frame(a=runif(n=260), b=runif(n=260), c=rep(LETTERS,10), d=rep(LETTERS,10))

mb <- microbenchmark(
  plyr::rbind.fill(dflist),
  dplyr::bind_rows(dflist),
  data.table::rbindlist(dflist),
  plyr::ldply(dflist,data.frame),
  do.call("rbind",dflist),
  Reduce("rbind",dflist),
  purrr::map_df(dflist,dplyr::bind_rows),
  collapse::unlist2d(dflist),
  times=1000)

ggplot2::autoplot(mb)+ggplot2::theme_bw()
Unit: microseconds
                                    expr     min       lq       mean   median       uq      max neval
                plyr::rbind.fill(dflist)  7146.5  7514.00  8270.0111  7753.80  8112.95  23541.9  1000
                dplyr::bind_rows(dflist)  1319.7  1476.35  1772.7105  1566.60  1668.40  13891.0  1000
           data.table::rbindlist(dflist)   466.8   603.20   828.2240   661.35   713.50  85147.2  1000
         plyr::ldply(dflist, data.frame) 14537.2 15161.20 16701.7973 15556.20 16417.50 101030.9  1000
                do.call("rbind", dflist)  9345.5  9851.10 12995.3883 10408.55 16046.65 110697.6  1000
                 Reduce("rbind", dflist) 71055.8 78980.90 97294.2492 83496.15 91894.60 285057.9  1000
 purrr::map_df(dflist, dplyr::bind_rows) 15097.9 15654.65 17361.9197 16020.10 16681.60 660603.4  1000
              collapse::unlist2d(dflist)   186.7   254.80   528.7272   312.90   351.90  88946.8  1000
R version 4.3.2 (2023-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19045)

> packageVersion("plyr")
[1] ‘1.8.9’
> packageVersion("dplyr")
[1] ‘1.1.4’
> packageVersion("data.table")
[1] ‘1.14.10’
> packageVersion("purrr")
[1] ‘1.0.2’
> packageVersion("collapse")
[1] ‘2.0.7’
> packageVersion("microbenchmark")
[1] ‘1.4.10’
Wo answered 21/7, 2016 at 16:32 Comment(1)
This is a great answer. I ran the same thing (same OS, same packages, different randomization because you don't set.seed) but saw some differences in worst-case performance. rbindlist actually had the best worst-case as well as best typical-case in my resultsHealth

There is also bind_rows(x, ...) in dplyr.

> system.time({ df.Base <- do.call("rbind", listOfDataFrames) })
   user  system elapsed 
   0.08    0.00    0.07 
> 
> system.time({ df.dplyr <- as.data.frame(bind_rows(listOfDataFrames)) })
   user  system elapsed 
   0.01    0.00    0.02 
> 
> identical(df.Base, df.dplyr)
[1] TRUE
Underneath answered 29/4, 2015 at 0:32 Comment(1)
technically speaking you do not need the as.data.frame - all that does is make it exclusively a data.frame, as opposed to also a tbl_df (from dplyr)Sagacious

Here's another way this can be done (just adding it to the answers because Reduce is a very effective functional tool that is often overlooked as a replacement for loops). In this particular case, neither of these is significantly faster than do.call.

using base R:

df <- Reduce(rbind, listOfDataFrames)

or, using the tidyverse:

library(tidyverse) # or, library(dplyr); library(purrr)
df <- listOfDataFrames %>% reduce(bind_rows)
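As an illustrative aside (not from the original answer): Reduce() folds the list pairwise, so Reduce(rbind, list(d1, d2, d3)) is the same as rbind(rbind(d1, d2), d3). Each step copies everything accumulated so far, which is why the larger benchmarks elsewhere in this thread show it scaling poorly on long lists.

d1 <- data.frame(x = 1); d2 <- data.frame(x = 2); d3 <- data.frame(x = 3)
identical(Reduce(rbind, list(d1, d2, d3)), rbind(rbind(d1, d2), d3))
# [1] TRUE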
Ditch answered 17/5, 2017 at 2:26 Comment(1)
I always use base unless a package is really so much better. . .thanks for base solutionRadian

How it should be done in the tidyverse:

df.dplyr.purrr <- listOfDataFrames %>% map_df(bind_rows)
Carrollcarronade answered 16/5, 2017 at 13:27 Comment(2)
Why would you use map if bind_rows can take a list of dataframes?Organometallic
As of purrr 1.0.0 *_df() functions have been superseded.Melodie

The only thing that the solutions with data.table are missing is the identifier column to know which data frame in the list the data came from.

Something like this:

df_id <- data.table::rbindlist(listOfDataFrames, idcol = TRUE)

The idcol parameter adds a column (.id) identifying which data frame in the list each row came from. The result looks something like this:

.id a         b           c
1   u   -0.05315128 -1.31975849 
1   b   -1.00404849 1.15257952  
1   y   1.17478229  -0.91043925 
1   q   -1.65488899 0.05846295  
1   c   -1.43730524 0.95245909  
1   b   0.56434313  0.93813197  
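A small follow-on sketch (based on data.table's documented behaviour for idcol, with a made-up name "source" for illustration): if the list elements are named, those names are used in the identifier column instead of integer positions, and passing a string instead of TRUE lets you name the column yourself.

library(data.table)
named <- setNames(listOfDataFrames[1:2], c("first_run", "second_run"))
head(rbindlist(named, idcol = "source"))  # id column is called "source" and holds the list names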
Boardman answered 22/6, 2017 at 15:30 Comment(0)

An updated visual for those wanting to compare some of the recent answers (I wanted to compare the purrr and dplyr solutions). Basically I combined the answers from @TheVTM and @rmf.

[benchmark plot]

Code:

library(microbenchmark)
library(data.table)
library(tidyverse)

dflist <- vector(length=100,mode="list")
for(i in 1:100)
{
  dflist[[i]] <- data.frame(a=runif(n=260),b=runif(n=260),
                            c=rep(LETTERS,10),d=rep(LETTERS,10))
}


mb <- microbenchmark(
  dplyr::bind_rows(dflist),
  data.table::rbindlist(dflist),
  purrr::map_df(dflist, bind_rows),
  do.call("rbind",dflist),
  times=500)

ggplot2::autoplot(mb)

Session Info:

sessionInfo()
R version 3.4.1 (2017-06-30)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

Package Versions:

> packageVersion("tidyverse")
[1] ‘1.1.1’
> packageVersion("data.table")
[1] ‘1.10.0’
Nihon answered 22/8, 2017 at 17:4 Comment(0)

A very fast option is the C-written collapse::unlist2d:

library(collapse)
unlist2d(listOfDataFrames)
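As far as I can tell from the collapse documentation (stated here as an assumption, so check ?unlist2d), unlist2d also generates identifier columns by default via its idcols argument, which you can rename with a string or switch off entirely:

library(collapse)
head(unlist2d(listOfDataFrames, idcols = "source"))  # assumed: adds a "source" column of list positions
# unlist2d(listOfDataFrames, idcols = FALSE)         # assumed: no identifier column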

Since purrr 1.0.0, another option is list_rbind (which supersedes flatten_dfr and (partly) map_dfr):

library(purrr)
list_rbind(listOfDataFrames, names_to = "column_label")

unlist2d is faster than any other option:

library(microbenchmark)
mb <- microbenchmark(
  bind_rows = dplyr::bind_rows(listOfDataFrames),
  rbindlist = data.table::rbindlist(listOfDataFrames),
  do.call = do.call("rbind", listOfDataFrames),
  list_rbind = list_rbind(listOfDataFrames),
  unlist2d = unlist2d(listOfDataFrames), times = 1000)

# Unit: microseconds
#        expr       min         lq       mean     median         uq      max neval
#   bind_rows  1590.601  2139.7010  2807.0167  2335.2510  2717.3510  51773.4  1000
#   rbindlist   613.401   890.6015  1438.0510  1012.7505  1318.3015  13893.4  1000
#     do.call 12009.201 19973.7010 25526.6986 22102.8005 25239.9510 151914.2  1000
#  list_rbind  1287.401  1781.1510  2510.0104  1970.5515  2282.3515 115803.6  1000
#    unlist2d   245.401   400.3015   937.1731   488.3005   690.5015  12683.0  1000
Goby answered 22/12, 2022 at 10:7 Comment(0)

Unfortunately, the best answer data.table::rbindlist(x, fill=TRUE) didn't work for me. Instead, it corrupted my data, which I found out only during analysis when some rows that had values before merging lost them after merging.

Other solutions using merge or rbind also didn't work due to a differing number of columns.

So I had to develop my own solution. It's two short functions using only base R; nothing else is required.

The first issue is that we can't subset a data.frame by columns that don't exist. But if we solve that, we can just subset each data.frame to a common set of columns and rbind the result.

subset_missing = function(x, select){
    # return x's columns named in `select`, filling any missing ones with NA
    y = lapply(select, \(y) if(y %in% names(x)) x[[y]] else NA)
    data.frame(y) |> setNames(select)
}

This is a cleaned solution from another StackOverflow answer.

Once we have that, the rest is easy!

merge_df = function(x){
    # take the union of all column names, pad each data frame, then rbind
    nm = lapply(x, colnames) |> unlist() |> unique()
    y = lapply(x, subset_missing, select = nm)
    do.call(what = rbind, y)
}

Now, let's test it:

df1 = data.frame(a = c(1:5), b = c(6:10))
df2 = data.frame(a = c(11:15), b = c(16:20), c = LETTERS[1:5])

merge_df(list(df1, df2))

#     a  b    c
# 1   1  6 <NA>
# 2   2  7 <NA>
# 3   3  8 <NA>
# 4   4  9 <NA>
# 5   5 10 <NA>
# 6  11 16    A
# 7  12 17    B
# 8  13 18    C
# 9  14 19    D
# 10 15 20    E

It will definitely be slower than many of the solutions here since it is solving a slightly different problem, but it actually works for my issue, so I hope other people will have luck with it too.

Obsolete answered 1/2 at 3:6 Comment(0)
