Trouble converting long list of data.frames (~1 million) to single data.frame using do.call and ldply

I know there are many questions here on SO about ways to convert a list of data.frames to a single data.frame using do.call or ldply, but this question is about understanding the inner workings of both methods and trying to figure out why I can't get either to work for concatenating a list of almost 1 million data.frames of the same structure, same field names, etc. into a single data.frame. Each data.frame has one row and 21 columns.

The data started out as a JSON file, which I converted to a list using fromJSON, then ran another lapply to extract the part of each element I needed and convert it to a data.frame, so I ended up with a list of data.frames.
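
Roughly, the construction looked like this (a sketch from memory; the file name and the "fields" element are placeholders for my actual JSON structure):

library(rjson)
records <- fromJSON(file = "records.json")   # parse the JSON file into a list of records
df_list <- lapply(records, function(rec) {
  # pull out the part of each record I need and turn it into a one-row data.frame
  as.data.frame(rec[["fields"]], stringsAsFactors = FALSE)
})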

I've tried:

df <- do.call("rbind", list)
df <- ldply(list)

but I've had to kill the process after letting it run for up to 3 hours without getting anything back.

Is there a more efficient way of doing this? How can I troubleshoot what is happening and why it is taking so long?

FYI - I'm using RStudio Server on a 72GB quad-core server running RHEL, so I don't think memory is the problem. sessionInfo() output is below:

> sessionInfo()
R version 2.14.1 (2011-12-22)
Platform: x86_64-redhat-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=C                 LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] multicore_0.1-7 plyr_1.7.1      rjson_0.2.6    

loaded via a namespace (and not attached):
[1] tools_2.14.1
> 
Statuary answered 15/3, 2012 at 21:25 Comment(18)
I can't promise this will lead to a good solution, but it might be worthwhile to investigate how many of your df's you can rbind before running into trouble. 1k? 10k? 100k?Ungovernable
@Ungovernable I profiled earlier but here are recent metrics for system.time for ldply(list). 1k: 0.994 0.001 0.995 , 10k: 19.896 2.126 22.025, 100k: still going...Statuary
You might want to check out Shane's answer and Hadley's comments here, if you haven't already - rbind.fill might be faster.Liger
And, an utter shot in the dark: what happens if you use Reduce(list, f = rbind)?Liger
system.time for 100k: 1153.597 17.126 1170.930 - exponential growth...Statuary
@MattParker I had seen that question and haven't tried rbind.fill. I thought rbind.fill was for when you have df's with different widths and field names. Also, Reduce()? Didn't even know it existed. Will try. ThanksStatuary
@MattParker system.time(Reduce(list, f = rbind)): 2.237 0.000 2.238 . slower than ldply. Tried Reduce(list(f = ldply)) and got an error "Error: attempt to apply non-function"Statuary
Try Dominik's answer hereLanders
rbind.fill and ldply have similar performance, and in both cases a 10x increase in list length causes a ~50x increase in time.Statuary
data.table and rbindlist are the way to go here!Inaugurate
@wahalulu. When you get a chance to look at the new answer, using a new function in data.table, could you consider moving the accept to mnel's answer please? However, I don't know how S.O. etiquette works when a better answer comes along a long time later, especially when that new answer uses new features not available originally. rbindlist is a conclusive solution though, which is many times faster than do.call("rbind",...), and this question and answers are all about speed for large data.Chetnik
@MatthewDowle: don't count me out yet! I have an updated solution that gets pretty close. Thanks for the inspiration! ;-)Marshy
Ok! Make sure to compare how each scales; e.g., a 10GB example :) Is your solution in C?Chetnik
@MatthewDowle: I added my new solution to my answer; it only uses base R. I'll test it on a 10GB data set as soon as I'm able to access a computer with enough RAM. :)Marshy
@JoshuaUlrich Maybe 10GB isn't necessary. I now realise the test data for this benchmark is just 40MB. It appears quite shocking how slow do.call(rbind,...) is then on that tiny size. But, the output, at 250,000 rows and 2 columns is just 3MB. So maybe it's related to this benchmark: a large number (50,000) of very small data.frame (5x2) with all those (identical) column names repeated over and over in the input. Perhaps do.call is checking all those column name vectors. Anyway, we could scale up from 40MB to 400MB at least.Chetnik
@MatthewDowle I just saw mnel's new answer. I hadn't visited SO in a while. Perhaps I should keep this question as is with the accepted answer, and ask/answer a new question specifically for large data, and copy mnel's answer?Statuary
@Statuary I'm not sure what's best. Perhaps ask in the R chat room.Chetnik
wow, I just tried rbindlist on another case and it rocks!Statuary

Given that you are looking for performance, it appears that a data.table solution should be suggested.

There is a function, rbindlist, which does the same thing as do.call(rbind, list) but is much faster:

library(data.table)
X <- replicate(50000, data.table(a=rnorm(5), b=1:5), simplify=FALSE)
system.time(rbindlist.data.table <- rbindlist(X))
##  user  system elapsed 
##  0.00    0.01    0.02

It is also very fast for a list of data.frames:

Xdf <- replicate(50000, data.frame(a=rnorm(5), b=1:5), simplify=FALSE)

system.time(rbindlist.data.frame <- rbindlist(Xdf))
##  user  system elapsed 
##  0.03    0.00    0.03

For comparison

system.time(docall <- do.call(rbind, Xdf))
##  user  system elapsed 
## 50.72    9.89   60.88 

And some proper benchmarking

library(rbenchmark)
benchmark(rbindlist.data.table = rbindlist(X), 
           rbindlist.data.frame = rbindlist(Xdf),
           docall = do.call(rbind, Xdf),
           replications = 5)
##                   test replications elapsed    relative user.self sys.self 
## 3               docall            5  276.61 3073.444445    264.08     11.4 
## 2 rbindlist.data.frame            5    0.11    1.222222      0.11      0.0 
## 1 rbindlist.data.table            5    0.09    1.000000      0.09      0.0 

And against @JoshuaUlrich's solutions:

benchmark(use.rbl.dt    = rbl.dt(X),
          use.rbl.ju    = rbl.ju(Xdf),
          use.rbindlist = rbindlist(X),
          replications  = 5)

##              test replications elapsed relative user.self 
## 3  use.rbindlist            5    0.10      1.0      0.09
## 1     use.rbl.dt            5    0.10      1.0      0.09
## 2     use.rbl.ju            5    0.33      3.3      0.31 

I'm not sure you really need to use as.data.frame, because a data.table inherits from class data.frame.
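
For example, a quick check on the result from above shows the inheritance:

DT <- rbindlist(Xdf)
class(DT)                    # "data.table" "data.frame"
inherits(DT, "data.frame")   # TRUE, so ordinary data.frame operations work on it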

Inaugurate answered 5/9, 2012 at 22:46 Comment(5)
How did you get do.call(rbind,Xdf) to run in 60s? It takes 90s on my 3.33GHz dual-core CPU. Maybe your RAM is significantly faster than mine?Marshy
Perhaps -- my machine is a 3.00GHz Core 2 Duo running Windows XP -- work computer. I've added benchmarking to get a more reliable estimate.Inaugurate
I am updating the accepted answer to @Inaugurate's use of rbindlist in the data.table package. This is by far the fastest and most efficient answer, but I would like to say that all previous answers are also very creative!Statuary
Goodbye check-mark. I will always remember our time together... :'(Marshy
@JoshuaUlrich hey, I need to check out your updated answer... I wish I could give two checkmarks per question ;)Statuary

rbind.data.frame does a lot of checking you don't need. This should be a pretty quick transformation if you only do exactly what you want.

# Use data from Josh O'Brien's post.
set.seed(21)
X <- replicate(50000, data.frame(a=rnorm(5), b=1:5), simplify=FALSE)
system.time({
Names <- names(X[[1]])  # Get data.frame names from first list element.
# For each name, extract its values from each data.frame in the list.
# This provides a list with an element for each name.
Xb <- lapply(Names, function(x) unlist(lapply(X, `[[`, x)))
names(Xb) <- Names          # Give Xb the correct names.
Xb.df <- as.data.frame(Xb)  # Convert Xb to a data.frame.
})
#    user  system elapsed 
#   3.356   0.024   3.388 
system.time(X1 <- do.call(rbind, X))
#    user  system elapsed 
# 169.627   6.680 179.675
identical(X1,Xb.df)
# [1] TRUE

Inspired by the data.table answer, I decided to try and make this even faster. Here's my updated solution, to try and keep the check mark. ;-)

# My "rbind list" function
rbl.ju <- function(x) {
  u <- unlist(x, recursive=FALSE)   # flatten one level: every column from every data.frame
  n <- names(u)
  un <- unique(n)                   # the original column names
  # for each column name, concatenate its pieces across all list elements
  l <- lapply(un, function(N) unlist(u[N==n], FALSE, FALSE))
  names(l) <- un
  as.data.frame(l)                  # return the combined data.frame
}
# simple wrapper to rbindlist that returns a data.frame
rbl.dt <- function(x) {
  as.data.frame(rbindlist(x))
}

library(data.table)
if(packageVersion("data.table") >= '1.8.2') {
  system.time(dt <- rbl.dt(X))  # rbindlist only exists in recent versions
}
#    user  system elapsed 
#    0.02    0.00    0.02
system.time(ju <- rbl.ju(X))
#    user  system elapsed 
#    0.05    0.00    0.05 
identical(dt,ju)
# [1] TRUE
Marshy answered 16/3, 2012 at 0:39 Comment(4)
+1 -- This is a great idea. My suggestion is close enough in its speed that I wonder whether it could beat this approach, if more carefully tuned. But realistically, this is the way I'd go.Beseech
+1 -- Should all the columns have the same class, I would also consider as.data.frame(matrix(unlist(X), nrow = length(X), byrow=TRUE)).Burgwell
Thanks guys, this looks great. I'll implement this tomorrow morning and let you know the results.Statuary
Beautiful! ~1m dfs into a single df in about 600 sec. I can live with that performance, and this will become the method to merge df's in lists from now on, as long as they have the same structure, which is usually my case.Statuary

Your observation that the time taken increases exponentially with the number of data.frames suggests that breaking the rbinding into two stages could speed things up.

This simple experiment seems to confirm that that's a very fruitful path to take:

## Make a list of 50,000 data.frames
X <- replicate(50000, data.frame(a=rnorm(5), b=1:5), simplify=FALSE)

## First, rbind together all 50,000 data.frames in a single step
system.time({
    X1 <- do.call(rbind, X)
})
#    user  system elapsed 
# 137.08   57.98  200.08 


## Doing it in two stages cuts the processing time by >95%
##   - In Stage 1, 100 groups of 500 data.frames are rbind'ed together
##   - In Stage 2, the resultant 100 data.frames are rbind'ed
system.time({
    X2 <- lapply(1:100, function(i) do.call(rbind, X[((i*500)-499):(i*500)]))
    X3 <- do.call(rbind, X2)
}) 
#    user  system elapsed 
#    6.14    0.05    6.21 


## Checking that the results are the same
identical(X1, X3)
# [1] TRUE
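
The same two-stage idea generalizes to any list length; here is a small helper along those lines (just a sketch, with chunk_size = 500 mirroring the grouping above):

rbind_in_chunks <- function(lst, chunk_size = 500) {
  # Stage 1: rbind the list in groups of chunk_size elements
  groups <- unname(split(seq_along(lst), ceiling(seq_along(lst) / chunk_size)))
  pieces <- lapply(groups, function(idx) do.call(rbind, lst[idx]))
  # Stage 2: rbind the (much shorter) list of intermediate results
  do.call(rbind, pieces)
}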
Younglove answered 15/3, 2012 at 22:45 Comment(1)
rbind.fill gives me almost the same timing (less than 10% difference) as your X2/X3 solution.Holmquist

You have a list of data.frames that each have a single row. If it is possible to convert each of those to a vector, I think that would speed things up a lot.
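
For example, if all 21 columns were numeric, something along these lines should be far cheaper than rbind-ing data.frames (a sketch only; df_list stands in for your list, and with mixed column types unlist would coerce everything to character, so this wouldn't apply):

vecs <- lapply(df_list, unlist, use.names = FALSE)  # one numeric vector per one-row data.frame
mat  <- do.call(rbind, vecs)                        # rbind-ing vectors just fills a matrix
colnames(mat) <- names(df_list[[1]])
big_df <- as.data.frame(mat)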

However, assuming that they need to be data.frames, I'll create a function with code borrowed from Dominik's answer at Can rbind be parallelized in R?

do.call.rbind <- function (lst) {
  # Repeatedly rbind adjacent pairs, halving the list length each pass,
  # until a single data.frame remains.
  while (length(lst) > 1) {
    idxlst <- seq(from = 1, to = length(lst), by = 2)
    lst <- lapply(idxlst, function(i) {
      if (i == length(lst)) {
        return(lst[[i]])  # odd element at the end: carry it forward unchanged
      }
      return(rbind(lst[[i]], lst[[i + 1]]))
    })
  }
  lst[[1]]
}

I have been using this function for several months, and have found it to be faster and to use less memory than do.call(rbind, ...) [the disclaimer is that I've pretty much only used it on xts objects].

The more rows that each data.frame has, and the more elements that the list has, the more beneficial this function will be.

If you have a list of 100,000 numeric vectors, do.call(rbind, ...) will be better. If you have a list of length one billion, this will be better.

> df <- lapply(1:10000, function(x) data.frame(x = sample(21, 21)))
> library(rbenchmark)
> benchmark(a=do.call(rbind, df), b=do.call.rbind(df))
test replications elapsed relative user.self sys.self user.child sys.child
1    a          100 327.728 1.755965   248.620   79.099          0         0
2    b          100 186.637 1.000000   181.874    4.751          0         0

The relative speed-up will be exponentially better as you increase the length of the list.

Landers answered 15/3, 2012 at 23:36 Comment(0)
