What's the fastest way to merge/join data.frames in R?

For example (not sure if this is the most representative example, though):

N <- 1e6
d1 <- data.frame(x=sample(N,N), y1=rnorm(N))
d2 <- data.frame(x=sample(N,N), y2=rnorm(N))

This is what I've got so far:

d <- merge(d1,d2)
# 7.6 sec

library(plyr)
d <- join(d1,d2)
# 2.9 sec

library(data.table)
dt1 <- data.table(d1, key="x")
dt2 <- data.table(d2, key="x")
d <- data.frame( dt1[dt2,list(x,y1,y2=dt2$y2)] )
# 4.9 sec

library(sqldf)
sqldf()
sqldf("create index ix1 on d1(x)")
sqldf("create index ix2 on d2(x)")
d <- sqldf("select * from d1 inner join d2 on d1.x=d2.x")
sqldf()
# 17.4 sec
Almund answered 1/12, 2010 at 7:47 Comment(3)
The proper way to use sqldf is pointed out below by Gabor: create only one index (say on d1) and use main.d1 instead of d1 in the select statement (otherwise it won't use the index). The timing in this case is 13.6 sec. Building indexes on both tables is not necessary in the data.table case either: just do "dt2 <- data.table(d2)" and the timing will be 3.9 sec (see the sketch after these comments).Almund
Both answers provide valuable information, worth reading both (though only one can be "accepted").Almund
You are comparing a left join to an inner join in your question.Hedda
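
A minimal sketch of the single-key data.table variant mentioned in the first comment above (d1 and d2 are as defined in the question; only the first table needs a key):

library(data.table)
dt1 <- data.table(d1, key = "x")   # key only the first table
dt2 <- data.table(d2)              # no key needed on the second table
d <- data.frame(dt1[dt2, list(x, y1, y2 = dt2$y2)])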

The match approach works when there is a unique key in the second data frame for each key value in the first. If there are duplicates in the second data frame then the match and merge approaches are not the same. Match is, of course, faster since it is not doing as much. In particular it never looks for duplicate keys. (continued after code)

DF1 = data.frame(a = c(1, 1, 2, 2), b = 1:4)
DF2 = data.frame(b = c(1, 2, 3, 3, 4), c = letters[1:5])
merge(DF1, DF2)
    b a c
  1 1 1 a
  2 2 1 b
  3 3 2 c
  4 3 2 d
  5 4 2 e
DF1$c = DF2$c[match(DF1$b, DF2$b)]
DF1$c
[1] a b c e
Levels: a b c d e

> DF1
  a b c
1 1 1 a
2 1 2 b
3 2 3 c
4 2 4 e

In the sqldf code posted in the question, it might appear that indexes were used on the two tables, but in fact they are placed on tables that are overwritten before the SQL select ever runs, and that, in part, accounts for why it's so slow. The idea of sqldf is that the data frames in your R session constitute the database, not the tables in SQLite. Each time the code refers to an unqualified table name, sqldf looks for it in your R workspace, not in SQLite's main database. Thus the select statement shown reads d1 and d2 from the workspace into SQLite's main database, clobbering the indexed copies that were there. As a result it does a join with no indexes. If you wanted to use the versions of d1 and d2 that were already in SQLite's main database, you would have to refer to them as main.d1 and main.d2, not as d1 and d2. Also, if you are trying to make it run as fast as possible, note that a simple join can't make use of indexes on both tables, so you can save the time of creating one of the indexes. The sketch immediately below and the benchmark code further down illustrate these points.
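
A minimal standalone version of the question's sqldf code with these corrections applied (one index, and main.d1 so the indexed copy is actually used); exact timings will of course vary:

library(sqldf)
sqldf()                              # open a persistent connection
sqldf("create index ix1 on d1(x)")   # uploads d1 into main and builds the index
# main.d1 is the indexed copy already in the database; the unqualified d2 is
# read from the R workspace at select time
d <- sqldf("select * from main.d1 join d2 using(x)")
sqldf()                              # close the connection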

It's worth noting that the precise computation can make a huge difference to which package is fastest. For example, we do a merge and an aggregate below, and the results are nearly reversed for the two. In the first example, from fastest to slowest, we get: data.table, plyr, merge and sqldf, whereas in the second example it's sqldf, aggregate, data.table and plyr -- nearly the reverse of the first. In the first example sqldf is 3x slower than data.table; in the second it's 200x faster than plyr and 100x faster than data.table. Below we show the input code, the output timings for the merge and the output timings for the aggregate. It's also worth noting that sqldf is based on a database and can therefore handle objects larger than R can hold in memory (via the dbname argument of sqldf; a short sketch of that usage appears after the timings below), while the other approaches are limited to processing in main memory. We have illustrated sqldf with SQLite, but it also supports the H2 and PostgreSQL databases.

library(plyr)
library(data.table)
library(sqldf)

set.seed(123)
N <- 1e5
d1 <- data.frame(x=sample(N,N), y1=rnorm(N))
d2 <- data.frame(x=sample(N,N), y2=rnorm(N))

g1 <- sample(1:1000, N, replace = TRUE)
g2 <- sample(1:1000, N, replace = TRUE)
d <- data.frame(d1, g1, g2)

library(rbenchmark)

benchmark(replications = 1, order = "elapsed",
   merge = merge(d1, d2),
   plyr = join(d1, d2),
   data.table = { 
      dt1 <- data.table(d1, key = "x")
      dt2 <- data.table(d2, key = "x")
      data.frame( dt1[dt2,list(x,y1,y2=dt2$y2)] )
      },
   sqldf = sqldf(c("create index ix1 on d1(x)",
      "select * from main.d1 join d2 using(x)"))
)

set.seed(123)
N <- 1e5
g1 <- sample(1:1000, N, replace = TRUE)
g2 <- sample(1:1000, N, replace = TRUE)
d <- data.frame(x=sample(N,N), y=rnorm(N), g1, g2)

benchmark(replications = 1, order = "elapsed",
   aggregate = aggregate(d[c("x", "y")], d[c("g1", "g2")], mean), 
   data.table = {
      dt <- data.table(d, key = "g1,g2")
      dt[, colMeans(cbind(x, y)), by = "g1,g2"]
   },
   plyr = ddply(d, .(g1, g2), summarise, avx = mean(x), avy=mean(y)),
   sqldf = sqldf(c("create index ix on d(g1, g2)",
      "select g1, g2, avg(x), avg(y) from main.d group by g1, g2"))
)

The output from the benchmark call comparing the merge calculations is:

Joining by: x
        test replications elapsed relative user.self sys.self user.child sys.child
3 data.table            1    0.34 1.000000      0.31     0.01         NA        NA
2       plyr            1    0.44 1.294118      0.39     0.02         NA        NA
1      merge            1    1.17 3.441176      1.10     0.04         NA        NA
4      sqldf            1    3.34 9.823529      3.24     0.04         NA        NA

The output from the benchmark call comparing the aggregate calculations is:

        test replications elapsed  relative user.self sys.self user.child sys.child
4      sqldf            1    2.81  1.000000      2.73     0.02         NA        NA
1  aggregate            1   14.89  5.298932     14.89     0.00         NA        NA
2 data.table            1  132.46 47.138790    131.70     0.08         NA        NA
3       plyr            1  212.69 75.690391    211.57     0.56         NA        NA
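
As noted above, sqldf can also work out of memory via its dbname argument. A minimal sketch of that usage, assuming the d1 and d2 defined earlier (the tempfile() name is purely illustrative):

library(sqldf)
# With dbname set, the data frames are loaded into a file-backed SQLite
# database rather than an in-memory one, so the join is not limited by RAM.
d <- sqldf("select * from d1 join d2 using(x)", dbname = tempfile())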
Theoretical answered 1/12, 2010 at 21:51 Comment(2)
Thank you, Gabor. Excellent points, I made some adjustments via comments to the original question. Actually I guess the order might change even in the "merge" case depending on the relative sizes of the tables, multiplicity of keys etc. (that's why I said I'm not sure if my example is representative). Nonetheless, it's nice to see all the different solutions to the problem.Almund
I appreciate also the comment about the "aggregation" case. Although this is different than the "merge" setup in the question, it is very relevant. I would have actually asked about it in a separate question, but there is already one here #3685992 . You might want to contribute to that one as well, as based on the results above, the sqldf solution might beat all the existing answers there ;)Almund

The 132 seconds reported in Gabor's results for data.table is actually timing base functions colMeans and cbind (the memory allocation and copying induced by using those functions). There are good and bad ways of using data.table, too.

benchmark(replications = 1, order = "elapsed", 
  aggregate = aggregate(d[c("x", "y")], d[c("g1", "g2")], mean),
  data.tableBad = {
     dt <- data.table(d, key = "g1,g2") 
     dt[, colMeans(cbind(x, y)), by = "g1,g2"]
  }, 
  data.tableGood = {
     dt <- data.table(d, key = "g1,g2") 
     dt[, list(mean(x),mean(y)), by = "g1,g2"]
  }, 
  plyr = ddply(d, .(g1, g2), summarise, avx = mean(x), avy=mean(y)),
  sqldf = sqldf(c("create index ix on d(g1, g2)",
      "select g1, g2, avg(x), avg(y) from main.d group by g1, g2"))
  ) 

            test replications elapsed relative user.self sys.self
3 data.tableGood            1    0.15    1.000      0.16     0.00
5          sqldf            1    1.01    6.733      1.01     0.00
2  data.tableBad            1    1.63   10.867      1.61     0.01
1      aggregate            1    6.40   42.667      6.38     0.00
4           plyr            1  317.97 2119.800    265.12    51.05

packageVersion("data.table")
# [1] ‘1.8.2’
packageVersion("plyr")
# [1] ‘1.7.1’
packageVersion("sqldf")
# [1] ‘0.4.6.4’
R.version.string
# R version 2.15.1 (2012-06-22)

Please note that I don't know plyr well, so do check with Hadley before relying on the plyr timings here. Also note that the data.table timings do include the time to convert to data.table and set the key, for fairness.
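
If you instead wanted to time only the aggregation, excluding that one-off conversion and keying cost, a sketch (using the d from Gabor's setup) might look like this; setkey keys the table by reference outside the timed code:

library(data.table)
library(rbenchmark)
dt <- data.table(d)
setkey(dt, g1, g2)   # done once, outside the benchmark
benchmark(replications = 1,
   data.tableGood = dt[, list(mean(x), mean(y)), by = "g1,g2"]
)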


This answer has been updated since it was originally written in Dec 2010. The previous benchmark results are below. Please see the revision history of this answer to see what changed.

              test replications elapsed   relative user.self sys.self
4   data.tableBest            1   0.532   1.000000     0.488    0.020
7            sqldf            1   2.059   3.870301     2.041    0.008
3 data.tableBetter            1   9.580  18.007519     9.213    0.220
1        aggregate            1  14.864  27.939850    13.937    0.316
2  data.tableWorst            1 152.046 285.800752   150.173    0.556
6 plyrwithInternal            1 198.283 372.712406   189.391    7.665
5             plyr            1 225.726 424.296992   208.013    8.004
Perron answered 19/12, 2010 at 13:49 Comment(9)
Because ddply works only with data frames, this example yields worst-case performance. I hope to have a better interface for this type of common operation in a future version.Judgment
FYI: you can't use .Internal calls in CRAN packages, see the CRAN Repository Policy.Acculturation
@JoshuaUlrich You could when the answer was written nearly 2 years ago, iirc. I'll update this answer as data.table automatically optimizes mean now (without calling .Internal internally).Perron
@MatthewDowle: Yeah, I'm not sure when/if it changed. I just know it's the case now. And it's perfectly fine in your answer, just won't work in packages.Acculturation
I was wondering, if vectorized merge is (or will be implemented) in data.table (so that we could pass a vector/list of data frames/data tables as a single argument instead of looping)?Olag
@AleksandrBlekh I don't quite follow. Do you have say 4 data.tables and you want to merge them all on common values? e.g. merge(W,X,Y,Z) except you can't because merge and X[Y] take only 2 tables at a time? If so, interested in which field this occurs and examples: something similar has been raised before.Perron
@MattDowle: Yes, that's correct. I'd like to pass a list of data frames as a single argument. Here's the scenario where I would need this feature: github.com/abnova/diss-floss/blob/master/analysis/sem-models/… (a test client for merging indicator-based data frames into a single data frame/table for further SEM analysis). Note how I have to use either for or lapply to merge all the data tables (options data.table & data.table2; I messed up the latter - would appreciate your advice on fixing it; see the Reduce sketch after these comments).Olag
@AleksandrBlekh Thanks. I've linked your comments here to existing feature request #599. Let's move to there. Your example code nicely shows the for loop, that's good. Could you add more info about "SEM analysis" to that issue? For example I'm guessing that SEM=Scanning electron microscope? Knowing more about the application makes it more interesting to us and helps us prioritise.Perron
@MattDowle: This is just to let you know that I posted my comment at data.table's issue tracker at GitHub in continuation of the discussion (in case you didn't see it). Issue #599, as you requested.Olag
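
A common workaround for the multi-table merge discussed in the comments above is to fold a list of tables with Reduce; this is plain base R, not a data.table feature, and the list below just reuses the question's two tables for illustration:

dfs <- list(d1, d2)    # any number of data.frames/data.tables sharing column "x"
merged <- Reduce(function(a, b) merge(a, b, by = "x"), dfs)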

For simple task (unique values on both sides of join) I use match:

system.time({
    d <- d1
    d$y2 <- d2$y2[match(d1$x,d2$x)]
})

It's far faster than merge (on my machine, 0.13 s vs 3.37 s).

My timings:

  • merge: 3.32s
  • plyr: 0.84s
  • match: 0.12s
Renfred answered 1/12, 2010 at 11:21 Comment(1)
Thank you, Marek. Some explanation of why this is so fast (builds an index/hash table) can be found here: tolstoy.newcastle.edu.au/R/help/01c/2739.htmlAlmund

Thought it would be interesting to post a benchmark with dplyr in the mix (I had a lot of other things running at the time):

            test replications elapsed relative user.self sys.self
5          dplyr            1    0.25     1.00      0.25     0.00
3 data.tableGood            1    0.28     1.12      0.27     0.00
6          sqldf            1    0.58     2.32      0.57     0.00
2  data.tableBad            1    1.10     4.40      1.09     0.01
1      aggregate            1    4.79    19.16      4.73     0.02
4           plyr            1  186.70   746.80    152.11    30.27

packageVersion("data.table")
[1] ‘1.8.10’
packageVersion("plyr")
[1] ‘1.8’
packageVersion("sqldf")
[1] ‘0.4.7’
packageVersion("dplyr")
[1] ‘0.1.2’
R.version.string
[1] "R version 3.0.2 (2013-09-25)"

Just added:

dplyr = summarise(dt_dt, avx = mean(x), avy = mean(y))

and setup the data for dplyr with a data table:

dt <- tbl_dt(d)
dt_dt <- group_by(dt, g1, g2)
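
For comparison, the "dplyr with data frame" variant referred to in the timings further below would look something like this (a sketch; the object name d_df is mine):

d_df <- group_by(d, g1, g2)    # plain data.frame instead of tbl_dt(d)
summarise(d_df, avx = mean(x), avy = mean(y))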

Updated: I removed data.tableBad and plyr, and ran with nothing but RStudio open (i7, 16 GB RAM).

With data.table 1.9 and dplyr with data frame:

            test replications elapsed relative user.self sys.self
2 data.tableGood            1    0.02      1.0      0.02     0.00
3          dplyr            1    0.04      2.0      0.04     0.00
4          sqldf            1    0.46     23.0      0.46     0.00
1      aggregate            1    6.11    305.5      6.10     0.02

With data.table 1.9 and dplyr with data table:

            test replications elapsed relative user.self sys.self
2 data.tableGood            1    0.02        1      0.02     0.00
3          dplyr            1    0.02        1      0.02     0.00
4          sqldf            1    0.44       22      0.43     0.02
1      aggregate            1    6.14      307      6.10     0.01

packageVersion("data.table")
[1] '1.9.0'
packageVersion("dplyr")
[1] '0.1.2'

For consistency, here is the original benchmark with all the methods, using data.table 1.9 and dplyr with a data table:

            test replications elapsed relative user.self sys.self
5          dplyr            1    0.01        1      0.02     0.00
3 data.tableGood            1    0.02        2      0.01     0.00
6          sqldf            1    0.47       47      0.46     0.00
1      aggregate            1    6.16      616      6.16     0.00
2  data.tableBad            1   15.45     1545     15.38     0.01
4           plyr            1  110.23    11023     90.46    19.52

I think this data is too small for the new data.table and dplyr :)

Larger data set:

N <- 1e8
g1 <- sample(1:50000, N, replace = TRUE)
g2 <- sample(1:50000, N, replace = TRUE)
d <- data.frame(x=sample(N,N), y=rnorm(N), g1, g2)

It took around 10-13 GB of RAM just to hold the data before running the benchmark.

Results:

            test replications elapsed relative user.self sys.self
1          dplyr            1   14.88        1      6.24     7.52
2 data.tableGood            1   28.41        1     18.55      9.4

I tried 1 billion rows, but it blew up the RAM; 32 GB would handle it with no problem.


[Edit by Arun] (dotcomken, could you please run this code and paste your benchmarking results? Thanks).

require(data.table)
require(dplyr)
require(rbenchmark)

N <- 1e8
g1 <- sample(1:50000, N, replace = TRUE)
g2 <- sample(1:50000, N, replace = TRUE)
d <- data.frame(x=sample(N,N), y=rnorm(N), g1, g2)

benchmark(replications = 5, order = "elapsed", 
  data.table = {
     dt <- as.data.table(d) 
     dt[, lapply(.SD, mean), by = "g1,g2"]
  }, 
  dplyr_DF = d %.% group_by(g1, g2) %.% summarise(avx = mean(x), avy=mean(y))
) 

As per Arun's request, here is the output of the code you provided me to run:

        test replications elapsed relative user.self sys.self
1 data.table            5   15.35     1.00     13.77     1.57
2   dplyr_DF            5  137.84     8.98    136.31     1.44

Sorry for the confusion, late night got to me.

Using dplyr with a data frame seems to be the less efficient way to process summaries. Are these methods meant to compare the exact functionality of data.table and dplyr, with their data-structure setup steps included? I'd almost prefer to separate that, since most data will need to be cleaned before we group_by or create the data.table. It may be a matter of taste, but I think the most important part is how efficiently the data can be modeled.

Beta answered 26/2, 2014 at 22:45 Comment(4)
Nice update. Thanks. I think your machine is a beast compared to this data set. What's the size of your L2 cache (and L3, if it exists)?Cracked
i7 L2 is 2x256 KB 8-way, L3 is 4 MB 16-way. 128 GB SSD, Win 7 on a Dell inspironBeta
Could you reformat your example? I'm a bit confused. Is data.table better (in this example) than dplyr? If so, under what circumstances?Yuma
I don't understand... Why summarizing here? We're looking for join optionsBrackish

A recent answer to an old question: collapse::join provides a (very) fast alternative to the other options. I also add the more recent data.table join syntax with on; use nomatch = 0 for an inner join.

library(collapse)
join(d1, d2, how = "inner") # `how` defaults to "left"

library(data.table)
dt1[dt2, on = "x", nomatch = 0]

collapse::join uses collapse::fmatch as a workhorse.

d2$y2[fmatch(d1$x, d2$x)]

Benchmark on two data.frames of 1,000,000 rows:

  • Compared with dplyr, match or merge, collapse::join is definitely the way to go for fast joins!
  • Compared with match, collapse::fmatch is also faster.


Data and benchmark code

library(data.table)
library(microbenchmark)
library(dplyr)
library(collapse)

N <- 1e6
d1 <- data.frame(x=sample(N,N), y1=rnorm(N))
d2 <- data.frame(x=sample(N,N), y2=rnorm(N))
dt1 <- as.data.table(d1)
dt2 <- as.data.table(d2)

mb <- 
  microbenchmark(
    "collapse::join" = join(d1, d2, how = "inner"),
    "dt" = dt1[dt2, on = "x", nomatch = 0], 
    "merge" = merge(d1, d2),
    "dplyr" = inner_join(d1, d2),
    
    #The match alternatives for one column joins
    "collapse::fmatch" = d2$y2[fmatch(d1$x, d2$x)],
    "match" = d2$y2[match(d1$x, d2$x)],
    times = 10
  )

# Unit: milliseconds
#              expr       min        lq      mean    median        uq       max neval
#    collapse::join   84.2238   90.0846  103.7113   95.5860  114.3395  141.3556    10
#                dt  209.4132  225.4013  294.2514  249.1585  337.0275  549.9454    10
#             merge 2557.9797 2569.9302 2738.9485 2747.1723 2847.5932 3006.3898    10
#             dplyr  656.8314  698.9619  738.5461  722.4017  744.4237  916.8370    10
#  collapse::fmatch   82.0418   82.3524  106.1968   84.3080   93.5918  276.9252    10
#             match  133.2351  142.0613  195.5774  156.9606  222.0466  353.4924    10
Boreal answered 8/9, 2023 at 8:20 Comment(1)
Thanks, please note though that collapse::join() defaults to the left join (because it is the most common and computationally inexpensive form of join). So you need to pass how = "inner" in your benchmark.Nikko

By using the merge function and its optional parameters:

Inner join: merge(df1, df2) will work for these examples because R automatically joins the frames by common variable names, but you would most likely want to specify merge(df1, df2, by = "CustomerId") to make sure that you were matching on only the fields you desired. You can also use the by.x and by.y parameters if the matching variables have different names in the different data frames.

Outer join: merge(x = df1, y = df2, by = "CustomerId", all = TRUE)

Left outer: merge(x = df1, y = df2, by = "CustomerId", all.x = TRUE)

Right outer: merge(x = df1, y = df2, by = "CustomerId", all.y = TRUE)

Cross join: merge(x = df1, y = df2, by = NULL)
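
A small runnable illustration of these variants; the toy df1/df2 below are my own, not from the question (for keys with different names, by.x/by.y would replace by):

df1 <- data.frame(CustomerId = 1:6,
                  Product    = c(rep("Toaster", 3), rep("Radio", 3)))
df2 <- data.frame(CustomerId = c(2, 4, 6),
                  State      = c(rep("Alabama", 2), "Ohio"))

merge(df1, df2, by = "CustomerId")                # inner join
merge(df1, df2, by = "CustomerId", all = TRUE)    # full (outer) join
merge(df1, df2, by = "CustomerId", all.x = TRUE)  # left outer join
merge(df1, df2, by = "CustomerId", all.y = TRUE)  # right outer join
merge(df1, df2, by = NULL)                        # cross join (Cartesian product)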
Roughhew answered 23/10, 2015 at 12:20 Comment(1)
The question was about performance. You merely provided the syntax for the joins. While helpful, it doesn't answer the question. This answer lacks benchmark data using the OP's examples to show it performs better, or at least highly competitively.Kyne
