How can I perform full outer joins of large data sets in R?

I am trying to do data analysis in R on a group of medium-sized datasets. One of the analyses requires a full outer join across roughly 24-48 files, each of which has about 60 columns and up to 450,000 lines. So I've been running into memory issues a lot.

I thought ffbase or sqldf would help, but apparently a full outer join is not possible with either of them.

Is there a workaround? A package I haven't found yet?

Nowicki answered 6/6, 2013 at 14:30 Comment(8)
Have a look into data.table. How much RAM do you have?Blockbusting
Additionally, how is it you think that sqldf cannot do full outer joins? SQLite is slightly tricky, in that you have to do the left/right joins and then union them, but sqldf supports other db backends as well...Dialectician
If you pass sqldf an outer join statement, it returns an error that explicitly says outer joins are not supported yet.Nowicki
The message says that RIGHT and FULL outer joins aren't supported. There's one remaining, and it's all you need. Not to mention the fact that you still have other db backends available to you besides SQLite.Dialectician
But with only 3-4 GB of RAM, I'm suddenly skeptical that you'll have much luck doing outer joins on 25+ tables each with hundreds of thousands of rows, no matter what method you use.Dialectician
RAM is not necessarily a constraint with a database as long as you work out of memory (on disk). With sqldf, specify sqldf(..., dbname = tempfile()); see the sketch after these comments.Unaccustomed
The tempfile() should help with the memory problem, which is why I want to use sqldf. I'll give it a shot with LEFT OUTER JOIN and see if that's possible.Nowicki
This works with LEFT OUTER JOIN, but I should perhaps post a new question. For a second task I still need FULL OUTER JOIN on the same files...Nowicki
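
For what it's worth, here is a minimal sketch of the out-of-memory sqldf call suggested in the comments above; the example frames and the id column are illustrative, not from the original question:

library(sqldf)
df1 <- data.frame(id = 1:3, a = letters[1:3])
df2 <- data.frame(id = 2:4, b = letters[2:4])
# dbname = tempfile() tells sqldf to build the join in a temporary
# on-disk SQLite database instead of holding everything in RAM
sqldf("select * from df1 left outer join df2 on df1.id = df2.id",
      dbname = tempfile())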

Here is a simple example that illustrates how to do a full outer join of two datasets:

library(sqldf)
dat1 <- data.frame(x = 1:5, y = letters[1:5])
dat2 <- data.frame(w = 3:8, z = letters[3:8])

# List the columns explicitly in both halves so they line up by position;
# UNION then drops the matched rows that appear in both halves.
sqldf("select dat1.x, dat1.y, dat2.w, dat2.z
         from dat1 left outer join dat2 on dat1.x = dat2.w
       UNION
       select dat1.x, dat1.y, dat2.w, dat2.z
         from dat2 left outer join dat1 on dat1.x = dat2.w")

   x    y  w    z
1  1    a NA <NA>
2  2    b NA <NA>
3  3    c  3    c
4  4    d  4    d
5  5    e  5    e
6 NA <NA>  6    f
7 NA <NA>  7    g
8 NA <NA>  8    h

(Row order after a UNION is not guaranteed; add an ORDER BY if you need one.)

There it is, a full outer join using sqldf and SQLite as a backend.

As I also mentioned, sqldf supports more backends than SQLite. A Google search reveals that full outer joins are emulated the exact same way in MySQL. I am less familiar with PostgreSQL, but this question suggests that full outer joins are supported directly there as well.
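
A hedged sketch of what that would look like, assuming the RPostgreSQL package is installed and a PostgreSQL server is configured for sqldf (the driver name and connection options are assumptions; see ?sqldf):

library(sqldf)
# assumes RPostgreSQL is installed and a reachable PostgreSQL server is
# configured via sqldf's connection options; see ?sqldf
options(sqldf.driver = "PostgreSQL")
# PostgreSQL supports FULL OUTER JOIN natively, so no UNION workaround is needed
sqldf("select * from dat1 full outer join dat2 on dat1.x = dat2.w")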

Dialectician answered 8/6, 2013 at 20:1 Comment(4)
I did the above but somehow I am getting twice the number of rows compared to merge(x = dat1, y = dat2, by.x = "x", by.y = "w", all = TRUE). I have no idea why!Sestina
This works only if there are no duplicate rows in either dat1 or dat2, which is the case here. UNION returns unique rows and would remove the duplicates.Procreant
@Joran, before posting my comment I upvoted your answer because I found it useful and relevant in this case. However, I still think that my comment is helpful, because if you use UNION ALL you'll get the common rows twice, which is probably not what is wanted and makes it different from FULL OUTER JOIN.Procreant
To demonstrate this, just run the above code with UNION ALL => 11 rows, but the expected count is 8.Procreant

Without sqldf, here is a smart and simple solution:

merge(a, b, by = "col", all = TRUE)
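
Applied to the example frames from the sqldf answer above (note that base R's merge works entirely in RAM, so it may still hit the memory limits described in the question):

dat1 <- data.frame(x = 1:5, y = letters[1:5])
dat2 <- data.frame(w = 3:8, z = letters[3:8])
# all = TRUE asks merge for the full outer join
merge(dat1, dat2, by.x = "x", by.y = "w", all = TRUE)

To join many frames that share a key column, the merge can be folded over a list (list_of_dfs is a hypothetical name):

Reduce(function(a, b) merge(a, b, by = "col", all = TRUE), list_of_dfs)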

Sour answered 15/10, 2014 at 13:46 Comment(0)

If you are using ffbase, you can get your desired full outer join by combining expand.ffgrid with merge.ffdf. expand.ffgrid is like expand.grid but works with ff vectors, so it will not overflow your RAM, and merge.ffdf lets you merge in another ffdf without overflowing your RAM, storing the data on disk. An example is below.

require(ffbase)
# build every combination of the two key vectors as an on-disk ffdf
x <- ffseq(1, 10000)
y <- ff(factor(LETTERS))
allcombinations <- expand.ffgrid(x, y)
# the data to attach to (some of) those key combinations
addme <- data.frame(Var1 = c(1, 2), Var2 = c("A", "B"), measure = rnorm(2))
addme <- as.ffdf(addme)
# left-join addme onto the full grid, keeping all key combinations
myffdf <- merge(allcombinations, addme,
                by.x = c("Var1", "Var2"), by.y = c("Var1", "Var2"), all.x = TRUE)
myffdf[1:10, ]

Next, look at the "delete rows ff package" question for how to subset the resulting myffdf.

Do have a look at ?ffbase::expand.ffgrid and ?ffbase::merge.ffdf
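
A hedged sketch of that subsetting step, assuming ffbase's subset method for ffdf objects (the filter condition is illustrative):

# keep only the grid rows that actually matched something in addme,
# i.e. drop rows whose joined measure came back missing
myffdf_matched <- subset(myffdf, !is.na(measure))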

Sharitasharity answered 9/6, 2013 at 10:18 Comment(0)

This might work (note: the key column must be the first column in every dataset).

library(ff)
library(ffbase)

fullouterjoin <- function(ffdf1, ffdf2){

    # do a left outer join
    leftjoin <- merge(ffdf1, ffdf2, by = "key", all.x = TRUE)

    # do a right outer join (it's just a left outer join with the objects swapped)
    rightjoin <- merge(ffdf2, ffdf1, by = "key", all.x = TRUE)

    # swap the column order (put the ffdf1 columns first and the ffdf2 columns after)
    srightjoin <- rightjoin[c(names(ffdf1), names(ffdf2)[2:length(ffdf2)])]

    # stack left outer join on top of the (swapped) right outer join
    stacked <- rbind(leftjoin, srightjoin)

    # remove duplicate rows
    uniques <- unique(stacked)

    # that's it
    return(uniques)
}

usage:

newffdf <- fullouterjoin(some_ffdf, another_ffdf)

I'm not saying it's fast, but it might overcome the memory barrier.

Amplification answered 29/11, 2013 at 4:27 Comment(2)
I'm working with much larger data sets. I've tried your method and I get "Error in row.names<-.data.frame(*tmp*, value = rownam[as.integer(i2)]) : duplicate 'row.names' are not allowed" upon executing the uniques <- unique(stacked) part. Do you have any idea where this is coming from?Depravity
Sorry, no idea (I haven't touched R since then).Amplification
