How can I perform full outer joins of large data sets in R?

I am trying to do data analysis in R on a group of medium-sized datasets. One of the analyses requires a full outer join across roughly 24-48 files, each of which has about 60 columns and up to 450,000 lines. So I've been running into memory issues a lot.

I thought ffbase or sqldf would help, but apparently a full outer join is not possible with either of them.

Is there a workaround? A package I haven't found yet?

Nowicki answered 6/6, 2013 at 14:30 Comment(8)
Have a look into data.table. How much RAM do you have?Blockbusting
Additionally, how is it you think that sqldf cannot do full outer joins? SQLite is slightly tricky, in that you have to do the left/right joins and then union them, but sqldf supports other db backends as well...Dialectician
If you pass sqldf an outer join statement, it returns an error that explicitly says outer joins are not supported yet.Nowicki
The message says that RIGHT and FULL outer joins aren't supported. There's one remaining, and it's all you need. Not to mention the fact that you still have other db backends available to you besides SQLite.Dialectician
But with only 3-4 GB of RAM, I'm suddenly skeptical that you'll have much luck doing outer joins on 25+ tables each with hundreds of thousands of rows, no matter what method you use.Dialectician
RAM is not necessarily a constraint with a database as long as you work out of memory (on disk). With sqldf, specify sqldf(..., dbname = tempfile()); see the sketch after these comments.Unaccustomed
The tempfile() should help with the memory problem, which is why I want to use sqldf. I'll give it a shot with LEFT OUTER JOIN and see if that's possible.Nowicki
This works with LEFT OUTER JOIN, but I should perhaps post a new question. For a second task I still need FULL OUTER JOIN on the same files...Nowicki
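
For what it's worth, here is a minimal sketch of the out-of-memory sqldf call suggested in the comments above; the example frames and the id column are illustrative, not from the original question:

library(sqldf)
df1 <- data.frame(id = 1:3, a = letters[1:3])
df2 <- data.frame(id = 2:4, b = letters[2:4])
# dbname = tempfile() tells sqldf to build the join in a temporary
# on-disk SQLite database instead of holding everything in RAM
sqldf("select * from df1 left outer join df2 on df1.id = df2.id",
      dbname = tempfile())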

Here is a simple example that illustrates how to do a full outer join of two datasets:

library(sqldf)
dat1 <- data.frame(x = 1:5, y = letters[1:5])
dat2 <- data.frame(w = 3:8, z = letters[3:8])

# List the columns explicitly in both halves so they line up by position;
# UNION then drops the matched rows that appear in both halves.
sqldf("select dat1.x, dat1.y, dat2.w, dat2.z
         from dat1 left outer join dat2 on dat1.x = dat2.w
       UNION
       select dat1.x, dat1.y, dat2.w, dat2.z
         from dat2 left outer join dat1 on dat1.x = dat2.w")

   x    y  w    z
1  1    a NA <NA>
2  2    b NA <NA>
3  3    c  3    c
4  4    d  4    d
5  5    e  5    e
6 NA <NA>  6    f
7 NA <NA>  7    g
8 NA <NA>  8    h

(Row order after a UNION is not guaranteed; add an ORDER BY if you need one.)

There it is, a full outer join using sqldf and SQLite as a backend.

As I also mentioned, sqldf supports more backends than SQLite. A Google search reveals that full outer joins are emulated the exact same way in MySQL. I am less familiar with PostgreSQL, but this question suggests that full outer joins are supported directly there as well.
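
A hedged sketch of what that would look like, assuming the RPostgreSQL package is installed and a PostgreSQL server is configured for sqldf (the driver name and connection options are assumptions; see ?sqldf):

library(sqldf)
# assumes RPostgreSQL is installed and a reachable PostgreSQL server is
# configured via sqldf's connection options; see ?sqldf
options(sqldf.driver = "PostgreSQL")
# PostgreSQL supports FULL OUTER JOIN natively, so no UNION workaround is needed
sqldf("select * from dat1 full outer join dat2 on dat1.x = dat2.w")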

Dialectician answered 8/6, 2013 at 20:1 Comment(4)
I did the above but somehow I am getting twice the number of rows compared to merge(x = dat1, y = dat2, by.x = "x", by.y = "w", all = TRUE). I have no idea why!Sestina
This works only if there are no duplicate rows in either dat1 or dat2, which is the case here. UNION returns unique rows and would remove the duplicates.Procreant
@Joran, before posting my comment I upvoted your answer because I found it useful and relevant in this case. However, I still think that my comment is helpful, because if you use UNION ALL you'll get the common rows twice, which is probably not what is wanted and makes it different from FULL OUTER JOIN.Procreant
To demonstrate this, just run the above code with UNION ALL => 11 rows, but the expected count is 8.Procreant

Without sqldf, here is a smart and simple solution:

merge(a, b, by = "col", all = TRUE)
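
Applied to the example frames from the sqldf answer above (note that base R's merge works entirely in RAM, so it may still hit the memory limits described in the question):

dat1 <- data.frame(x = 1:5, y = letters[1:5])
dat2 <- data.frame(w = 3:8, z = letters[3:8])
# all = TRUE asks merge for the full outer join
merge(dat1, dat2, by.x = "x", by.y = "w", all = TRUE)

To join many frames that share a key column, the merge can be folded over a list (list_of_dfs is a hypothetical name):

Reduce(function(a, b) merge(a, b, by = "col", all = TRUE), list_of_dfs)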

Sour answered 15/10, 2014 at 13:46 Comment(0)

If you are using ffbase, you can get your desired full outer join by combining expand.ffgrid with merge.ffdf. expand.ffgrid is like expand.grid but works with ff vectors, so it will not overflow your RAM, and merge.ffdf lets you merge in another ffdf without overflowing your RAM, storing the data on disk. An example is below.

require(ffbase)
# build every combination of the two key vectors as an on-disk ffdf
x <- ffseq(1, 10000)
y <- ff(factor(LETTERS))
allcombinations <- expand.ffgrid(x, y)
# the data to attach to (some of) those key combinations
addme <- data.frame(Var1 = c(1, 2), Var2 = c("A", "B"), measure = rnorm(2))
addme <- as.ffdf(addme)
# left-join addme onto the full grid, keeping all key combinations
myffdf <- merge(allcombinations, addme,
                by.x = c("Var1", "Var2"), by.y = c("Var1", "Var2"), all.x = TRUE)
myffdf[1:10, ]

Next, look at the "delete rows ff package" question for how to subset the resulting myffdf.

Do have a look at ?ffbase::expand.ffgrid and ?ffbase::merge.ffdf
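
A hedged sketch of that subsetting step, assuming ffbase's subset method for ffdf objects (the filter condition is illustrative):

# keep only the grid rows that actually matched something in addme,
# i.e. drop rows whose joined measure came back missing
myffdf_matched <- subset(myffdf, !is.na(measure))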

Sharitasharity answered 9/6, 2013 at 10:18 Comment(0)

This might work (note: the key column must be the first column in every dataset).

library(ff)
library(ffbase)

fullouterjoin <- function(ffdf1, ffdf2){

    # do a left outer join
    leftjoin <- merge(ffdf1, ffdf2, by = "key", all.x = TRUE)

    # do a right outer join (it's just a left outer join with the objects swapped)
    rightjoin <- merge(ffdf2, ffdf1, by = "key", all.x = TRUE)

    # swap the column order (put the ffdf1 columns first and the ffdf2 columns after)
    srightjoin <- rightjoin[c(names(ffdf1), names(ffdf2)[2:length(ffdf2)])]

    # stack left outer join on top of the (swapped) right outer join
    stacked <- rbind(leftjoin, srightjoin)

    # remove duplicate rows
    uniques <- unique(stacked)

    # that's it
    return(uniques)
}

usage:

newffdf <- fullouterjoin(some_ffdf, another_ffdf)

I'm not saying it's fast, but it might overcome the memory barrier.

Amplification answered 29/11, 2013 at 4:27 Comment(2)
I'm working with much larger data sets. I've tried your method and I get "Error in row.names<-.data.frame(*tmp*, value = rownam[as.integer(i2)]) : duplicate 'row.names' are not allowed" upon executing the uniques <- unique(stacked) part. Do you have any idea where this is coming from?Depravity
Sorry, no idea (I haven't touched R since then).Amplification
