R intersect data.frame on multiple criteria

Asked 24/3, 2013 at 4:42 Answered 24/3, 2013 at 5:31

I am trying to populate a binary vector based on the intersection of two data.frames on multiple criteria.

I have the code working but I feel that it is memory excessive just to get the binary vector.

When I apply my code to my full data (40mm+ rows). I begin to have memory problems.

Is there a simpler way to produce the vector?

Here is some sample data (e.g., sub sample will only include obs. in full sample):

ob1_1 <- as.data.frame(cbind(c(1999),c("111","222","666","777")),stringsAsFactors=FALSE)
ob2_1 <- as.data.frame(cbind(c(2000),c("111","333","555","777")),stringsAsFactors=FALSE)
ob3_1 <- as.data.frame(cbind(c(2001),c("111","222","333","777")),stringsAsFactors=FALSE)
ob4_1 <- as.data.frame(cbind(c(2002),c("111","444","555","777")),stringsAsFactors=FALSE)

full_sample <-  rbind(ob1_1,ob2_1,ob3_1,ob4_1)
colnames(full_sample) <- c("yr","ID")

ob1_2 <- as.data.frame(cbind(c(1999),c("111","222","777")),stringsAsFactors=FALSE)
ob2_2 <- as.data.frame(cbind(c(2000),c("333")),stringsAsFactors=FALSE)
ob3_2 <- as.data.frame(cbind(c(2001),c("888")),stringsAsFactors=FALSE)
ob4_2 <- as.data.frame(cbind(c(2002),c("111","444","555","777")),stringsAsFactors=FALSE)

sub_sample <-  rbind(ob1_2,ob2_2,ob3_2,ob4_2)
colnames(sub_sample) <- c("yr","ID")

Here is my working code:

q_intersect <- ""
q_intersect <- paste(q_intersect , "select       a.yr, a.ID       ", sep=" ")
q_intersect <- paste(q_intersect , "from         full_sample a  ", sep=" ")
q_intersect <- paste(q_intersect , "intersect                     ", sep=" ")
q_intersect <- paste(q_intersect , "select       b.yr, b.ID       ", sep=" ")
q_intersect <- paste(q_intersect , "from         sub_sample b  ", sep=" ")
q_intersect <- trim(gsub(" {2,}", " ", q_intersect ))

intersect_temp <- cbind(sqldf(q_intersect ),1)
colnames(intersect_temp ) <- c("yr","ID","in_both")

q_expand <- ""
q_expand <- paste(q_expand , "select       in_both            ", sep=" ")
q_expand <- paste(q_expand , "from         full_sample a      ", sep=" ")
q_expand <- paste(q_expand , "left join    intersect_temp  b  ", sep=" ")
q_expand <- paste(q_expand , "on           a.yr=b.yr          ", sep=" ")
q_expand <- paste(q_expand , "and          a.ID=b.ID          ", sep=" ")
q_expand <- trim(gsub(" {2,}", " ", q_expand ))

solution <- as.integer(sqldf(q_expand)[,1])
solution [is.na(solution )] <- 0

Thanks ahead of time for any help!

Efferent answered 24/3, 2013 at 4:42 Comment(3)

You should explain what you want in natural language. (The code you say is working just locked up my session of R, possibly because there is no trim function?) No, the first sqldf call locks it up. – Truism 24/3, 2013 at 4:53

@Dwin, I just had the same lockup issue. But only within the first 4 lines of the q_intersect portion. Incidentally, Brad, in your previous question you were using data.table and here you are using data.frame. Is this deliberate? – Devilish 24/3, 2013 at 4:57

@RicardoSaporta Hi. Have just untagged data.table. See discussion in chatroom yesterday which corrected me. Might do a larger untagging sweep. The SO view is that tags are for questions not answers. With fresh eyes, I now see and agree that the search "[data.table] data.table is:question" is what "[data.table]" alone should look like. However it seems ok (see chat) to tag accepted answers using data.table where votes>10 as well. To search for answers using data.table we can use "[r] -[data.table] data.table is:answer". – Pirouette 27/3, 2013 at 10:56

It's not altogether clear what you are trying to accomplish, but I believe something like this would be a lot simpler.

library(data.table)
fullDT <- data.table(full_sample, key=c("yr", "ID"))
subDT  <- data.table(sub_sample,  key=c("yr", "ID"))

fullDT[ , intersect := 0L]
fullDT[subDT, intersect := 1, nomatch=0]

The idea is that you set the key of each data.table to be the columns you want to intersect. When you call full[sub], nomatch=0] you get your inner join, and we set only those values to 1; the values not identified in the inner join are left as 0, as set in the line prior.

fullDT
#        yr  ID intersect
#   1: 1999 111         1
#   2: 1999 222         1
#   3: 1999 666         0
#   4: 1999 777         1
#   5: 2000 111         0
#   6: 2000 333         1
#   7: 2000 555         0
#   8: 2000 777         0
#   9: 2001 111         0
#  10: 2001 222         0
#  11: 2001 333         0
#  12: 2001 777         0
#  13: 2002 111         1
#  14: 2002 444         1
#  15: 2002 555         1
#  16: 2002 777         1

Devilish answered 24/3, 2013 at 5:3 Comment(6)

Ricardo, this looks good. I was using data.frames to create the sample data, but I prefer data.tables when possible. – Efferent 24/3, 2013 at 5:14

This is definitely much simpler than my approach. Thanks you again! – Efferent 24/3, 2013 at 5:15

@Brad, no prob at all. I was asking in my comments above, simply because it makes the problem a lot easier. – Devilish 24/3, 2013 at 5:21

You could speed this up by setting intersect :=0 for the whole dataset (before setting the subset that joins to 1). – Disinterested 25/3, 2013 at 2:20

@RicardoSaporta -- you might need to reword your description slightly as well. – Disinterested 25/3, 2013 at 3:37

@mnel, thanks again. Just edited (I'm working form my iPad, but it should be all correct now) – Devilish 25/3, 2013 at 4:34

Simpler SQL I gather that you wish to create a one column data frame with the same number of rows as full_sample such that a given row in the output contains 1 if the corresponding row in full_sample has a matching sub_sample row and 0 otherwise.

In that case, the multiple SQL statements can be condensed into a single simpler SQL statement as shown below. The left join ensures that all rows of full_sample are included and the natural join causes the join to occur on all column names that are common between the two input data frames.

sqldf("select s.yr is not null as solution 
       from full_sample f natural left join sub_sample s")

(By the way, note that string literals can flow over multiple lines as this shows so its not necessary to paste multiple lines together.)

Out of Memory Database sqldf by default uses an in memory database but you can specify a file name (which need not exist ahead of time) via the dbname= argument to use as as out of memory database. In that case you won't be limited by memory.

sqldf("select s.yr is not null as solution 
       from full_sample f natural left join sub_sample s", dbname = "mydb")

(Also you can improve performance in some cases by using indexes. See the sqldf home page for examples.)

UPDATE: added simpler sql solution

Referent answered 24/3, 2013 at 5:31 Comment(0)

Recommended topics

Hot tags