I am trying to populate a binary vector based on the intersection of two data.frames on multiple criteria.
I have the code working, but I feel that it uses an excessive amount of memory just to get the binary vector.
When I apply my code to my full data (40 million+ rows), I begin to have memory problems.
Is there a simpler way to produce the vector?
Here is some sample data (note: the sub sample is meant to include only observations that also appear in the full sample):
ob1_1 <- as.data.frame(cbind(c(1999),c("111","222","666","777")),stringsAsFactors=FALSE)
ob2_1 <- as.data.frame(cbind(c(2000),c("111","333","555","777")),stringsAsFactors=FALSE)
ob3_1 <- as.data.frame(cbind(c(2001),c("111","222","333","777")),stringsAsFactors=FALSE)
ob4_1 <- as.data.frame(cbind(c(2002),c("111","444","555","777")),stringsAsFactors=FALSE)
full_sample <- rbind(ob1_1,ob2_1,ob3_1,ob4_1)
colnames(full_sample) <- c("yr","ID")
ob1_2 <- as.data.frame(cbind(c(1999),c("111","222","777")),stringsAsFactors=FALSE)
ob2_2 <- as.data.frame(cbind(c(2000),c("333")),stringsAsFactors=FALSE)
ob3_2 <- as.data.frame(cbind(c(2001),c("888")),stringsAsFactors=FALSE)
ob4_2 <- as.data.frame(cbind(c(2002),c("111","444","555","777")),stringsAsFactors=FALSE)
sub_sample <- rbind(ob1_2,ob2_2,ob3_2,ob4_2)
colnames(sub_sample) <- c("yr","ID")
Here is my working code:
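library(sqldf)
# trim() is not base R; it is assumed to come from an add-on package (e.g. gdata). trimws() is the base R equivalent.
q_intersect <- ""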
q_intersect <- paste(q_intersect , "select a.yr, a.ID ", sep=" ")
q_intersect <- paste(q_intersect , "from full_sample a ", sep=" ")
q_intersect <- paste(q_intersect , "intersect ", sep=" ")
q_intersect <- paste(q_intersect , "select b.yr, b.ID ", sep=" ")
q_intersect <- paste(q_intersect , "from sub_sample b ", sep=" ")
q_intersect <- trim(gsub(" {2,}", " ", q_intersect))
intersect_temp <- cbind(sqldf(q_intersect), 1)
colnames(intersect_temp) <- c("yr","ID","in_both")
q_expand <- ""
q_expand <- paste(q_expand , "select in_both ", sep=" ")
q_expand <- paste(q_expand , "from full_sample a ", sep=" ")
q_expand <- paste(q_expand , "left join intersect_temp b ", sep=" ")
q_expand <- paste(q_expand , "on a.yr=b.yr ", sep=" ")
q_expand <- paste(q_expand , "and a.ID=b.ID ", sep=" ")
q_expand <- trim(gsub(" {2,}", " ", q_expand))
solution <- as.integer(sqldf(q_expand)[,1])
solution[is.na(solution)] <- 0
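For comparison, the same 0/1 vector can be sketched in base R without sqldf by pasting the matching columns into one composite key per row and testing membership with %in%. This is only a minimal sketch on the sample data above: the names key_full and key_sub and the "_" separator are assumptions made for illustration (the separator must not occur inside yr or ID), and whether it fits in memory at 40 million+ rows is untested here.

```r
# Rebuild the sample data so the sketch runs on its own.
# yr is kept as character, matching what cbind() produces in the question.
full_sample <- data.frame(
  yr = rep(c("1999", "2000", "2001", "2002"), each = 4),
  ID = c("111", "222", "666", "777",
         "111", "333", "555", "777",
         "111", "222", "333", "777",
         "111", "444", "555", "777"),
  stringsAsFactors = FALSE
)
sub_sample <- data.frame(
  yr = c(rep("1999", 3), "2000", "2001", rep("2002", 4)),
  ID = c("111", "222", "777", "333", "888", "111", "444", "555", "777"),
  stringsAsFactors = FALSE
)

# One composite key per row (hypothetical helper names; "_" is an assumed
# separator that must not appear in yr or ID).
key_full <- paste(full_sample$yr, full_sample$ID, sep = "_")
key_sub  <- paste(sub_sample$yr,  sub_sample$ID,  sep = "_")

# 1 where the (yr, ID) pair of full_sample also occurs in sub_sample, else 0.
# The result is in the row order of full_sample.
solution <- as.integer(key_full %in% key_sub)
```

This avoids building the intermediate intersect and join tables, at the cost of two temporary character vectors for the keys.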
Thanks ahead of time for any help!
... trim function?) No, the first sqldf call locks it up. – Truism
... q_intersect portion. Incidentally, Brad, in your previous question you were using data.table and here you are using data.frame. Is this deliberate? – Devilish
... data.table where votes>10 as well. To search for answers using data.table we can use "[r] -[data.table] data.table is:answer". – Pirouette