Faster way to achieve filtering for all possible combinations
Asked Answered
E

2

5

Consider I have a data frame like this,

set.seed(1)

q<-100

df <- data.frame(Var1 = round(runif(q,1,50)),
                    Var2 = round(runif(q,1,50)),
                        Var3 = round(runif(q,1,50)),
                            Var4 = round(runif(q,1,50)))
attach(df)

As you realized, q is standing for setting the length of the each columns in the dataframe.

I want to make a filtering of all possible combinations of the columns. It can be anything. Let's say I am seeking for if the devision of the sums of the first two columns and the sums of the last two columns greater than 1 or not.

One thing to achieve that, using expand.grid() function.

a <- Sys.time()

expanded <- expand.grid(Var1, Var2, Var3, Var4)

Sys.time() - a

Time difference of 8.31997 secs


expanded  <- expanded[rowSums(expanded[,1:2])/ rowSums(expanded[,3:4])>1,]

However it takes a lot time! To make it faster, I tried to follow the answer with rep.int() function in this question and designed my own function.

myexpand <- function(...) {

 sapply(list(...),function(y) rep.int(y, prod(lengths(list(...)))/length(y)))

}

But it is not so promising again. It takes more time comparing to my expectation and the expand.grid also.And, If I set a greater q, it becomes a nigthmare!

Is there a proper way to achieve this a lot faster (1-2 seconds) with maybe matrix operations before applying expand.grid or myexpand . And, I wonder if it is a weakness of using an interpreted language like R... Software suggestions are also acceptable.

Empale answered 12/8, 2019 at 8:7 Comment(4)
Maybe try parallelization...? or maybe an Rcpp implementation?Principalities
@Principalities I have no idea about parallelization and how to apply it in R. But I will have a look at.Empale
Here is a link to get you startedPrincipalities
you might want to check out cran.r-project.org/web/packages/RcppAlgos/index.htmlChandigarh
C
3

For this particular condition (i.e. ratio of sums > 1), you might want to consider using the data.table package:

system.time({
    #generate permutations of Var1 & Var2 and Var3 & Var4
    DT12 <- DT[, CJ(Var1=Var1, Var2=Var2, unique=TRUE)][, s12 := Var1 + Var2]
    DT34 <- DT[, CJ(Var3=Var3, Var4=Var4, unique=TRUE)][, s34 := Var3 + Var4]

    #perform a non-equi join
    DT12[DT34, on=.(s12>s34), allow.cartesian=TRUE,
        .(Var1=x.Var1, Var2=x.Var2, Var3=i.Var3, Var4=i.Var4)][, s12:=NULL]
})

timing:

   user  system elapsed 
   0.02    0.06    0.08 

output:

         Var1 Var2 Var3 Var4
      1:    2    5    2    4
      2:    4    3    2    4
      3:    5    2    2    4
      4:    2    6    2    4
      5:    4    4    2    4
     ---                    
1753416:   50   49   49   48
1753417:   50   50   49   48
1753418:   50   49   49   49
1753419:   50   50   49   49
1753420:   50   50   49   50

data:

library(data.table)
set.seed(1)
q <- 100
DT <- data.table(Var1 = round(runif(q,1,50)),
    Var2 = round(runif(q,1,50)),
    Var3 = round(runif(q,1,50)),
    Var4 = round(runif(q,1,50)))

edit: For summing of positive numbers, you can prob use the following (caveat: it will not be faster than using a Rcpp approach).

system.time({
    S <- DT[, .(UB=90 - Var1, C1=Var1)]
    for (k in 2:4) {
        S <- DT[S, on=paste0("Var", k, "<UB"), allow.cartesian=TRUE,
            mget(c(names(S), paste0("x.Var", k)))]
        setnames(S, paste0("x.Var", k), paste0("C", k))
        S[, UB := UB - get(paste0("C",k))]
    }
    S[, UB := NULL][rowSums(S)>30L]
})

timing:

   user  system elapsed 
   3.48    4.06    3.51 

output, S:

> S
          C1 C2 C3 C4
       1: 14 33 14  6
       2: 14 33 14 25
       3: 14 33 14 24
       4: 14 33 14 19
       5: 14 33 14 10
      ---            
34914725: 31 39  3  8
34914726: 31 39  3  8
34914727: 31 39  3  9
34914728: 31 39  3 16
34914729: 31 39  3  8
Chandigarh answered 13/8, 2019 at 5:29 Comment(13)
I got the error object 'x.Var1' not found. Additionally, if I run nrow(expanded) after filtering by >1 or not, I get 55615910 but you found 1753420.And one thing more, the dataframe example is just a fake data created for make the question replicable.In reality, expand.grid function also works with different number of inputs. I.e. expand.grid(1:100,1:5,1:1000) . I don't really know if datatable is fine with list type inputs...Empale
i am guessing that your solution has duplicates. data.table::CJ can take in any number of vectors as wellChandigarh
I don't understand but I still get object 'x.Var1' not found error. And one more thing just to make it easier, what if we make the filtering say total sum of all variables greater than 30 and less than 90. If I apply this, DTnew <- DT[, CJ(Var1,Var2,Var3, Var4, unique=FALSE)][, Total := Var1+ Var2 +Var3 + Var4] . It takes more time comparing to expand.grid. Sorry I am not so familiar with DT.Empale
I will take a look at the error later. For sum, you can use Joseph Wood’s RcppAlgos packageChandigarh
I talked to Joseph Wood on this site under a question he answered. And he said he was sorry about the Cartesian Product.Unfortunately RcppAlgos package does not have an option for it but it is something that he is working on...He also saw this question as he said. By the way, thank you for your help and sorry for the delay of the answer.Empale
@Empale i am not getting the error. Did you use your own data.frame or my data.table data? if u are using the former, can you set it to data.table using setDT(df)Chandigarh
Sorry but I still get. And I used your DT input. It is happening DT12[DT34, on=.(s12>s34), allow.cartesian=TRUE, .(Var1=x.Var1, Var2=x.Var2, Var3=i.Var3, Var4=i.Var4), by=.EACHI][, s12:=NULL] part.Anyway, I will upvote your answer..Empale
can you share your data.table version?Chandigarh
packageVersion("data.table") gives 1.12.2Empale
so if u run the data code in a new R session and then the top code panel, you get the object 'x.Var1' not found?Chandigarh
woah thanks. learned about streamable.com today. let me chew over it some more. maybe there might be some seed that can trigger the error. can you print DT12 and DT34? i added something to specify the names explictly. strange errorChandigarh
@Chandigarh why the .EACHI? Since there is no aggregation, there's probably no memory saving using it, right...?Redeemable
@Cole, thanks for that. Yeah not required. Not in front of computer to test thoughChandigarh
P
0

How about using tidyr and dplyr?

library(tidyr)
library(dplyr)

expanded <- 
  df %>% 
  expand(Var1, Var2, Var3, Var4)
  mutate(res = (Var1 + Var2)/(Var3 + Var4)) %>%
  filter(res > 1)

Note that expand works slightly differently from expand.grid - it only keeps the unique rows - so this may not be what you need.

On my computer, this all takes less than a second.

Penult answered 12/8, 2019 at 14:59 Comment(3)
How did you calculate expanded ? It is the real problematic part.Empale
I misunderstood your question. I thought it was just the calculation and filtering step that you wanted to speed up. I've now edited my response to include the expansion step.Penult
Thanks for your effort. I am really appreciated. However, this problem is not about using base R versus dplyr. It is much more complicated. And using pipe operator is not a solution. Because I am seeking for possible matrix operations before expanding which result in a real fast solution. Once you run the codes on your computer, you can realize the situation. If you follow my question step by step, and read the commands like parallelization suggestion, you will understand the real topic. If the %>% were the case, believe me, I could handle it without asking. Thank you very much.Empale

© 2022 - 2024 — McMap. All rights reserved.