Faster way to achieve filtering for all possible combinations

Asked 12/8, 2019 at 8:7 Answered 13/8, 2019 at 5:29

Consider I have a data frame like this,

set.seed(1)

q<-100

df <- data.frame(Var1 = round(runif(q,1,50)),
                    Var2 = round(runif(q,1,50)),
                        Var3 = round(runif(q,1,50)),
                            Var4 = round(runif(q,1,50)))
attach(df)

As you realized, q is standing for setting the length of the each columns in the dataframe.

I want to make a filtering of all possible combinations of the columns. It can be anything. Let's say I am seeking for if the devision of the sums of the first two columns and the sums of the last two columns greater than 1 or not.

One thing to achieve that, using expand.grid() function.

a <- Sys.time()

expanded <- expand.grid(Var1, Var2, Var3, Var4)

Sys.time() - a

Time difference of 8.31997 secs


expanded  <- expanded[rowSums(expanded[,1:2])/ rowSums(expanded[,3:4])>1,]

However it takes a lot time! To make it faster, I tried to follow the answer with rep.int() function in this question and designed my own function.

myexpand <- function(...) {

 sapply(list(...),function(y) rep.int(y, prod(lengths(list(...)))/length(y)))

}

But it is not so promising again. It takes more time comparing to my expectation and the expand.grid also.And, If I set a greater q, it becomes a nigthmare!

Is there a proper way to achieve this a lot faster (1-2 seconds) with maybe matrix operations before applying expand.grid or myexpand . And, I wonder if it is a weakness of using an interpreted language like R... Software suggestions are also acceptable.

Empale answered 12/8, 2019 at 8:7 Comment(4)

Maybe try parallelization...? or maybe an Rcpp implementation? – Principalities 12/8, 2019 at 8:25

@Principalities I have no idea about parallelization and how to apply it in R. But I will have a look at. – Empale 12/8, 2019 at 8:29

Here is a link to get you started – Principalities 12/8, 2019 at 8:44

you might want to check out cran.r-project.org/web/packages/RcppAlgos/index.html – Chandigarh 13/8, 2019 at 5:24

For this particular condition (i.e. ratio of sums > 1), you might want to consider using the data.table package:

system.time({
    #generate permutations of Var1 & Var2 and Var3 & Var4
    DT12 <- DT[, CJ(Var1=Var1, Var2=Var2, unique=TRUE)][, s12 := Var1 + Var2]
    DT34 <- DT[, CJ(Var3=Var3, Var4=Var4, unique=TRUE)][, s34 := Var3 + Var4]

    #perform a non-equi join
    DT12[DT34, on=.(s12>s34), allow.cartesian=TRUE,
        .(Var1=x.Var1, Var2=x.Var2, Var3=i.Var3, Var4=i.Var4)][, s12:=NULL]
})

timing:

   user  system elapsed 
   0.02    0.06    0.08

output:

         Var1 Var2 Var3 Var4
      1:    2    5    2    4
      2:    4    3    2    4
      3:    5    2    2    4
      4:    2    6    2    4
      5:    4    4    2    4
     ---                    
1753416:   50   49   49   48
1753417:   50   50   49   48
1753418:   50   49   49   49
1753419:   50   50   49   49
1753420:   50   50   49   50

data:

library(data.table)
set.seed(1)
q <- 100
DT <- data.table(Var1 = round(runif(q,1,50)),
    Var2 = round(runif(q,1,50)),
    Var3 = round(runif(q,1,50)),
    Var4 = round(runif(q,1,50)))

edit: For summing of positive numbers, you can prob use the following (caveat: it will not be faster than using a Rcpp approach).

system.time({
    S <- DT[, .(UB=90 - Var1, C1=Var1)]
    for (k in 2:4) {
        S <- DT[S, on=paste0("Var", k, "<UB"), allow.cartesian=TRUE,
            mget(c(names(S), paste0("x.Var", k)))]
        setnames(S, paste0("x.Var", k), paste0("C", k))
        S[, UB := UB - get(paste0("C",k))]
    }
    S[, UB := NULL][rowSums(S)>30L]
})

timing:

   user  system elapsed 
   3.48    4.06    3.51

output, S:

> S
          C1 C2 C3 C4
       1: 14 33 14  6
       2: 14 33 14 25
       3: 14 33 14 24
       4: 14 33 14 19
       5: 14 33 14 10
      ---            
34914725: 31 39  3  8
34914726: 31 39  3  8
34914727: 31 39  3  9
34914728: 31 39  3 16
34914729: 31 39  3  8

Chandigarh answered 13/8, 2019 at 5:29 Comment(13)

I got the error object 'x.Var1' not found. Additionally, if I run nrow(expanded) after filtering by >1 or not, I get 55615910 but you found 1753420.And one thing more, the dataframe example is just a fake data created for make the question replicable.In reality, expand.grid function also works with different number of inputs. I.e. expand.grid(1:100,1:5,1:1000) . I don't really know if datatable is fine with list type inputs... – Empale 13/8, 2019 at 7:24

i am guessing that your solution has duplicates. data.table::CJ can take in any number of vectors as well – Chandigarh 13/8, 2019 at 7:33

I don't understand but I still get object 'x.Var1' not found error. And one more thing just to make it easier, what if we make the filtering say total sum of all variables greater than 30 and less than 90. If I apply this, DTnew <- DT[, CJ(Var1,Var2,Var3, Var4, unique=FALSE)][, Total := Var1+ Var2 +Var3 + Var4] . It takes more time comparing to expand.grid. Sorry I am not so familiar with DT. – Empale 13/8, 2019 at 22:22

I will take a look at the error later. For sum, you can use Joseph Wood’s RcppAlgos package – Chandigarh 13/8, 2019 at 22:24

I talked to Joseph Wood on this site under a question he answered. And he said he was sorry about the Cartesian Product.Unfortunately RcppAlgos package does not have an option for it but it is something that he is working on...He also saw this question as he said. By the way, thank you for your help and sorry for the delay of the answer. – Empale 13/8, 2019 at 22:28

@Empale i am not getting the error. Did you use your own data.frame or my data.table data? if u are using the former, can you set it to data.table using setDT(df) – Chandigarh 14/8, 2019 at 0:52

Sorry but I still get. And I used your DT input. It is happening

DT12[DT34, on=.(s12>s34), allow.cartesian=TRUE,         .(Var1=x.Var1, Var2=x.Var2, Var3=i.Var3, Var4=i.Var4), by=.EACHI][, s12:=NULL]

part.Anyway, I will upvote your answer.. – Empale 14/8, 2019 at 9:23

can you share your data.table version? – Chandigarh 14/8, 2019 at 10:5

packageVersion("data.table") gives 1.12.2 – Empale 14/8, 2019 at 10:9

so if u run the data code in a new R session and then the top code panel, you get the object 'x.Var1' not found? – Chandigarh 14/8, 2019 at 10:10

woah thanks. learned about streamable.com today. let me chew over it some more. maybe there might be some seed that can trigger the error. can you print DT12 and DT34? i added something to specify the names explictly. strange error – Chandigarh 14/8, 2019 at 10:24

@Chandigarh why the .EACHI? Since there is no aggregation, there's probably no memory saving using it, right...? – Redeemable 7/9, 2019 at 13:17

@Cole, thanks for that. Yeah not required. Not in front of computer to test though – Chandigarh 7/9, 2019 at 21:59

How about using tidyr and dplyr?

library(tidyr)
library(dplyr)

expanded <- 
  df %>% 
  expand(Var1, Var2, Var3, Var4)
  mutate(res = (Var1 + Var2)/(Var3 + Var4)) %>%
  filter(res > 1)

Note that expand works slightly differently from expand.grid - it only keeps the unique rows - so this may not be what you need.

On my computer, this all takes less than a second.

Penult answered 12/8, 2019 at 14:59 Comment(3)

How did you calculate expanded ? It is the real problematic part. – Empale 12/8, 2019 at 15:19

I misunderstood your question. I thought it was just the calculation and filtering step that you wanted to speed up. I've now edited my response to include the expansion step. – Penult 12/8, 2019 at 16:37

Thanks for your effort. I am really appreciated. However, this problem is not about using base R versus dplyr. It is much more complicated. And using pipe operator is not a solution. Because I am seeking for possible matrix operations before expanding which result in a real fast solution. Once you run the codes on your computer, you can realize the situation. If you follow my question step by step, and read the commands like parallelization suggestion, you will understand the real topic. If the %>% were the case, believe me, I could handle it without asking. Thank you very much. – Empale 12/8, 2019 at 16:47

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Recommended topics

Hot tags