Efficient conditional cross join in data table
Asked Answered
F

1

0

I am trying to efficiently merge with a condition.

The way I am doing it now is to cross-join (which I want to preserve) except I have one condition for a subset of the columns.

Cross join function (from another question post):

CJ.table.1 <- function(X,Y)
      setkey(X[,c(k=1,.SD)],k)[Y[,c(k=1,.SD)],allow.cartesian=TRUE][,k:=NULL]
    
set.seed(1)
#generate data
x = data.table(t=rep(1:10,2), z=sample(1:10,20,replace=T))
x2 = data.table(tprime=rep(1:10,2), zprime=sample(1:10,20,replace=T))

joined = CJ.table.1(x,x2)
> joined
      t  z tprime zprime
  1:  1  3      1     10
  2:  2  4      1     10
  3:  3  6      1     10
  4:  4 10      1     10
  5:  5  3      1     10
 ---                    
396:  6  5     10      5
397:  7  8     10      5
398:  8 10     10      5
399:  9  4     10      5
400: 10  8     10      5

Then I want to make sure t is increasing by 1 only.

setcolorder(joined, c("t", "tprime", "z",'zprime'))
joined=joined[tprime==t+1]

The final desired output is then:

> joined
    t tprime  z zprime
 1: 1      2  3      3
 2: 1      2  3      3
 3: 2      3  4      7
 4: 2      3  2      7
 5: 3      4  6      2
 6: 3      4  7      2
 7: 4      5 10      3
 8: 4      5  4      3
 9: 5      6  3      4
10: 5      6  8      4
11: 6      7  9      1
12: 6      7  5      1
13: 7      8 10      4
14: 7      8  8      4
15: 8      9  7      9
16: 8      9 10      9
17: 9     10  7      4
18: 9     10  4      4
19: 1      2  3      6
20: 1      2  3      6
21: 2      3  4      5
22: 2      3  2      5
23: 3      4  6      2
24: 3      4  7      2
25: 4      5 10      9
26: 4      5  4      9
27: 5      6  3      7
28: 5      6  8      7
29: 6      7  9      8
30: 6      7  5      8
31: 7      8 10      2
32: 7      8  8      2
33: 8      9  7      8
34: 8      9 10      8
35: 9     10  7      5
36: 9     10  4      5
    t tprime  z zprime

The reason I want to condition BEFORE the cross join is that the actual data I have is huge and therefore, it is inefficient to generate the entire thing first and THEN prune it down.

The reason I can't just do a merge is that I need to cross join the other rows as well.

Flowerage answered 11/3, 2019 at 22:59 Comment(16)
I’ve had similar issues with data.table and it would be very helpful to know the answer to thisPhylloquinone
An inner join is just a cross join followed by selection on a condition. Just write the inner join that is on the condition you want. (Cross join is inner join on true.)Wanids
for e.g. x[, k:=t+1][x2[, k:=tprime], on=.(k), nomatch=0L][, .(t, tprime, z, zprime)]Bramble
1.6 What are the benefits of being able to use column names as if they are variables inside DT[...]? "inside the square brackets, data.table sees this query as a whole before any part of it is evaluated. Thus it can optimize the combined query for performance. It can do this because the R language uniquely has lazy evaluation"Wanids
@chinsoon12, I think you need allowcartesian =T for this. Also, what about it you have multiple conditions? Particularly, suppose you have the t condition, but also a condition that says zprime must be some function of z (e.g. it must be less than the square)Flowerage
you can add allowcartesian=TRUE whenever u get that error message. to add more condition, use something along the line of x[, c('k', 'y1') := .(t+1, z)][x2[, c('k','y2') := .(tprime, myfun(zprime)], on=.(k=k, y1<y2)][, .(t, tprime, z, zprime)]Bramble
@chinsoon12, interesting. Suppose you have y1<y2, y3<y4, etc.. for an arbitrary number of y. How would you express this dynamically in that sense? something with Reduce?Flowerage
on works with character string as well. just don't leave any space e.g. on=c("y1<y2", "y3<y4"). you can see the help in ?data.tableBramble
@Bramble if you post the comment as an answer I am happy to award best answer. Additionally, I asked the multiple conditions inner join here: #55132730Flowerage
For large joins, I am getting this error: Error in vecseq(f__, len__, if (allow.cartesian || notjoin || !anyDuplicated(f__, : Join results in more than 2^31 rows (internal vecseq reached physical limit). Very likely misspecified join. Check for duplicate key values in i each of which join to the same group in x over and over again. If that's ok, try by=.EACHI to run j for each group to avoid the large allocation. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and data.table... I have enough memory for sure, but is there any way around? Using R 3.5Flowerage
@Bramble any advice on the above comment?Flowerage
thats a huge number of rows. r u sure about it?Bramble
@Bramble I am sure about it being correct yesFlowerage
@Bramble I have more than enough memory as well.Flowerage
Any idea for a workaround if datatable won't let me @Bramble ? Reproducible example in this question: #72696300Flowerage
@Bramble started a bounty here #61091813Flowerage
P
0

You could create tprime = t+1 in x and join on it:

x[,tprime:=t+1]
x[x2,on=.(tprime),nomatch=0]

    t  z tprime zprime
 1: 1  6      2      6
 2: 1  4      2      6
 3: 2  2      3      1
 4: 2  8      3      1
 5: 3  8      4      4
 6: 3  2      4      4
 7: 4  3      5      4
 8: 4  1      5      4
 9: 5  6      6      6
10: 5  4      6      6
11: 6  6      7      2
12: 6  4      7      2
13: 7  9      8      3
14: 7  4      8      3
15: 8  7      9      1
16: 8  6      9      1
17: 9 10     10      1
18: 9 10     10      1
19: 1  6      2      5
20: 1  4      2      5
21: 2  2      3      4
22: 2  8      3      4
23: 3  8      4     10
24: 3  2      4     10
25: 4  3      5      9
26: 4  1      5      9
27: 5  6      6      4
28: 5  4      6      4
29: 6  6      7      6
30: 6  4      7      6
31: 7  9      8     10
32: 7  4      8     10
33: 8  7      9     10
34: 8  6      9     10
35: 9 10     10      6
36: 9 10     10      6
    t  z tprime zprime

or without modifying x:

x[,.(t,z,tprime=t+1)][
  x2,on=.(tprime),nomatch=0]

According to microbenchmark first solution seems a bit faster:

Unit: milliseconds
                                                                    expr    min      lq
 {     x[, `:=`(tprime, t + 1)]     x[x2, on = .(tprime), nomatch = 0] } 1.1060 1.21755
           x[, .(t, z, tprime = t + 1)][x2, on = .(tprime), nomatch = 0] 1.2967 1.43045
Unit: milliseconds
                                                                    expr    min     lq
 {     x[, `:=`(tprime, t + 1)]     x[x2, on = .(tprime), nomatch = 0] } 1.0921 1.1508
           x[, .(t, z, tprime = t + 1)][x2, on = .(tprime), nomatch = 0] 1.2688 1.3373
     mean  median      uq    max neval
 1.348345 1.22910 1.42265 2.8227   100
 1.521294 1.40295 1.59990 2.9437   100
Pinworm answered 27/6, 2022 at 6:24 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.