data table conditional cross join with two conditions
Asked Answered
O

1

1

I am trying to cross join two data tables, but with two conditions that I need to specify before the join, because if I do not, the memory limit kills it. (See this annoying feature of data table.)

Here is a smaller, reproducible example below:

    N=100
    J=50
    dat=data.table(CJ('t'=1:N,'a'=1:N,'j'=1:5))
    dat2 = data.table(CJ('j_prime'=1:J,'t_prime'=1:N))

In my real example, N is very large. I want to join so the output is this:

    datfinal = dat[, k:=(t+1)][dat2[, k:=t_prime], on=.(k), nomatch=0L,allow.cartesian=TRUE][,k:=NULL]

    datfinal = datfinal[(j<=j_prime & a!=1) | a==1]


> datfinal[(j<=j_prime & a!=1) | a==1]
          t   a j j_prime t_prime
      1:  1   1 1       1       2
      2:  1   1 2       1       2
      3:  1   1 3       1       2
      4:  1   1 4       1       2
      5:  1   1 5       1       2
     ---                         
2376986: 99 100 1      50     100
2376987: 99 100 2      50     100
2376988: 99 100 3      50     100
2376989: 99 100 4      50     100
2376990: 99 100 5      50     100

How do I do this more efficiently? I tried something like the below, but it gives weird results:

datfinal = dat[, k:=(t+1)][dat2[, k:=t_prime], on=.(k,j<=j_prime), nomatch=0L,allow.cartesian=TRUE][,k:=NULL]
Officialese answered 21/6, 2022 at 21:14 Comment(1)
If you join on j <= j_prime (second code block), there is no way to retain the other-half of your first-block conditional, | a == 1. Without that, you are discarding those rows too early.Cabot
B
0

I agree with @r2evans, you can't formulate all the restrictions prior to joining. But you could subset data.table dat on a==1 and a!=1, join twice with dat2 and bind together the two resultsets:

datfinalA = dat[a==1, k:=(t+1)][dat2[, k:=t_prime], on=.(k), nomatch=0L,allow.cartesian=TRUE, .(t, a, j, j_prime, t_prime)]
datfinalB = dat[a!=1, k:=(t+1)][dat2[, k:=t_prime], on=.(k, j <= j_prime), nomatch=0L, allow.cartesian=TRUE, .(t, a, j, j_prime, t_prime)]

datfinal = rbind(datfinalA, datfinalB)
Biserrate answered 22/6, 2022 at 7:49 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.