I am trying to cross join two data tables, but with two conditions that I need to specify before the join, because if I do not, the memory limit kills it. (See this annoying feature of data table.)
Here is a smaller, reproducible example below:
N=100
J=50
dat=data.table(CJ('t'=1:N,'a'=1:N,'j'=1:5))
dat2 = data.table(CJ('j_prime'=1:J,'t_prime'=1:N))
In my real example, N
is very large. I want to join so the output is this:
datfinal = dat[, k:=(t+1)][dat2[, k:=t_prime], on=.(k), nomatch=0L,allow.cartesian=TRUE][,k:=NULL]
datfinal = datfinal[(j<=j_prime & a!=1) | a==1]
> datfinal[(j<=j_prime & a!=1) | a==1]
t a j j_prime t_prime
1: 1 1 1 1 2
2: 1 1 2 1 2
3: 1 1 3 1 2
4: 1 1 4 1 2
5: 1 1 5 1 2
---
2376986: 99 100 1 50 100
2376987: 99 100 2 50 100
2376988: 99 100 3 50 100
2376989: 99 100 4 50 100
2376990: 99 100 5 50 100
How do I do this more efficiently? I tried something like the below, but it gives weird results:
datfinal = dat[, k:=(t+1)][dat2[, k:=t_prime], on=.(k,j<=j_prime), nomatch=0L,allow.cartesian=TRUE][,k:=NULL]
j <= j_prime
(second code block), there is no way to retain the other-half of your first-block conditional,| a == 1
. Without that, you are discarding those rows too early. – Cabot