Data.table self-join on condition using a matrix
Asked Answered
S

1

1

I am trying to do a join of a data.table with itself. The condition to join is based on the value of a column (not the key) being used to access a matrix. Each row has a date(in seconds) and records should only join with newest records (t1

cn<-unique(sdd$column)
mat<-matrix(data=0,nrow=lde,ncol=lde,dimnames=list(cn,cn))

I am struggling with the documentation FAQ (including the SQL to data.table analogy) and the Beginner's Guide for data.table and multiple similar questions in this forum but I can't find how to solve it.

Q1, Q2, Q3

I get stuck with notation problems.

My DT is sdd:

> colnames(sdd)
[1] "ID" "DelayStartTimeSeconds" "DelayEndTimeSeconds" "EquipmentID"

I have made several attempts such as:

sdd2<-sdd # to avoid problems with the names of variables
sdd[sdd2,eqXrossM[cbind(sdd.EquipmentID,sdd2.EquipmentID)]==1 & sdd.DelayEndTimeSeconds<sdd2.DelayStartTimeSeconds, distance:=sdd.DelayEndTimeSeconds-sdd2.DelayStartTimeSeconds][,distance:=sdd.DelayEndTimeSeconds<sdd2.DelayStartTimeSeconds] # that would be the whole thing to do generating a new column with the time difference

sdd[sdd2[ sdd.DelayEndTimeSeconds<sdd2.DelayStartTimeSeconds, distance:=sdd.DelayEndTimeSeconds<sdd2.DelayStartTimeSeconds]] #this is an approximation attempt.

I simply don't get the notation and the different examples seem to use different notations.

EDIT: Well, after a night of sleep some stuff is making sense... but other is still confusing just the same. For example:

sdd_x<-sdd[sdd2,i.DelayStartTimeSeconds>DelayEndTimeSeconds] # returns a vector as long as sdd: len
sdd_x<-sdd[sdd2,i.DelayStartTimeSeconds>DelayEndTimeSeconds & eqXrossM[i.EquipmentID,EquipmentID]==1] # returns a matrix len x len.

Why adding a new condition changes the type of output? I was expecting a case such as the matrix (that would require optimization) In addition the whole matrix is false what is not the expected value as the records are different. In fact for the second case, either the upper or lower diagonal should be TRUE.

Also, looks like the call to the matrix doesn't require using cbind as other answer to a similar question mentioned. Why that?

And my last discovery has been finding out the CJ() operator but trying to use the i. notation doesn't work here. This part doesn't seem too documented.

sdd[CJ(DASDelayID,DASDelayID),i.DelayStartTimeSeconds>DelayEndTimeSeconds]

Any help would be appreciated.

Slaty answered 17/3, 2015 at 13:54 Comment(5)
add a simple reproducible example that illustrates your confusionWinded
@eddi, I have shown a couple of attempts to code it, I am getting errors sdd.EndTime not found. So I really don't know how to refer to the columns for each of the DT in the join and definitely I think the trick of having the DT with a second name is unnecessary. How should I refer to the variables in the first and second DT?Slaty
use i. to refer to the i-expression data.table, e.g. dt1[dt2, col.from.dt1 := i.col.from.dt2]Winded
@eddi, could you give me any more hints to continue solving the problem? I am almost there... Thanks in advanceSlaty
I'm afraid I don't know what you're doing or even trying to do. It does look like you're a bit confused about R's recycling though - check out the output of 1:5 * matrix(1:25, nrow = 5) - once you understand what happens there it should at least clear out some of your confusion (also search "R recycling" on google).Winded
S
0

This is how I finally solved the problem:

sddx<<-CJ(ID1=sdd$DASDelayID,ID2=sdd$DASDelayID)[
    ID1<ID2] [,
              ':='(Connected=eqXrossM[cbind(sdd[DASDelayID==ID2,EquipmentID],sdd[DASDelayID==ID1,EquipmentID])]==1,
                   Distance=as.integer(sdd[DASDelayID==ID2,DelayStartTimeSeconds]-sdd[DASDelayID==ID1,DelayEndTimeSeconds]))
              ]

Step by step:

Generate all the combinations of DelayID, the number is large but each row has only two columns integers.

sddx<<-CJ(ID1=sdd$DASDelayID,ID2=sdd$DASDelayID) 

This cuts the size to half, since ID1 are given as they are created, ordered by DelayStartTime and DelayEndTime>DelayStartTime.

[ID1<ID2] 

This enforces the external condition accessing the matrix, note the cbind:

[,':='(Connected=eqXrossM[cbind(sdd[DASDelayID==ID2,EquipmentID],sdd[DASDelayID==ID1,EquipmentID])]==1,

This calculates the distance between Delays, that can be used to filter the ones where it is not strictly positive

Distance=as.integer(sdd[DASDelayID==ID2,DelayStartTimeSeconds]-sdd[DASDelayID==ID1,DelayEndTimeSeconds]))  
              ]

I hope it helps someone else.

Slaty answered 19/3, 2015 at 1:54 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.