Summarize the self-join index while avoiding cartesian product in R data.table

Asked 27/2, 2015 at 9:39 Answered 15/7, 2015 at 16:33

With a 2-column data.table, I'd like to summarize the pairwise relationships in column 1 by summing the number of shared elements in column 2. In other words, how many shared Y elements does each pairwise combination of X-values have?

For example, I can do this in a 2-step process, first doing a cartesian cross join, then summarizing it like so:

d = data.table(X=c(1,1,1,2,2,2,2,3,3,3,4,4), Y=c(1,2,3,1,2,3,4,1,5,6,4,5))
setkey(d, Y)
d2 = d[d, allow.cartesian=TRUE]
d2[, .N, by=c("X", "i.X")]
 #  X i.X N
 #1: 1   1 3
 #2: 2   1 3
 #3: 3   1 1
 #4: 1   2 3
 #5: 2   2 4
 #6: 3   2 1
 #7: 1   3 1
 #8: 2   3 1
 #9: 3   3 3
#10: 4   2 1
#11: 2   4 1
#12: 4   4 2
#13: 4   3 1
#14: 3   4 1

The second row of this result indicates, that X=1 shares 3 Y-values with X=2; while X=3 shares only 1 y-value with X=4.

Is there any way to do this while bypassing the cartesian join step, which leads to large inefficient tables? I want to do something like this on a table with millions of rows, and the cartesian join runs into the 2^31 vector size limit (in addition to becoming slow).

I'm imagining something like this:

d[d, list(X, length(Y)), by=c("X", "i.X")]

But this gives the error i.X not found

I can do this in SQL with the code below -- but just can't figure out how to translate this into data.table syntax:

CREATE TABLE test (X integer, Y integer);
INSERT INTO test VALUES(1, 1);
INSERT INTO test VALUES(1, 2);
INSERT INTO test VALUES(1, 3);
INSERT INTO test VALUES(2, 1);
INSERT INTO test VALUES(2, 2);
INSERT INTO test VALUES(2, 3);
INSERT INTO test VALUES(2, 4);
INSERT INTO test VALUES(3, 1);
INSERT INTO test VALUES(3, 5);
INSERT INTO test VALUES(3, 6);
INSERT INTO test VALUES(4, 4);
INSERT INTO test VALUES(4, 5);

SELECT A.X, B.X, COUNT(A.Y) as N FROM test as A JOIN test as B WHERE A.Y==B.Y GROUP BY A.X, B.X;

The point is that the column I want to summarize is the same as the column I am joining on. This question is similar to these, but not exactly:

R Data.Table Join on Conditionals

How to self join a data.table on a condition

The key difference being that I want to summarize the index column, which seems impossible to do with by=.EACHI.

Allamerican answered 27/2, 2015 at 9:39 Comment(2)

what does your actual data look like - how many unique X's and Y's do you have? – Tessera 27/2, 2015 at 16:47

@Tessera -- well, for the large cases (where the cartesian join becomes a problem), it's on the order of 3,000 unique Ys, and 3,000 unique Xs, with about 2 million rows in the table d (combinations of X/Y). – Allamerican 27/2, 2015 at 20:11

If you can split your Y's into groups that don't have a large intersection of X's, you could do the computation by those groups first, resulting in a smaller intermediate table:

d[, grp := Y <= 3] # this particular split works best for OP data
d[, .SD[.SD, allow = T][, .N, by = .(X, i.X)], by = grp][,
    .(N = sum(N)), by = .(X, i.X)]

The intermediate table above has only 16 rows, as opposed to 26. Unfortunately I can't think of an easy way to create such grouping automatically.

Tessera answered 27/2, 2015 at 20:36 Comment(2)

Wow, there's some syntax for me to try to figure out -- thanks. I'd rather avoid the intermediate table completely if possible, but I'll look into this. – Allamerican 27/2, 2015 at 20:40

@sheffien I just did the exact same thing you did, but per grp first, and then summed the results (.SD is the piece of the original data.table corresponding to each grp). – Tessera 27/2, 2015 at 20:43

How about this one using foverlaps(). The more consecutive values of Y you've for each X, the lesser number of rows this'll produce compared to a cartesian join.

d = data.table(X=c(1,1,1,2,2,2,2,3,3,3,4,4), Y=c(1,2,3,1,2,3,4,1,5,6,4,5))
setorder(d, X)
d[, id := cumsum(c(0L, diff(Y)) != 1L), by=X]
dd = d[, .(start=Y[1L], end=Y[.N]), by=.(X,id)][, id := NULL][]

ans <- foverlaps(dd, setkey(dd, start, end))
ans[, count := pmin(abs(i.end-start+1L), abs(end-i.start+1L), 
                    abs(i.end-i.start+1L), abs(end-start+1L))]
ans[, .(count = sum(count)), by=.(X, i.X)][order(i.X, X)]
#     X i.X count
#  1: 1   1     3
#  2: 2   1     3
#  3: 3   1     1
#  4: 1   2     3
#  5: 2   2     4
#  6: 3   2     1
#  7: 4   2     1
#  8: 1   3     1
#  9: 2   3     1
# 10: 3   3     3
# 11: 4   3     1
# 12: 2   4     1
# 13: 3   4     1
# 14: 4   4     2

Note: make sure X and Y are integers for faster results. This is because joins on integer types are faster than on double types (foverlaps performs binary joins internally).

You can make this more memory efficient by using which=TRUE in foverlaps() and using the indices to generate count in the next step.

Haley answered 15/7, 2015 at 16:33 Comment(0)

You already have solution written in SQL so I suggest R package sqldf

Here's code:

library(sqldf)

result <- sqldf("SELECT A.X, B.X, COUNT(A.Y) as N FROM test as A JOIN test as B WHERE A.Y==B.Y GROUP BY A.X, B.X")

Crapulent answered 27/2, 2015 at 11:12 Comment(0)

Recommended topics

Hot tags