With a 2-column data.table
, I'd like to summarize the pairwise relationships in column 1 by summing the number of shared elements in column 2. In other words, how many shared Y elements does each pairwise combination of X-values have?
For example, I can do this in a 2-step process, first doing a cartesian cross join, then summarizing it like so:
d = data.table(X=c(1,1,1,2,2,2,2,3,3,3,4,4), Y=c(1,2,3,1,2,3,4,1,5,6,4,5))
setkey(d, Y)
d2 = d[d, allow.cartesian=TRUE]
d2[, .N, by=c("X", "i.X")]
# X i.X N
#1: 1 1 3
#2: 2 1 3
#3: 3 1 1
#4: 1 2 3
#5: 2 2 4
#6: 3 2 1
#7: 1 3 1
#8: 2 3 1
#9: 3 3 3
#10: 4 2 1
#11: 2 4 1
#12: 4 4 2
#13: 4 3 1
#14: 3 4 1
The second row of this result indicates, that X=1
shares 3 Y-values with X=2
; while X=3
shares only 1 y-value with X=4
.
Is there any way to do this while bypassing the cartesian join step, which leads to large inefficient tables? I want to do something like this on a table with millions of rows, and the cartesian join runs into the 2^31
vector size limit (in addition to becoming slow).
I'm imagining something like this:
d[d, list(X, length(Y)), by=c("X", "i.X")]
But this gives the error i.X not found
I can do this in SQL with the code below -- but just can't figure out how to translate this into data.table syntax:
CREATE TABLE test (X integer, Y integer);
INSERT INTO test VALUES(1, 1);
INSERT INTO test VALUES(1, 2);
INSERT INTO test VALUES(1, 3);
INSERT INTO test VALUES(2, 1);
INSERT INTO test VALUES(2, 2);
INSERT INTO test VALUES(2, 3);
INSERT INTO test VALUES(2, 4);
INSERT INTO test VALUES(3, 1);
INSERT INTO test VALUES(3, 5);
INSERT INTO test VALUES(3, 6);
INSERT INTO test VALUES(4, 4);
INSERT INTO test VALUES(4, 5);
SELECT A.X, B.X, COUNT(A.Y) as N FROM test as A JOIN test as B WHERE A.Y==B.Y GROUP BY A.X, B.X;
The point is that the column I want to summarize is the same as the column I am joining on. This question is similar to these, but not exactly:
R Data.Table Join on Conditionals
How to self join a data.table on a condition
The key difference being that I want to summarize the index column, which seems impossible to do with by=.EACHI.
X
's andY
's do you have? – Tessera