I benchmarked the five functions I listed in my question (as @r2evans suggested). I used five test configurations (three datasets, two of which I also ran with the arguments swapped), because I thought performance might depend on whether the vector pairs are mostly disjoint or mostly overlapping. (It turns out there's not much difference for EIC.1 through EIC.4; EIC.5, however, runs slower when most of the pairs are disjoint.)
Here's how I generated the datasets:
n <- 1400L
a1 <- replicate(n, sample(5000000L, 500L, replace = TRUE), simplify = FALSE)
b1 <- replicate(n, sample(5000000L, 2500L, replace = TRUE), simplify = FALSE)
# two lists of vectors, to be compared pairwise, where about 22% of the pairs have elements in common
a2 <- replicate(n, sample(800000L, 500L, replace = TRUE), simplify = FALSE)
b2 <- replicate(n, sample(800000L, 2500L, replace = TRUE), simplify = FALSE)
# two lists of vectors, to be compared pairwise, where about 79% of the pairs have elements in common
a3 <- replicate(n, sample(3250000L, 1500L, replace = TRUE), simplify = FALSE)
b3 <- replicate(n, sample(3250000L, 1500L, replace = TRUE), simplify = FALSE)
# two lists of vectors, equal in length, to be compared pairwise, where about 50% of the pairs have elements in common
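As a rough sanity check on the percentages in the comments above, the chance that a pair shares at least one element can be approximated from the sampling range and the two vector lengths. This is only a sketch: the helper name p_overlap is made up for illustration, and the formula ignores duplicates within a vector (rare at these sizes).

```r
# Hypothetical helper (not part of the original benchmark):
# approximate probability that two vectors of lengths len1 and len2,
# each sampled from 1..N, have at least one element in common.
# Each of the len1 draws misses the other vector with probability
# (1 - len2 / N), assuming the other vector has no duplicates.
p_overlap <- function(N, len1, len2) 1 - (1 - len2 / N)^len1

round(p_overlap(5000000, 500, 2500), 2)   # 0.22
round(p_overlap(800000, 500, 2500), 2)    # 0.79
round(p_overlap(3250000, 1500, 1500), 2)  # 0.5
```

An empirical count on the generated lists, e.g. mean(mapply(function(x, y) any(x %in% y), a1, b1)), should land close to these values.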
And here are my results:
library(microbenchmark)
LL <- c(expression(sapply(1:n, function(k) EIC.1(v1[[k]], v2[[k]]))),
expression(sapply(1:n, function(k) EIC.2(v1[[k]], v2[[k]]))),
expression(sapply(1:n, function(k) EIC.3(v1[[k]], v2[[k]]))),
expression(sapply(1:n, function(k) EIC.4(v1[[k]], v2[[k]]))),
expression(sapply(1:n, function(k) EIC.5(v1[[k]], v2[[k]]))) )
v1 <- a1
v2 <- b1
microbenchmark(list=LL)
Unit: milliseconds
expr min lq mean median uq max neval
sapply(1:n, function(k) EIC.1(v1[[k]], v2[[k]])) 110.59374 110.98621 113.5366 112.52576 114.4162 130.0801 100
sapply(1:n, function(k) EIC.2(v1[[k]], v2[[k]])) 97.18203 97.64194 101.4938 99.20129 101.6032 158.8913 100
sapply(1:n, function(k) EIC.3(v1[[k]], v2[[k]])) 96.98262 98.73502 100.5121 99.06029 100.6465 136.2520 100
sapply(1:n, function(k) EIC.4(v1[[k]], v2[[k]])) 255.85385 256.67103 262.0515 258.23332 265.1787 291.9498 100
sapply(1:n, function(k) EIC.5(v1[[k]], v2[[k]])) 230.49910 231.25642 236.2385 233.05208 237.7731 280.7453 100
v1 <- a2
v2 <- b2
microbenchmark(list=LL)
Unit: milliseconds
expr min lq mean median uq max neval
sapply(1:n, function(k) EIC.1(v1[[k]], v2[[k]])) 112.40455 112.78578 114.8205 114.4925 114.9898 126.2302 100
sapply(1:n, function(k) EIC.2(v1[[k]], v2[[k]])) 98.45717 98.87847 101.7272 100.5070 101.0258 134.8737 100
sapply(1:n, function(k) EIC.3(v1[[k]], v2[[k]])) 98.15024 98.59084 101.1340 100.2553 101.2907 131.4896 100
sapply(1:n, function(k) EIC.4(v1[[k]], v2[[k]])) 258.48673 259.18759 264.2449 260.1710 265.2686 307.0624 100
sapply(1:n, function(k) EIC.5(v1[[k]], v2[[k]])) 200.79988 201.52592 205.8434 203.3817 207.2203 244.2715 100
v1 <- a3
v2 <- b3
microbenchmark(list=LL)
Unit: milliseconds
expr min lq mean median uq max neval
sapply(1:n, function(k) EIC.1(v1[[k]], v2[[k]])) 134.0820 134.5529 135.4400 134.6922 135.6203 142.1575 100
sapply(1:n, function(k) EIC.2(v1[[k]], v2[[k]])) 119.7959 120.1119 122.3887 120.2729 122.2338 158.0306 100
sapply(1:n, function(k) EIC.3(v1[[k]], v2[[k]])) 119.7705 120.2145 122.3458 121.9361 122.4224 150.4227 100
sapply(1:n, function(k) EIC.4(v1[[k]], v2[[k]])) 257.0928 259.0730 263.2403 259.6671 263.7227 318.9604 100
sapply(1:n, function(k) EIC.5(v1[[k]], v2[[k]])) 226.4821 227.0798 230.2878 228.4882 231.3292 258.4599 100
v1 <- b1 # the longer vector is now vec1
v2 <- a1
microbenchmark(list=LL)
Unit: milliseconds
expr min lq mean median uq max neval
sapply(1:n, function(k) EIC.1(v1[[k]], v2[[k]])) 199.2799 201.3817 202.5054 201.6378 202.7534 214.8660 100
sapply(1:n, function(k) EIC.2(v1[[k]], v2[[k]])) 187.5226 187.9299 188.9177 188.1184 189.8541 196.1020 100
sapply(1:n, function(k) EIC.3(v1[[k]], v2[[k]])) 187.8891 188.3417 190.5641 190.1809 190.8307 219.4735 100
sapply(1:n, function(k) EIC.4(v1[[k]], v2[[k]])) 255.1007 255.8905 260.1282 256.8316 262.1560 288.4900 100
sapply(1:n, function(k) EIC.5(v1[[k]], v2[[k]])) 237.7409 238.4515 241.5251 239.9415 243.5631 266.5916 100
v1 <- b2
v2 <- a2
microbenchmark(list=LL)
Unit: milliseconds
expr min lq mean median uq max neval
sapply(1:n, function(k) EIC.1(v1[[k]], v2[[k]])) 198.8747 201.2476 202.1573 201.5215 202.3886 207.7772 100
sapply(1:n, function(k) EIC.2(v1[[k]], v2[[k]])) 185.5260 185.7983 187.8099 185.9842 188.3947 225.7553 100
sapply(1:n, function(k) EIC.3(v1[[k]], v2[[k]])) 185.8022 186.1824 188.8937 187.9226 188.6763 221.2442 100
sapply(1:n, function(k) EIC.4(v1[[k]], v2[[k]])) 257.6607 258.5063 262.3677 259.6778 264.6313 304.4813 100
sapply(1:n, function(k) EIC.5(v1[[k]], v2[[k]])) 230.5553 231.3261 233.9914 232.9138 235.0349 260.4950 100
In all cases, EIC.2 and EIC.3 are the fastest (and very close to each other), with EIC.1 not far behind. But notice that all three are much more efficient when the shorter vector comes first. For example, where vec1 is a1 (length 500) and vec2 is b1 (length 2500), EIC.2 has a median of 99 milliseconds. But when I switch them so that vec1 is b1 and vec2 is a1, EIC.2 slows down to 188 milliseconds. So for greater efficiency, it's worth checking which vector is longer before calling EIC.2. (Or else rewriting EIC.2 so that it always tests [shorter vector] %in% [longer vector].)
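Here is a minimal sketch of that "shorter vector first" idea. The body of EIC.2 isn't repeated in this answer, so I'm assuming here that it boils down to an any(... %in% ...) test; the name EIC.shortfirst is made up for illustration.

```r
# Hypothetical wrapper (name and body are assumptions, not the original EIC.2):
# always look up the shorter vector's elements in the longer one.
EIC.shortfirst <- function(vec1, vec2) {
  if (length(vec1) > length(vec2)) {
    any(vec2 %in% vec1)
  } else {
    any(vec1 %in% vec2)
  }
}

# The answer does not depend on argument order.
EIC.shortfirst(c(1L, 2L, 3L), c(3L, 9L))  # TRUE
EIC.shortfirst(c(3L, 9L), c(1L, 2L, 3L))  # TRUE
EIC.shortfirst(1:3, 7:20)                 # FALSE
```

Whether the extra length() check pays off on other data is something to confirm with your own benchmark; the timings above only cover vectors of length 500 to 2500.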
microbenchmark showed me that 3 and 2 were very close, and both were an order of magnitude faster than 5, and really 1,2,5 are all close to the same performance. I'm inferring that you have larger data in mind, though, so your own benchmark could easily bring up other nuances of the code. – Lesley