R - Efficient way to test whether a pair of vectors is disjoint

I want to know if two vectors have any elements in common. I don't care what the elements are, how many common elements there are, or what positions they occupy within either vector. I just need a simple, efficient function EIC(vec1, vec2) that returns TRUE if some element appears in both vec1 and vec2, and FALSE if the two vectors have no elements in common. We can assume that neither vec1 nor vec2 contains NA, but either may contain duplicated values.

I've thought of five ways to do this, but they all seem inefficient:

EIC.1 <- function(vec1, vec2) length(intersect(vec1, vec2)) > 0
# I want a function that will stop when it finds the first 
# common element between the vectors, and return TRUE. The
# intersect function will continue on and check whether there are
# any other common elements.

EIC.2 <- function(vec1, vec2) any(vec1 %in% vec2)

EIC.3 <- function(vec1, vec2) any(!is.na(match(vec1, vec2)))
# the match function goes to the trouble of finding the position
# of all matches; I don't need the position but just want to know
# if any exist

EIC.4 <- function(vec1, vec2) {
      uvec1 <- unique(vec1)
      uvec2 <- unique(vec2)
      length(unique(c(uvec1, uvec2))) < length(uvec1) + length(uvec2)
}

EIC.5 <- function(vec1, vec2) !!anyDuplicated(c(unique(vec1), unique(vec2)))
# per https://mcmap.net/q/965292/-how-to-test-whether-a-vector-contains-repetitive-elements#comment5931428_5263593
# I suspect this is the most efficient of the five, because
# anyDuplicated will stop looking when it comes to the first one,
# but I'm not sure about using !! to coerce to boolean type
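
For what it's worth, the coercion could also be written explicitly, since anyDuplicated returns the position of the first duplicate (or 0 when there is none); comparing against zero gives the same result as !! (call this variant EIC.5b):

EIC.5b <- function(vec1, vec2) anyDuplicated(c(unique(vec1), unique(vec2))) > 0
# same result as EIC.5, with the coercion to logical made explicit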

I will be using very long vectors (without any NAs, as previously mentioned) and will be running this function millions of times, which is why I am looking for something efficient. Here is some test data:

v1 <- c(9, 8, 75, 62)
v2 <- c(20, 75, 341, 987, 8)
v3 <- c(154, 62, 62, 143, 154, 95)
v4 <- c(12, 62, 12)

EIC <- EIC.1

EIC(v1, v2)
EIC(v1, v3)
EIC(v1, v4)
EIC(v2, v3)
EIC(v2, v4)
EIC(v3, v4)

Correct results are TRUE, TRUE, TRUE, FALSE, FALSE, TRUE.

Meekins answered 23/10, 2018 at 4:25 Comment(3)
Seems like this is more for CodeReview, since the code is functioning alright. Regardless, a quick microbenchmark showed me that 3 and 2 were very close, and both were an order of magnitude faster than 4, and really 1,2,5 are all close to the same performance. I'm inferring that you have larger data in mind, though, so your own benchmark could easily bring up other nuances of the code.Lesley
I think you mean "1,2,3 are all close to the same performance" (not "1,2,5"). I microbenchmarked the five functions on five different datasets (which I'll describe in more detail in an answer) and found that EIC.2 & EIC.3 were always the fastest (and very close to each other, as you said), with EIC.1 close behind, then EIC.5, and EIC.4 the slowest. The takeaway is, I think, use EIC.2 since it's more readable than EIC.3. However, if the vectors are different lengths, EIC.2 is more efficient if the shorter vector is first, so rearrange the vectors before calling the function.Meekins
It depends heavily on which vectors are passed to the functions. I don't recall off-hand which ones I used for my quick benchmarking (I didn't save it), but "no", I meant what I typed, I recall that 1,2,5 were similar in the test I ran. But in the end it depends so much on the vectors that arguing one way or the other without more context/perspective is not that productive. (I did not go to the lengths that you did in your answer, it's no surprise your results are different.) Good luck!Lesley

Not really an answer, just some comments:

  • EIC.1, EIC.2 and EIC.3 all use match() at some point.
  • EIC.1 goes through intersect(), which adds a little overhead, so it is a bit slower:
intersect <- function (x, y) 
{
    y <- as.vector(y)
    unique(y[match(as.vector(x), y, 0L)])
}
  • EIC.2 uses %in%, which is itself just a thin wrapper around match(), so it performs very close to EIC.3:
`%in%` <- function (x, table) match(x, table, nomatch = 0L) > 0L

You could shave a bit of time in some cases with this:

EIC.all <- function(vec1, vec2) !all(is.na(match(vec1, vec2)))

because the negation ! is performed on a scalar instead of a vector of size length(vec1).
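
A quick way to see the difference, assuming the microbenchmark package is installed (the vectors v and w below are just arbitrary examples):

set.seed(42)                                   # arbitrary example data
v <- sample(5000000L, 100000L, replace = TRUE)
w <- sample(5000000L, 100000L, replace = TRUE)

microbenchmark::microbenchmark(
    EIC.3   = any(!is.na(match(v, w))),        # negates length(v) logicals
    EIC.all = !all(is.na(match(v, w)))         # negates a single logical
)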

What you need is a C/C++ function that does the exact same thing as the internal match function but stops at the first match. You could have a look at the match5 C function: https://github.com/wch/r-source/blob/d1f8ef492464fd68320be9581bde4b09eadc03d6/src/main/unique.c#L1332
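
For illustration, here is a minimal Rcpp sketch of that idea (assuming the Rcpp package and a working toolchain; the name has_common_element is just an illustrative choice): hash one vector once, then stop scanning the other at the first hit.

library(Rcpp)

cppFunction(includes = "#include <unordered_set>", code = '
bool has_common_element(NumericVector vec1, NumericVector vec2) {
    // hash the elements of vec1 once ...
    std::unordered_set<double> seen(vec1.begin(), vec1.end());
    // ... then return as soon as an element of vec2 is found in the set
    for (int i = 0; i < vec2.size(); ++i) {
        if (seen.count(vec2[i]) > 0) return true;
    }
    return false;
}
')

has_common_element(c(9, 8, 75, 62), c(20, 75, 341, 987, 8))   # TRUE  (8 and 75 are shared)
has_common_element(c(20, 75, 341, 987, 8), c(12, 62, 12))     # FALSE (disjoint)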

Gynophore answered 2/2, 2022 at 19:39 Comment(0)

I tested the five functions I listed in my question (as @Lesley suggested in the comments). I used five different datasets, because I thought performance might differ depending on whether the vector pairs are mostly disjoint or mostly non-disjoint. (It turns out there's not much difference for EIC.1 through EIC.4; EIC.5, however, runs slower when most of the pairs are disjoint.)

Here's how I generated the datasets:

n <- 1400L

a1 <- replicate(n, sample(5000000L, 500L, replace = TRUE), simplify = FALSE)
b1 <- replicate(n, sample(5000000L, 2500L, replace = TRUE), simplify = FALSE)
# two lists of vectors, to be compared pairwise, where about 22% of the pairs have elements in common

a2 <- replicate(n, sample(800000L, 500L, replace = TRUE), simplify = FALSE)
b2 <- replicate(n, sample(800000L, 2500L, replace = TRUE), simplify = FALSE)
# two lists of vectors, to be compared pairwise, where about 79% of the pairs have elements in common

a3 <- replicate(n, sample(3250000L, 1500L, replace = TRUE), simplify = FALSE)
b3 <- replicate(n, sample(3250000L, 1500L, replace = TRUE), simplify = FALSE)
# two lists of vectors, equal in length, to be compared pairwise, where about 50% of the pairs have elements in common

And here are my results:

library(microbenchmark)

LL <- c(expression(sapply(1:n, function(k) EIC.1(v1[[k]], v2[[k]]))),
        expression(sapply(1:n, function(k) EIC.2(v1[[k]], v2[[k]]))), 
        expression(sapply(1:n, function(k) EIC.3(v1[[k]], v2[[k]]))),
        expression(sapply(1:n, function(k) EIC.4(v1[[k]], v2[[k]]))),
        expression(sapply(1:n, function(k) EIC.5(v1[[k]], v2[[k]]))) )

v1 <- a1
v2 <- b1
microbenchmark(list=LL)

Unit: milliseconds
                                             expr       min        lq     mean    median       uq      max neval
 sapply(1:n, function(k) EIC.1(v1[[k]], v2[[k]])) 110.59374 110.98621 113.5366 112.52576 114.4162 130.0801   100
 sapply(1:n, function(k) EIC.2(v1[[k]], v2[[k]]))  97.18203  97.64194 101.4938  99.20129 101.6032 158.8913   100
 sapply(1:n, function(k) EIC.3(v1[[k]], v2[[k]]))  96.98262  98.73502 100.5121  99.06029 100.6465 136.2520   100
 sapply(1:n, function(k) EIC.4(v1[[k]], v2[[k]])) 255.85385 256.67103 262.0515 258.23332 265.1787 291.9498   100
 sapply(1:n, function(k) EIC.5(v1[[k]], v2[[k]])) 230.49910 231.25642 236.2385 233.05208 237.7731 280.7453   100

v1 <- a2
v2 <- b2
microbenchmark(list=LL)

Unit: milliseconds
                                             expr       min        lq     mean   median       uq      max neval
 sapply(1:n, function(k) EIC.1(v1[[k]], v2[[k]])) 112.40455 112.78578 114.8205 114.4925 114.9898 126.2302   100
 sapply(1:n, function(k) EIC.2(v1[[k]], v2[[k]]))  98.45717  98.87847 101.7272 100.5070 101.0258 134.8737   100
 sapply(1:n, function(k) EIC.3(v1[[k]], v2[[k]]))  98.15024  98.59084 101.1340 100.2553 101.2907 131.4896   100
 sapply(1:n, function(k) EIC.4(v1[[k]], v2[[k]])) 258.48673 259.18759 264.2449 260.1710 265.2686 307.0624   100
 sapply(1:n, function(k) EIC.5(v1[[k]], v2[[k]])) 200.79988 201.52592 205.8434 203.3817 207.2203 244.2715   100

v1 <- a3
v2 <- b3
microbenchmark(list=LL)

Unit: milliseconds
                                             expr      min       lq     mean   median       uq      max neval
 sapply(1:n, function(k) EIC.1(v1[[k]], v2[[k]])) 134.0820 134.5529 135.4400 134.6922 135.6203 142.1575   100
 sapply(1:n, function(k) EIC.2(v1[[k]], v2[[k]])) 119.7959 120.1119 122.3887 120.2729 122.2338 158.0306   100
 sapply(1:n, function(k) EIC.3(v1[[k]], v2[[k]])) 119.7705 120.2145 122.3458 121.9361 122.4224 150.4227   100
 sapply(1:n, function(k) EIC.4(v1[[k]], v2[[k]])) 257.0928 259.0730 263.2403 259.6671 263.7227 318.9604   100
 sapply(1:n, function(k) EIC.5(v1[[k]], v2[[k]])) 226.4821 227.0798 230.2878 228.4882 231.3292 258.4599   100

v1 <- b1  # the longer vector is now vec1
v2 <- a1  
microbenchmark(list=LL)

Unit: milliseconds
                                             expr      min       lq     mean   median       uq      max neval
 sapply(1:n, function(k) EIC.1(v1[[k]], v2[[k]])) 199.2799 201.3817 202.5054 201.6378 202.7534 214.8660   100
 sapply(1:n, function(k) EIC.2(v1[[k]], v2[[k]])) 187.5226 187.9299 188.9177 188.1184 189.8541 196.1020   100
 sapply(1:n, function(k) EIC.3(v1[[k]], v2[[k]])) 187.8891 188.3417 190.5641 190.1809 190.8307 219.4735   100
 sapply(1:n, function(k) EIC.4(v1[[k]], v2[[k]])) 255.1007 255.8905 260.1282 256.8316 262.1560 288.4900   100
 sapply(1:n, function(k) EIC.5(v1[[k]], v2[[k]])) 237.7409 238.4515 241.5251 239.9415 243.5631 266.5916   100

v1 <- b2
v2 <- a2
microbenchmark(list=LL)

Unit: milliseconds
                                             expr      min       lq     mean   median       uq      max neval
 sapply(1:n, function(k) EIC.1(v1[[k]], v2[[k]])) 198.8747 201.2476 202.1573 201.5215 202.3886 207.7772   100
 sapply(1:n, function(k) EIC.2(v1[[k]], v2[[k]])) 185.5260 185.7983 187.8099 185.9842 188.3947 225.7553   100
 sapply(1:n, function(k) EIC.3(v1[[k]], v2[[k]])) 185.8022 186.1824 188.8937 187.9226 188.6763 221.2442   100
 sapply(1:n, function(k) EIC.4(v1[[k]], v2[[k]])) 257.6607 258.5063 262.3677 259.6778 264.6313 304.4813   100
 sapply(1:n, function(k) EIC.5(v1[[k]], v2[[k]])) 230.5553 231.3261 233.9914 232.9138 235.0349 260.4950   100

In all cases, EIC.2 and EIC.3 are the fastest (and very close to each other), with EIC.1 not far behind. But notice that both are considerably faster when the shorter vector comes first. For example, when vec1 comes from a1 (vectors of length 500) and vec2 from b1 (vectors of length 2500), EIC.2 has a median of about 99 milliseconds; swap them so that vec1 comes from b1 and vec2 from a1, and EIC.2 slows to about 188 milliseconds. So for greater efficiency, it's worth checking which vector is shorter before calling EIC.2, or rewriting EIC.2 so that it always tests [shorter vector] %in% [longer vector], as in the sketch below.
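
A minimal sketch of that length-aware variant (the name EIC.2s is just illustrative):

EIC.2s <- function(vec1, vec2) {
    # always test [shorter vector] %in% [longer vector],
    # which was the faster orientation in the benchmarks above
    if (length(vec1) <= length(vec2)) any(vec1 %in% vec2) else any(vec2 %in% vec1)
}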

Meekins answered 6/12, 2018 at 9:27 Comment(0)
