Performance Review
The other answers neglect one important aspect: performance. So let me briefly review that. To make this realistic, I create two integer vectors with 100,000 elements each.
using StatsBase
a = sample(1:1_000_000, 100_000)
b = sample(1:1_000_000, 100_000)
To know what decent performance would look like, I did the same thing in R, which gave a median time of 4.4 ms:
# R code
a <- sample.int(1000000, 100000)
b <- sample.int(1000000, 100000)
microbenchmark::microbenchmark(a %in% b)
Unit: milliseconds
     expr     min       lq     mean   median       uq      max neval
 a %in% b 4.09538 4.191653 5.517475 4.376034 5.765283 65.50126   100
The performant Solution
findall(in(b),a)
5.039 ms (27 allocations: 3.63 MiB)
Slower than R, but not by much. The syntax, however, could really use some improvement.
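For anyone who wants to reproduce a timing like this without extra packages, here is a minimal sketch using only Base. Two assumptions of mine: rand stands in for StatsBase.sample (both draw with replacement by default), and best_time is a hypothetical helper, not something from the measurements above. The numbers it prints depend on your machine and will not match mine exactly.

```julia
# Base-only stand-in for StatsBase.sample (draws with replacement).
a = rand(1:1_000_000, 100_000)
b = rand(1:1_000_000, 100_000)

# Warm-up call so that compilation time is not part of the measurement.
findall(in(b), a)

# Hypothetical helper: best wall-clock time in seconds over n runs of f.
best_time(f, n = 5) = minimum(@elapsed(f()) for _ in 1:n)

t_fast = best_time(() -> findall(in(b), a))
println("findall(in(b), a): ", round(t_fast * 1000, digits = 3), " ms")
```

Taking the minimum over a few runs reduces noise from other processes; a dedicated benchmarking package gives more careful statistics.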
The non-performant Solutions
a .∈ Ref(b)
in.(a,Ref(b))
findall(x -> x in b, a)
3.879468 seconds (6 allocations: 16.672 KiB)
3.866001 seconds (6 allocations: 16.672 KiB)
3.936978 seconds (178.88 k allocations: 5.788 MiB)
About 800 times slower than the performant solution (and almost 900 times slower than R): this is really nothing to write home about. In my opinion the syntax of these three isn't very good either, but at least the first one looks better to me than the 'performant solution'.
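The reason these three are slow is that in, applied to a plain Vector, scans b from the start for every single element of a. A minimal sketch of keeping the nicer broadcast syntax while avoiding that scan, assuming it is acceptable to build a Set from b once up front. The toy vectors and the name bset are mine, and I have not timed this variant on the 100,000-element vectors above.

```julia
# Small toy vectors so the result can be checked by eye (not the benchmark data).
a = [3, 1, 4, 1, 5, 9, 2, 6]
b = [1, 2, 3]

# Build the lookup structure once. Membership tests against a Set are hash
# lookups instead of a linear scan over all of b.
bset = Set(b)

mask = a .∈ Ref(bset)       # Bool mask, comparable to R's a %in% b
same = in.(a, Ref(bset))    # the same thing without the Unicode operator

subset = a[mask]            # elements of a that also occur in b
```

The mask can be used directly for subsetting, which is what a %in% b is typically used for in R.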
The is-not-a Solution
This one here
indexin(a,b)
5.287 ms (38 allocations: 6.53 MiB)
is performant, but for me it is not a solution. The result contains nothing wherever an element is not in the other vector. In my opinion the main application is subsetting a vector, and that does not work with this result:
a[indexin(b,a)]
ERROR: ArgumentError: unable to check bounds for indices of type Nothing
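If you do want to subset via indexin, the nothing entries have to be dropped first. A minimal sketch on toy data of my own (the names raw and idx are just for illustration), with findall(in(b), a) shown next to it for comparison. Note that the two answer slightly different questions: indexin(b, a) walks over b, findall(in(b), a) walks over a.

```julia
a = [3, 1, 4, 1, 5, 9, 2, 6]
b = [1, 2, 3, 7]

# For each element of b: the first index where it occurs in a, or nothing.
raw = indexin(b, a)

# Keep only the real indices; the typed comprehension yields a Vector{Int}.
idx = Int[i for i in raw if i !== nothing]

from_indexin = a[idx]                 # elements of b that occur in a, in b's order
from_findall = a[findall(in(b), a)]   # elements of a that occur in b, in a's order
```

Here 7 is not in a, so its nothing entry is dropped; the duplicate 1 in a shows up only in the findall variant, because indexin reports just the first match.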