Vectorized "in" function in julia?
Asked Answered
M

6

17

I often want to loop over a long array or column of a dataframe, and for each item, see if it is a member of another array. Rather than doing

giant_list = ["a", "c", "j"]
good_letters = ["a", "b"]
isin = falses(size(giant_list,1))
for i=1:size(giant_list,1)
    isin[i] = giant_list[i] in good_letters
end

Is there any vectorized (doubly-vectorized?) way to do this in julia? In analogy with the basic operators I want to do something like

isin = giant_list .in good_letters

I realize this may not be possible, but I just wanted to make sure I wasn't missing something. I know I could probably use DefaultDict from DataStructures to do the similar but don't know of anything in base.

Misnomer answered 15/4, 2015 at 21:31 Comment(0)
F
15

The indexin function does something similar to what you want:

indexin(a, b)

Returns a vector containing the highest index in b for each value in a that is a member of b. The output vector contains 0 wherever a is not a member of b.

Since you want a boolean for each element in your giant_list (instead of the index in good_letters), you can simply do:

julia> indexin(giant_list, good_letters) .> 0
3-element BitArray{1}:
  true
 false
 false

The implementation of indexin is very straightforward, and points the way to how you might optimize this if you don't care about the indices in b:

function vectorin(a, b)
    bset = Set(b)
    [i in bset for i in a]
end

Only a limited set of names may be used as infix operators, so it's not possible to use it as an infix operator.

Fonsie answered 15/4, 2015 at 21:54 Comment(1)
This works nicely with DataArrays, although it doesn't play nicely with NAs (luckily that doesn't matter for me).Misnomer
S
9

There are a handful of modern (i.e. Julia v1.0) solutions to this problem:

First, an update to the scalar strategy. Rather than using a 1-element tuple or array, scalar broadcasting can be achieved using a Ref object:

julia> in.(giant_list, Ref(good_letters))
3-element BitArray{1}:
  true
 false
 false

This same result can be achieved by broadcasting the infix (\inTAB) operator:

julia> giant_list .∈ Ref(good_letters)
3-element BitArray{1}:
  true
 false
 false

Additionally, calling in with one argument creates a Base.Fix2, which may later be applied via a broadcasted call. This seems to have limited benefits compared to simply defining a function, though.

julia> is_good1 = in(good_letters);
       is_good2(x) = x in good_letters;

julia> is_good1.(giant_list)
3-element BitArray{1}:
  true
 false
 false

julia> is_good2.(giant_list)
3-element BitArray{1}:
  true
 false
 false

All in all, using .∈ with a Ref will probably lead to the shortest, cleanest code.

Statute answered 12/9, 2018 at 0:28 Comment(0)
A
4

You can vectorize in quite easily in Julia v0.6, using the unified broadcasting syntax.

julia> in.(giant_list, (good_letters,))
3-element Array{Bool,1}:
  true
 false
 false

Note the scalarification of good_letters by using a one-element tuple. Alternatively, you can use a Scalar type such as the one introduced in StaticArrays.jl.

Julia v0.5 supports the same syntax, but requires a specialized function for scalarificiation (or the Scalar type mentioned earlier):

scalar(x) = setindex!(Array{typeof(x)}(), x)

after which

julia> in.(giant_list, scalar(good_letters))
3-element Array{Bool,1}:
  true
 false
 false
Ahmedahmedabad answered 17/10, 2016 at 13:7 Comment(4)
I think I'm missing something here: the command above and several variations on throw errors in 0.5.0.Misnomer
@ARM: Ah, sorry, it turns out I had been testing on 0.6.0, where the exact command works. On 0.5.0 you need a zero-dimensional array (or the Scalar type).Ahmedahmedabad
@Misnomer I have updated my solution with an answer that works in 0.5 also.Ahmedahmedabad
Note: this works for Arrays but not DataArrays on 0.6. Matt B.'s answer continues to work for both.Misnomer
A
1

findin() doesn't give you a boolean mask, but you can easily use it to subset an array/DataFrame for values that are contained in another array:

julia> giant_list[findin(giant_list, good_letters)]
1-element Array{String,1}:
 "a"
Arboriculture answered 25/5, 2017 at 17:32 Comment(1)
Note that findin is deprecated in 0.7 and removed from 1.0. findall(in(...), ...) replaces it.Parent
C
1

Performance Review

The other answers are neglecting one important aspect - performance. So, let me briefly review that. To make this realistic I create two Integer vectors with 100,000 elements each.

using StatsBase

a = sample(1:1_000_000, 100_000)
b = sample(1:1_000_000, 100_000)

In order to know what a decent performance would be, I did the same thing in R, leading to a median performance of 4.4 ms:

# R code

a <- sample.int(1000000, 100000)
b <- sample.int(1000000, 100000)

microbenchmark::microbenchmark(a %in% b)

Unit: milliseconds
     expr     min       lq     mean   median       uq      max neval
 a %in% b 4.09538 4.191653 5.517475 4.376034 5.765283 65.50126   100

The performant Solution

findall(in(b),a)

5.039 ms (27 allocations: 3.63 MiB)

Slower than R, but not by much. The syntax, however, could really use some improvement.

The imperformant Solutions

a .∈ Ref(b)
in.(a,Ref(b))
findall(x -> x in b, a)

3.879468 seconds (6 allocations: 16.672 KiB)
3.866001 seconds (6 allocations: 16.672 KiB)
3.936978 seconds (178.88 k allocations: 5.788 MiB)

800 times slower (almost 1000 times slower than R) - this is really nothing to write home about. In my opinion the syntax of these three also isn't very good, but at least the first solution looks better to me than the 'performant solution'.

The is-not-a Solution

This one here

indexin(a,b)

5.287 ms (38 allocations: 6.53 MiB)

is performant, but for me it is not a solution. It contains nothing elements where the element is not in the other vector. In my opinion the main application is to subset a vector, and this does not work with this solution.

a[indexin(b,a)]

ERROR: ArgumentError: unable to check bounds for indices of type Nothing
Countersignature answered 30/10, 2020 at 14:46 Comment(0)
N
0

Just found another solution on the Julia Discourse, which uses IMO a slightly more concise syntax:

in(giant_list).(good_letters)

gives you a BitVector. To check whether there is a generic match between the sequences, you can use the any() function:

any(in(giant_list).(good_letters)) # true
Navarro answered 9/3, 2023 at 15:31 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.