Julia: does an Array contain a specific sub-array
Asked Answered
B

7

14

In julia we can check if an array contains a value, like so:

> 6 in [4,6,5]
true

However this returns false, when attempting to check for a sub-array in a specific order:

> [4,6] in [4,6,5]
false

What is the correct syntax to verify if a specific sub-array exists in an array?

Breeks answered 1/4, 2016 at 0:20 Comment(4)
The second result in the question does not match its description. It is a tuple of 4 and the first result.Selfrespect
Package Iterators.jl also provides a useful function subsets, and you can write [4,6] in subsets([4,5,6]).Visional
That doesn't give the correct result, and even if it did, it doesn't scale at all (I benchmarked all of these with different lengths of vectors with Int64s)Minor
I misunderstood the question, for those who would like to check whether each element of array A(not consider A as a whole sequence) is included in another array B, setdiff(A, B) |> isempty is sufficient to do the job.Visional
S
6

For the third condition i.e. vector [4,6] appears as a sub-vector of 4,6,5 the following function is suggested:

issubvec(v,big) = 
  any([v == slice(big,i:(i+length(v)-1)) for i=1:(length(big)-length(v)+1)])

For the second condition, that is, give a boolean for each element in els vectors which appears in set vector, the following is suggested:

function vecin(els,set)
  res = zeros(Bool,size(els))
  res[findin(els,set)]=true
  res
end

With the vector in the OP, these result in:

julia> vecin([4,6],[4,6,5])
2-element Array{Bool,1}:
 true
 true

julia> issubvec([4,6],[4,6,5])
true
Selfrespect answered 1/4, 2016 at 0:43 Comment(3)
issubvec does return the correct result, but is also not very performant, because it is making many allocations. It's a good idea to use @time to see performance suffers due to excessive allocations.Minor
issubvec is certainly unoptimized, @ScottJones, but its logic is very clear - which was my intention. The function you wrote is better (and even more optimized algorithms for searching substrings/subvectors exist). I think such a generic subvectors functions might fit in Base (with similar names to string functions).Selfrespect
Actually, I had to struggle with the logic of issubvec, with the combination of array comprehension and using the slice and any functions. That's not meant as a criticism, I love seeing the powerful things that can be done with Julia's array functions, but coming from C/C++/Java etc. I had to twist my brain to comprehend it. Also, I've seen that short sweet code like that often doesn't scale, and I'm a performance guy 🤓Minor
R
12

I think it is worth mentioning that in Julia 1.0 you have the function issubset

> issubset([4,6], [4,6,5])
true

You can also quite conveniently call it using the \subseteq latex symbol

> [4,6] ⊆ [4,6,5]
true

This looks pretty optimized to me:

> using Random

> x, y = randperm(10^3)[1:10^2], randperm(10^3);

> @btime issubset(x, y);
16.153 μs (12 allocations: 45.96 KiB)
Rammer answered 1/11, 2018 at 10:34 Comment(2)
Wow, very nice, this should be the selected answers. Still works in julia 1.2.0Noach
Note that a subset is different from a subarray. [6,4] is not a subarray of [4,6,5].Cooke
M
7

It takes a little bit of code to make a function that performs well, but this is much faster than the issubvec version above:

function subset2(x,y)
    lenx = length(x)
    first = x[1]
    if lenx == 1
        return findnext(y, first, 1) != 0
    end
    leny = length(y)
    lim = length(y) - length(x) + 1
    cur = 1
    while (cur = findnext(y, first, cur)) != 0
        cur > lim && break
        beg = cur
        @inbounds for i = 2:lenx
            y[beg += 1] != x[i] && (beg = 0 ; break)
        end
        beg != 0 && return true
        cur += 1
    end
    false
end

Note: it would also be much more useful if the function actually returned the position of the beginning of the subarray if found, or 0 if not, similarly to the findfirst/findnext functions.

Timing information (the second one is using my subset2 function):

  0.005273 seconds (65.70 k allocations: 4.073 MB)
  0.000086 seconds (4 allocations: 160 bytes)
Minor answered 2/4, 2016 at 0:16 Comment(8)
The first @time result (for issubvec) looks like it might include compilation - it is too much of an outlier for such a simple call. Can you recheck (with a compile run before timing)?Selfrespect
Not an outlier - I of course compiled before running (without the @time macro). I also tested various lengths, that one I showed was testing with vector of length 64K, searching for a sequence of 4 (the last 4 values in the vector). issubvec seems to have O(n) allocations, where n is the length of y.Minor
OK. The test case is important. If you add the test run code exactly, I can see if using Julia 0.5 vs. 0.4 can be important in this case.Selfrespect
Maybe better to publish as a gist, and put the link here?Minor
Let's just leave it. Really, subset2 is more optimized (and if one wants to push, there are some more optimizations), but it might be for another discussion.Selfrespect
Yes, at least it's a starting point for people who need something fast, and I think your issubvec is demonstrative of what can be done with the array functions in the standard library.Minor
Does someone know whether this approach is broken with current julia versions i.e. what could be the reason. julia 1.5.3 yields: ERROR: LoadError: MethodError: no method matching findnext(::Array{String,1}, ::String, ::Int64) Closest candidates are: findnext(::Regex, ::Union{String, SubString}, ::Integer) at regex.jl:299 findnext(::Regex, ::AbstractString, ::Integer) at regex.jl:324 findnext(::Base.RegexAndMatchData, ::Any, ::Any) at regex.jl:460 Ames
This doesn't work on julia 1.7. Tried to write upgraded code here, but stackoverflow won't format correctly multiline code in comments. Pb is that findnext has changed. One could write, for instance, the beginning of the while loop like this: while !isnothing(begin cur = findnext(==(first), y, cur) end).Coachwork
S
6

For the third condition i.e. vector [4,6] appears as a sub-vector of 4,6,5 the following function is suggested:

issubvec(v,big) = 
  any([v == slice(big,i:(i+length(v)-1)) for i=1:(length(big)-length(v)+1)])

For the second condition, that is, give a boolean for each element in els vectors which appears in set vector, the following is suggested:

function vecin(els,set)
  res = zeros(Bool,size(els))
  res[findin(els,set)]=true
  res
end

With the vector in the OP, these result in:

julia> vecin([4,6],[4,6,5])
2-element Array{Bool,1}:
 true
 true

julia> issubvec([4,6],[4,6,5])
true
Selfrespect answered 1/4, 2016 at 0:43 Comment(3)
issubvec does return the correct result, but is also not very performant, because it is making many allocations. It's a good idea to use @time to see performance suffers due to excessive allocations.Minor
issubvec is certainly unoptimized, @ScottJones, but its logic is very clear - which was my intention. The function you wrote is better (and even more optimized algorithms for searching substrings/subvectors exist). I think such a generic subvectors functions might fit in Base (with similar names to string functions).Selfrespect
Actually, I had to struggle with the logic of issubvec, with the combination of array comprehension and using the slice and any functions. That's not meant as a criticism, I love seeing the powerful things that can be done with Julia's array functions, but coming from C/C++/Java etc. I had to twist my brain to comprehend it. Also, I've seen that short sweet code like that often doesn't scale, and I'm a performance guy 🤓Minor
I
3

note that you can now vectorize in with a dot:

julia> in([4,6,5]).([4, 6])
2-element BitArray{1}:
 true
 true

and chain with all to get your answer:

julia> all(in([4,6,5]).([4, 6]))
true
Ingeringersoll answered 30/10, 2019 at 8:0 Comment(1)
Nice. What if you want to avoid repeated item? For example all(in([4,6,5]).([4, 6, 6])) should return false, not true.Inhabited
H
2

There is no standard Julia library function to determine if a particular sequence occurs as a subsequence of another. Probably this is because this is actually a fairly trickly problem (known as the String-searching algorithm) and quite how to do it depends on whether you'll be searching repeatedly, whether you want to do multiple matches, have multiple patterns, want fuzzy matches, etc.

The other answers here give reasonable answers but some are old and Julia has improved, and I wanted to offer a slightly more idiomatic solution.

function issubarray(needle, haystack)
    getView(vec, i, len) = view(vec, i:i+len-1)
    ithview(i) = getView(haystack, i, length(needle))
    return any(i -> ithview(i) == needle, 1:length(haystack)-length(needle)+1)
end

This is lightening fast and requires almost no memory - Julia's view is lightweight and efficient. And, as always with Julia, the solution is generally to simply define more functions.

Hussar answered 20/12, 2022 at 7:59 Comment(0)
L
1

I used this recently to find subsequences in arrays of integers. It's not as good or as fast as @scott's subset2(x,y)... but it returns the indices.

function findsequence(arr::Array{Int64}, seq::Array{Int64})
    indices = Int64[]
    i = 1
    n = length(seq)
    if n == 1
        while true
            occurrence = findnext(arr, seq[1], i)
            if occurrence == 0
                break
            else
                push!(indices, occurrence)
                i = occurrence +1
            end
        end
    else
        while true
            occurrence = Base._searchindex(arr, seq, i)
            if occurrence == 0
                break
            else
                push!(indices, occurrence)
                i = occurrence +1
            end
        end
    end
    return indices
end

julia> @time findsequence(rand(1:9, 1000), [2,3])
    0.000036 seconds (29 allocations: 8.766 KB)
    16-element Array{Int64,1}:
   80
  118
  138
  158
  234
  243
  409
  470
  539
  589
  619
  629
  645
  666
  762
  856
Letterpress answered 2/4, 2016 at 7:56 Comment(1)
Yes, that's very useful. I also wasn't aware of Base._searchindex, I'll have to benchmark it! I think an iterator would be good as well, so as not to create a potentially large vector (could be up to the length of seq).Minor
A
0

Here is a more up-to-date implementation using findall

function issubsequence(A, B)
    B1inA = findall(isequal(B[1]), A) # indices of the first element of b occuring in a
    matchFirstIndex = [] # Saves the first index in A of the occurances
    for i in B1inA
        if length(A[i:end]) < length(B) continue end
        if A[i:i + length(B) - 1] == B
            push!(matchFirstIndex, i)
        end
    end
    return matchFirstIndex
end

I get a similar runtime to @daycaster

@time issubsequence(rand(1:9, 1000), [2,3])
  0.000038 seconds (111 allocations: 20.844 KiB)
7-element Vector{Any}:
  57
 427
 616
 644
 771
 855
 982
Asepsis answered 10/8, 2022 at 10:19 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.