Use C# Vector<T> SIMD to find index of matching element
Asked Answered
K

2

7

Using C#'s Vector<T>, how can we most efficiently vectorize the operation of finding the index of a particular element in a set?

As constraints, the set will always be a Span<T> of an integer primitive, and it will contain at most 1 matching element.

I have come up with a solution that seems alright, but I'm curious if we can do better. Here is the approach:

  1. Create a Vector<T> consisting only of the target element, in each slot.
  2. Use Vector.Equals() between the input set vector and the vector from the previous step, to get a mask containing a 1 in the single matching slot (or only 0s if there is no match).
  3. Using a pre-initialized vector containing 1-based indexes (1, 2, 3, 4, ...), call Vector.Dot() between that vector and the mask from the previous step. Each index will be multiplied by 0, except the potential matching index, which will be multiplied by 1. What we get back is the sum of those multiplications, which is either 0, or the 1-based index of the matching element.
  4. If the result was 0, return -1 for no match. Otherwise, subtract one from the result to make it 0-based, and return that.

        // One-time initialized vector containing { 1, 2, 3, 4, ... }
        Vector<ushort> indexes = MemoryMarshal.Cast<ushort, Vector<ushort>>(Enumerable.Range(1, Vector<ushort>.Count).Select(index => (ushort)index).ToArray())[0];
    
        // The input set and the element to search for
        Span<ushort> set = stackalloc ushort[]{ 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25 };
        ushort element = 22;
    
        // Interpret input set as a sequence of vectors (set is assumed to have length power of two for brevity)
        var setVectors = MemoryMarshal.Cast<ushort, Vector<ushort>>(set);
    
        // Create a vector that contains the target element in each slot
        var elementVector = new Vector<ushort>(element);
    
        // Loop per vector rather than per element
        foreach (var vector in setVectors)
        {
            // Get a mask that has a 1 in the single matching slot, or only 0s
            var mask = Vector.Equals(vector, elementVector);
    
            // Get the dot product of the mask and the indexes
            // This will multiple each index by 0, or by 1 if it is the matching one, and return their sum, i.e. the matching index or 0
            // Note that the indexes are deliberately 1-based, to distinguished from 0 (no match)
            var index = Vector.Dot(indexes, mask);
    
            // Either return 0 for no match, or reduce the index by 1 to get the 0-based index
            return index == 0 ? -1 : index - 1;
        }
    
Keverne answered 9/7, 2019 at 14:59 Comment(3)
That loop ends unconditionally after the first iteration, that doesn't seem rightAudun
@harold Oops, I forgot about the loop as I was writing! My example used a single vector, hehe. Of course, it should continue as long as there is no match.Keverne
Borrowing a technique from JsonReaderHelper.IndexOfOrLessThan: If we regularly expect to have multiple vectors without a match in the first one, we could optimize some Vector.Dot calls away by short-circuiting if Vector<ushort>.Zero.Equals(mask).Keverne
S
3

As I can see the simple Span<char>.IndexOf already is using Intrinsics for searching a simple value. You don't even need to cast to char to use it, since MemoryExtensions.IndexOf only cares about size and Unsafe.SizeOf<ushort>() == sizeof(char) !

Also in JsonReaderHelper.IndexOfOrLessThan you will find a more complex Vectorization example for searching. It is using byte searching, but I am sure that you can adapt it to your needs if Span<ushort>.IndexOf is not suitable.

Sipes answered 14/12, 2019 at 11:1 Comment(4)
That's very helpful! And the JsonReaderHelper.IndexOfOrLessThan, although private, does offer a byte search version. (My example had ushorts, but all integer sizes could be interesting.)Keverne
So the approach we can borrow from JsonReaderHelper.IndexOfOrLessThan boils down to this: Vector.Equals on the two vectors. Skip to the next vector if (Vector<byte>.Zero.Equals(result)). Otherwise, interpret the vector as a Vector<ulong>. For each ulong, if it is 0, continue. As soon as one is non-zero, use bit twiddling to extract the position of the match within that ulong. Although the non-vectorized work being done here is well-optimized, wouldn't it be faster to extract the index using the dot product, as proposed in my original post? Or is Vector.Dot that slow?Keverne
The LocateFirstFoundByte method only does a xor, a multiplication and a shift. I doubt Vector.Dot will be faster than thisSipes
There are actually two overloads. The top one is used, and it calls the bottom one. The top overload first treats the Vector as a Vector<ulong>, and finds the first non-zero ulong to pass to the bottom overload. The question stands!Keverne
A
2

The x86 asm you want the compiler to generate is compare-for-equal (pcmpeqb), pmovmskb or movmskps (vector to bitmask with 1-byte or 4-byte elements) and then if the mask is non-zero, bit-scan for first set bit (bsf or tzcnt).

That'll be more efficient than an integer dot product!!

You already have the compare-for-equal, and I think I've seen other C# Q&As with an intrinsic for vector->bitmap. If someone wants to edit this answer or post their own with C# that compiles / JITs to this asm, please do so. I don't know C#, I'm just here for the x86 SIMD.

Algometer answered 9/7, 2019 at 15:12 Comment(5)
Vector=>bitmap is something I'm very interested in, in its own right! I'd really like to see how to manage that in C#.Keverne
Come to think of it, MemoryMarshal.Cast<Vector<ushort>, ulong>() gets close, interpreting the vector as a Span<ulong>. But it does not narrow it down to a bitmap.Keverne
You can get that (v)movmskps from System.Runtime.Intrinsics.X86 (here), but that's new in .NET Core 3Audun
Is the preferred approach for x64 different, btw?Keverne
@Timo: oh, sorry about the terminology confusion. Outside of Microsoft/Windows, "x86" includes both IA-32 and x86-64. e.g. "x86 SIMD has efficient vector -> integer, unlike ARM where many ARM microarchitectures stall the pipeline on such instructions". I was assuming you'd actually use x86-64, not legacy IA-32. But as far as how to vectorize this, it's the same regardless of 32 vs. 64-bit. The SIMD operations are the same, and this doesn't need more than 8 XMM/YMM registers so it doesn't particularly benefit from 64-bit mode.Algometer

© 2022 - 2024 — McMap. All rights reserved.