C# - Construct a signal Vector<T> from an integer bitmask
Asked Answered
E

2

3

I have some integer value representing a bitmask, for example 154 = 0b10011010, and I want to construct a corresponding signal Vector<T> instance <0, -1, 0, -1, -1, 0, 0, -1> (note the least/most significant bit order is reversed).

Is there any more performant way than this (beyond unrolling the loop)?

int mask = 0b10011010;// 154
// -1 = (unsigned) 0xFFFFFFFF is the "all true" value
Vector<int> maskVector = new Vector<int>(
    Enumerable.Range(0, Vector<int>.Count)
        .Select(i => (mask & (1 << i)) > 0 ? -1 : 0)
        .ToArray());
// <0, -1, 0, -1, -1, 0, 0, -1>
string maskVectorStr = string.Join("", maskVector);

(Note the debugger is bugged displaying Vector<T> values, showing only half of the components and the rest as zeros, hence my use of string.Join.)

Namely, is there a way with only a single instruction? It is the Single Instruction, Multiple Data (SIMD) datatype after all!

Furthermore, how I can do this when working with the generic Vector<T> version?


A signal vector, or integral mask vector, is used with the ConditionalSelect method to choose between values of two other masks:

//powers of two <1, 2, 4, 8, 16, 32, 64, 128>
Vector<int> ifTrueVector = new Vector<int>(Enumerable.Range(0, Vector<int>.Count).Select(i => 1 << i).ToArray());
Vector<int> ifFalseVector = Vector<int>.Zero;// or some other vector
// <0, 2, 0, 8, 16, 0, 0, 128>
Vector<int> resultVector = Vector.ConditionalSelect(maskVector, ifTrueVector, ifFalseVector);
string resultStr = string.Join("", resultVector);
// our original mask value back
int sum = Vector.Dot(resultVector, Vector<int>.One);

enter image description here


The documentation of ConditionalSelect explicitly says the mask vector has integral values for every overload, but spamming Vector<T>.Zero[0] and Vector<T>.One[0] to get them is surely improper? (And you can get the T version of -1 with (-Vector<T>.One)[0])

P.S. would there also be a corresponding solution to populating with powers of 2?

Encyclical answered 21/12, 2021 at 13:16 Comment(10)
How did you get from 0b10011010 to <0, -1, 0, -1, -1, 0, 0, -1>?Threegaited
Reverse the order. The least significant bit is the first vector component.Encyclical
(Note that your ConditionalSelect could use be an &, I think? That means no need for ifFalseVector. See the implementation)Threegaited
I do not see anything wrong with your code. The operation requires looping 16 times.Rockribbed
This is occurring in a performance-critical part of a rendering engine, which otherwise only performs two ConditionalSelect statements from the bitmask (and sometimes two subtractions), and intrinsics only take a couple CPU cycles.Encyclical
@Threegaited this is simplified for my question. I'm using two nontrivial vectors with ConditionalSelect.Encyclical
Can you use System.Runtime.Intrinsics.X86? If so, there's x86 specific efficient solutionSybyl
As @Alex said, if you can use x86 intrinsics, see is there an inverse instruction to the movemask instruction in intel avx2? for a list of strategies for different mask width / vec element sizes, with C++ intrinsics that you can port.Fried
Just curious, why did you accept no answer?Helminth
I need to spend some quality time testing the provided answers first. They both look quite good.Encyclical
I
4

It is possible to do the complement of it with the vector operations offered by the Vector API, and equivalent variants could be made for any T, for example like this for T = int:

Vector<int> powersOfTwo;

Vector<int> MaskToElements(int mask)
{
    Vector<int> broadcasted = new Vector<int>(mask);
    Vector<int> singlebit = Vector.BitwiseAnd(broadcasted, powersOfTwo);
    return Vector.Equals(singlebit, powersOfTwo);
}

Where powersOfTwo is created like this:

int[] powersOfTwoArray = new int[Vector<int>.Count];
for (int i = 0; i < powersOfTwoArray.Length; i++)
{
    powersOfTwoArray[i] = 1 << i;
}
powersOfTwo = new Vector<int>(powersOfTwoArray);

It's not really generic, even though it would work for different types, largely because powersOfTwo must be precomputed otherwise this function won't make any sense: if there is a scalar loop in it anyway, what's the point.


If the mask is a constant, then with the System.Runtime.Intrinsics.X86 API it can go directly into a Blend, it doesn't need to be turned into a vector mask first, for example:

Vector128.AsInt32(Sse41.Blend(Vector128.AsSingle(a), Vector128.AsSingle(b), (byte)mask));

If the mask is not a constant, that API will still accept it, but it ends up calling a slow fallback. In that case, it's better to make a vector mask and use BlendVariable.

Islander answered 21/12, 2021 at 17:46 Comment(1)
Other strategies for other element widths shown in: is there an inverse instruction to the movemask instruction in intel avx2?. A different strategy is required for small elements, when the bitmask is wider than an element. e.g. 16 bits -> vector of 16x bytes.Fried
T
3

There might be a fancy vector-based way to generate your mask vector, but simply optimizing your current code can speed things up by over an order of magnitude.

Firstly, don't use Linq on hot paths. The number of intermediate object allocations, virtual method calls and delegate invocations going on there is simply unnecessary if you're looking for speed. You can rewrite this as a for loop with no loss of clarity.

Secondly, get rid of that array allocation. Vector<T> has constructors which take a Span<T>, and you can stackalloc one of those.

That gives you some code which looks a bit like this:

int mask = 0b10011010;

Span<int> values = stackalloc int[Vector<int>.Count];
for (int i = 0; i < Vector<int>.Count; i++)
{
    values[i] = (mask & (1 << i)) > 0 ? -1 : 0;
}

var maskVector = new Vector<int>(values);

Interestingly, manually unrolling that loop gives you another significant speed-up:

Span<int> values = stackalloc int[Vector<int>.Count];
values[0] = (mask & 0x1) > 0 ? -1 : 0;
values[1] = (mask & 0x2) > 0 ? -1 : 0;
values[2] = (mask & 0x4) > 0 ? -1 : 0;
values[3] = (mask & 0x8) > 0 ? -1 : 0;
values[4] = (mask & 0x10) > 0 ? -1 : 0;
values[5] = (mask & 0x20) > 0 ? -1 : 0;
values[6] = (mask & 0x40) > 0 ? -1 : 0;
values[7] = (mask & 0x80) > 0 ? -1 : 0;

var maskVector = new Vector<int>(values);

How does this perform? Let's use BenchmarkDotNet:

[MemoryDiagnoser]
public class MyBenchmark
{
    [Benchmark, Arguments(0b10011010)]
    public Vector<int> Naive(int mask)
    {
        Vector<int> maskVector = new Vector<int>(
            Enumerable.Range(0, Vector<int>.Count)
                .Select(i => (mask & (1 << i)) > 0 ? -1 : 0)
                .ToArray());

        return maskVector;
    }

    [Benchmark, Arguments(0b10011010)]
    public Vector<int> Optimised(int mask)
    {
        Span<int> values = stackalloc int[Vector<int>.Count];
        for (int i = 0; i < Vector<int>.Count; i++)
        {
            values[i] = (mask & (1 << i)) > 0 ? -1 : 0;
        }

        var output = new Vector<int>(values);
        return output;
    }

    [Benchmark, Arguments(0b10011010)]
    public Vector<int> Optimised2(int mask)
    {
        Span<int> values = stackalloc int[Vector<int>.Count];
        values[0] = (mask & 0x1) > 0 ? -1 : 0;
        values[1] = (mask & 0x2) > 0 ? -1 : 0;
        values[2] = (mask & 0x4) > 0 ? -1 : 0;
        values[3] = (mask & 0x8) > 0 ? -1 : 0;
        values[4] = (mask & 0x10) > 0 ? -1 : 0;
        values[5] = (mask & 0x20) > 0 ? -1 : 0;
        values[6] = (mask & 0x40) > 0 ? -1 : 0;
        values[7] = (mask & 0x80) > 0 ? -1 : 0;

        var output = new Vector<int>(values);
        return output;
    }
}

public class Program
{
    public static void Main()
    {
        var summary = BenchmarkRunner.Run<MyBenchmark>();
    }
}

This gives the results:

BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19044.1415 (21H2)
Intel Core i7-8565U CPU 1.80GHz (Whiskey Lake), 1 CPU, 8 logical and 4 physical cores
.NET SDK=5.0.101
  [Host]     : .NET 5.0.0 (5.0.20.51904), X64 RyuJIT
  DefaultJob : .NET 5.0.1 (5.0.120.57516), X64 RyuJIT
Method mask Mean Error StdDev Gen 0 Allocated
Naive 154 103.018 ns 2.0509 ns 4.0001 ns 0.0554 232 B
Optimised 154 13.405 ns 0.3004 ns 0.4497 ns - -
Optimised2 154 9.668 ns 0.2827 ns 0.8245 ns - -
Threegaited answered 21/12, 2021 at 13:50 Comment(4)
Dang I was hoping for some obvious, or special-purpose function to do it directly, but also dang good speedup!Encyclical
@Encyclical Turns out unrolling that loop gets you another decent speed-upThreegaited
Is there any hope of generalizing this to the generic Vector<T> and not just int? And what if the value of Vector<int>.Count isn't 8 one day?Encyclical
@Encyclical Yeah, that's where the unrolled loop falls down. But then, if it decreases one day your code's already in trouble. I assume you're thinking of letting T be another integer type? Something like this? dotnetfiddle.net/6zwjXWThreegaited

© 2022 - 2024 — McMap. All rights reserved.