I tried to vectorize the premultiplication of 64-bit colors with 16-bit integer ARGB channels.
I quickly realized that, due to the lack of accelerated integer division, I need to convert my values to float and use some SSE2/SSE4.1 intrinsics explicitly for the best performance. Still, I wanted to keep the non-hardware-specific generic version as a fallback solution (I know it's currently slower than some vanilla operations, but it would benefit from possible future improvements).
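For context, the accelerated path looks roughly like this (a simplified sketch of the idea rather than my actual code):

using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// Sketch only; assumes Sse2.IsSupported (otherwise the generic fallback below is used).
// Converting the channels to float turns the division by 0xFFFF into a multiplication.
(ushort A, ushort R, ushort G, ushort B) c = (0x8000, 0xFFFF, 0xFFFF, 0xFFFF);
Vector128<int> rgb = Vector128.Create((int)c.R, c.G, c.B, 0);
Vector128<float> scale = Vector128.Create(c.A / 65535f);              // alpha as a 0..1 factor
Vector128<float> scaled = Sse.Multiply(Sse2.ConvertToVector128Single(rgb), scale);
Vector128<int> premultiplied = Sse2.ConvertToVector128Int32(scaled);  // rounds to nearest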
However, the results of the generic fallback are incorrect on my machine.
A very minimal repro:
using System;
using System.Runtime.Intrinsics;

// Test color with 50% alpha
(ushort A, ushort R, ushort G, ushort B) c = (0x8000, 0xFFFF, 0xFFFF, 0xFFFF);
// Minimal version of the fallback logic if HW intrinsics cannot be used:
Vector128<uint> v = Vector128.Create(c.R, c.G, c.B, 0u);
v = v * c.A / Vector128.Create(0xFFFFu);
var cPre = (c.A, (ushort)v[0], (ushort)v[1], (ushort)v[2]);
// Original color:
Console.WriteLine(c); // prints (32768, 65535, 65535, 65535)
// Expected premultiplied color: (32768, 32768, 32768, 32768)
Console.WriteLine(cPre); // prints (32768, 32769, 32769, 32769)
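For reference, this is the scalar math the vector code is meant to mirror (a hypothetical helper I use here only to derive the expected values):

// Hypothetical scalar reference: channel * alpha / 0xFFFF, computed in 32 bits
static ushort PremultiplyScalar(ushort channel, ushort alpha)
    => (ushort)((uint)channel * alpha / 0xFFFF);

// PremultiplyScalar(0xFFFF, 0x8000) == 0x8000, which is why every RGB channel
// of the expected premultiplied color is 32768.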
I tried to determine which instructions are emitted that cause the inaccuracy, but I was really surprised to see that the results are correct on SharpLab. On the other hand, the issue is reproducible on .NET Fiddle.
Is this something that's expected on some platforms, or should I report it as a bug in the runtime repo?
Update
Never mind, this is clearly a bug. Using other values causes totally wrong results:
using System;
using System.Numerics;
using System.Runtime.Intrinsics;
(ushort A, ushort R, ushort G, ushort B) c = (32768, 65535, 32768, 16384);
Vector128<uint> v1 = Vector128.Create(c.R, c.G, c.B, 0u);
v1 = v1 * c.A / Vector128.Create(0xFFFFu);
// prints <32769, 49152, 57344, 0> instead of <32768, 16384, 8192, 0>
Console.WriteLine(v1);
// Also for the older Vector<T>
Span<uint> span = stackalloc uint[Vector<uint>.Count];
span[0] = c.R;
span[1] = c.G;
span[2] = c.B;
Vector<uint> v2 = new Vector<uint>(span) * c.A / new Vector<uint>(0xFFFF);
// prints <32769, 49152, 57344, 0, 0, 0, 0, 0> on my machine
Console.WriteLine(v2);
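For clarity, the expected values in the comment above come from the same scalar formula (plain integer math, truncated):

// 65535 * 32768 / 65535 = 32768
// 32768 * 32768 / 65535 = 1073741824 / 65535 = 16384 (truncated)
// 16384 * 32768 / 65535 = 536870912 / 65535 = 8192 (truncated)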
In the end I realized that the issue is with the multiplication: if I replace * c.A with the constant expression * 32768, then the result is correct. For some reason the ushort value is not correctly extracted/masked(?) out of the packed field. Even Vector128.Create is affected:
(ushort A, ushort R, ushort G, ushort B) c = (32768, 65535, 32768, 16384);
Console.WriteLine(Vector128.Create((int)c.A)); // -32768
Console.WriteLine(Vector128.Create((int)32768)); // 32768
Console.WriteLine(Vector128.Create((int)c.A, (int)c.A, (int)c.A, (int)c.A)); // 32768
Update 2
In the end I filed an issue in the runtime repo.
Comments:

[…] pmuludq, which gives you the full 64-bit results. So only half the input elements per vector. godbolt.org/z/7E9P9aWMh shows GCC and clang using a multiplicative inverse for dividing a vector of 4x uint32_t by 0xffffu. This is exact integer division, no FP involved at any point. – Michalemichalski

#allow-unsafe-blocks on Discord (lots of JIT people hang out there, and they'll tell you if you need to open an issue) – Taboo

1.0f/0xffff (or however you write a float constant in C#), although that can't be represented exactly. But with about 23 bits of precision, it might still give the correct 16-bit integer after rounding or truncating to 16 bits. – Michalemichalski

vec / 0xffffu. The results are exact, using a multiplicative inverse like I said, same as compilers do for scalar uint / constant. Why does GCC use multiplication by a strange number in implementing integer division? Did you think I meant v >> 16? Oh, you probably thought I meant doing the * part of your expression with an integer multiply; I was talking about using the high half of an integer multiply as part of the division. – Michalemichalski
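If I understand the multiplicative-inverse suggestion correctly, the scalar form would be something like the sketch below (my own derivation of the magic constant for division by 0xFFFF, not copied from the Godbolt link, so verify it before relying on it):

// Sketch: exact x / 0xFFFF for any 32-bit x via a multiply-high by a magic constant.
static uint DivideByMaxUInt16(uint x) => (uint)((x * 0x8000_8001UL) >> 47);

// Checked against the values from the question:
// DivideByMaxUInt16(65535u * 32768u) == 32768
// DivideByMaxUInt16(32768u * 32768u) == 16384
// DivideByMaxUInt16(16384u * 32768u) == 8192
// A vectorized version would use Sse2.Multiply (pmuludq) on the even lanes to get
// the 64-bit products, which is what the comment above describes.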