Is integer vectorization accuracy / precision of integer division CPU-dependent?
I tried to vectorize the premultiplication of 64-bit colors with 16-bit integer ARGB channels.

I quickly realized that due to the lack of accelerated integer division support, I need to convert my values to float and use some SSE2/SSE4.1 intrinsics explicitly to get the best performance. Still, I wanted to keep the generic, non-hardware-specific version as a fallback (I know it's currently slower than some vanilla operations, but it would provide forward compatibility for possible future improvements).
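To sketch what I mean (hypothetical code, not my actual implementation; the method name PremultiplyScaled is made up), the accelerated path would convert the channels to float, scale, and convert back, while the generic path stays as a fallback:

using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static Vector128<uint> PremultiplyScaled(Vector128<uint> rgb, uint alpha)
{
    if (Sse2.IsSupported)
    {
        // Accelerated path: convert to float, scale by alpha / 65535, truncate back.
        // (Rounding may need extra care to get exact results; this only shows the shape of the idea.)
        Vector128<float> f = Sse2.ConvertToVector128Single(rgb.AsInt32());
        f = Sse.Multiply(f, Vector128.Create(alpha / 65535f));
        return Sse2.ConvertToVector128Int32WithTruncation(f).AsUInt32();
    }

    // Generic fallback: plain Vector128 arithmetic, no hardware-specific intrinsics.
    return rgb * alpha / Vector128.Create(0xFFFFu);
}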

However, the results are incorrect on my machine.

A very minimal repro:

using System;
using System.Runtime.Intrinsics;

// Test color with 50% alpha
(ushort A, ushort R, ushort G, ushort B) c = (0x8000, 0xFFFF, 0xFFFF, 0xFFFF);

// Minimal version of the fallback logic if HW intrinsics cannot be used:
Vector128<uint> v = Vector128.Create(c.R, c.G, c.B, 0u);
v = v * c.A / Vector128.Create(0xFFFFu);
var cPre = (c.A, (ushort)v[0], (ushort)v[1], (ushort)v[2]);

// Original color:
Console.WriteLine(c); // prints (32768, 65535, 65535, 65535)

// Expected premultiplied color:   (32768, 32768, 32768, 32768)
Console.WriteLine(cPre); // prints (32768, 32769, 32769, 32769)
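For reference, this is the scalar math each lane is supposed to perform (just to show where the expected 32768 comes from):

// Scalar reference: premultiplied channel = channel * alpha / 0xFFFF
// 0xFFFF * 0x8000 / 0xFFFF = 0x8000 = 32768
uint expected = (uint)c.R * c.A / 0xFFFFu;
Console.WriteLine(expected); // 32768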

I tried to determine which instructions are emitted that cause the inaccuracy, but I was really surprised to see that the results are correct on SharpLab. On the other hand, the issue is reproducible on .NET Fiddle.

Is this something that can be expected on some platforms, or should I report it as a bug in the runtime repo?


Update

Never mind, this is clearly a bug. Using other values causes totally wrong results:

using System;
using System.Numerics;
using System.Runtime.Intrinsics;

(ushort A, ushort R, ushort G, ushort B) c = (32768, 65535, 32768, 16384);

Vector128<uint> v1 = Vector128.Create(c.R, c.G, c.B, 0u);
v1 = v1 * c.A / Vector128.Create(0xFFFFu);

// prints <32769, 49152, 57344, 0> instead of <32768, 16384, 8192, 0>
Console.WriteLine(v1);

// Also for the older Vector<T>
Span<uint> span = stackalloc uint[Vector<uint>.Count];
span[0] = c.R;
span[1] = c.G;
span[2] = c.B;
Vector<uint> v2 = new Vector<uint>(span) * c.A / new Vector<uint>(0xFFFF);

// prints <32769, 49152, 57344, 0, 0, 0, 0, 0> on my machine
Console.WriteLine(v2);

In the end I realized that the issue is with the multiplication: if I replace * c.A with the constant expression * 32768, the result is correct. For some reason the ushort value is not correctly extracted/masked(?) out of the packed field. Even Vector128.Create is affected:

(ushort A, ushort R, ushort G, ushort B) c = (32768, 65535, 32768, 16384);

Console.WriteLine(Vector128.Create((int)c.A)); // -32768
Console.WriteLine(Vector128.Create((int)32768)); // 32768
Console.WriteLine(Vector128.Create((int)c.A, (int)c.A, (int)c.A, (int)c.A)); // 32768
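For completeness, the replacement mentioned above, with the literal 32768 in place of c.A (using the same color values as in the snippet above), produces the expected result:

// Using the constant instead of the packed tuple field gives the correct values
Vector128<uint> v3 = Vector128.Create(c.R, c.G, c.B, 0u) * 32768u / Vector128.Create(0xFFFFu);
Console.WriteLine(v3); // <32768, 16384, 8192, 0>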

Update 2

In the end I filed an issue in the runtime repo.

Cai answered 14/3, 2023 at 11:37

Comments (7):
In the meantime I realized that SharpLab also produces the wrong result in Debug mode. Strange, because on my computer both Debug and Release builds are incorrect. So I'm starting to believe this is a bug after all. – Rigsdaler
This can be done with integer multiply and shifts, although it's not super efficient since it needs the high half of a 32x32 => 64-bit multiply, and SSE2 / AVX only gives you that with pmuludq, which gives you the full 64-bit results. So only half the input elements per vector. godbolt.org/z/7E9P9aWMh shows GCC and clang using a multiplicative inverse for dividing a vector of 4x uint32_t by 0xffffu. This is exact integer division, no FP involved at any point. – Michalemichalski
I don't see anything in the software fallback path which would cause this... I'd be tempted to either raise this in the dotnet/runtime repo, or ask in #allow-unsafe-blocks on Discord (lots of JIT people hang out there, and they'll tell you if you need to open an issue). – Taboo
If you were using FP, it would probably perform better to multiply by 1.0f/0xffff (or however you write a float constant in C#), although that can't be represented exactly. But with about 23 bits of precision, it might still give the correct 16-bit integer after rounding or truncating to 16-bit. – Michalemichalski
@PeterCordes: yes, bit shifting is faster with vanilla code than division, but the results are not exactly the same. And the floating point version works well now; this question is just about the issue I noticed. – Rigsdaler
All right, this must be a bug; using some other values ends up in totally wrong results. I will update the post soon. – Rigsdaler
@GyörgyKőszeg: The compiler output I linked is for vec / 0xffffu. The results are exact, using a multiplicative inverse like I said, same as compilers do for scalar uint / constant (see Why does GCC use multiplication by a strange number in implementing integer division?). Did you think I meant v >> 16? Oh, you probably thought I meant doing the * part of your expression with an integer multiply; I was talking about using the high half of an integer multiply as part of the division. – Michalemichalski
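A minimal sketch of an exact, FP-free division by 0xFFFF along the lines of the comment above, here using the add-and-shift identity x / 0xFFFF == (x + (x >> 16) + 1) >> 16 (exact for dividends up to 0xFFFF * 0xFFFF) rather than a compiler-style multiplicative inverse; the helper name DivideByMaxUInt16 is made up:

using System.Runtime.Intrinsics;

// Exact floor(x / 0xFFFF) for x <= 0xFFFF * 0xFFFF, with no division and no floating point.
static Vector128<uint> DivideByMaxUInt16(Vector128<uint> x) =>
    Vector128.ShiftRightLogical(x + Vector128.ShiftRightLogical(x, 16) + Vector128.Create(1u), 16);

// e.g. DivideByMaxUInt16(Vector128.Create(0xFFFFu * 0x8000u)) yields <32768, 32768, 32768, 32768>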
