I am currently writing a vectorized version of the QR decomposition (linear system solver) using SSE and AVX intrinsics. One of the substeps requires setting the sign of a value to be equal or opposite to the sign of another value. In the serial version, I used std::copysign for this. Now I want to create a similar function for SSE/AVX registers. Unfortunately, the STL uses a built-in function for that, so I can't just copy the code and turn it into SSE/AVX instructions.
I have not tried it yet (so I have no code to show for now), but my simple approach would be to create a register with all values set to -0.0 so that only the sign bit is set. Then I would use an AND operation on the source to find out whether its sign bit is set. The result of this operation would be either 0.0 or -0.0, depending on the sign of the source. With the result, I would create a bitmask (using logic operations) which I can combine with the target register (using another logic operation) to set the sign accordingly.
However, I am not sure if there isn't a smarter way to solve this. If there is a built-in function for fundamental data types like floats and doubles, maybe there is also an intrinsic that I missed. Any suggestions?
Thanks in advance
EDIT:
Thanks to "chtz" for this useful link:
So basically std::copysign compiles to a sequence of 2 AND operations and a subsequent OR. I will reproduce this for SSE/AVX and post the result here in case somebody else needs it some day :)
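For reference, the "two ANDs and an OR" sequence can be written out for a single float roughly like this. It is only a sketch: it assumes float is a 32-bit IEEE 754 value, and FloatBits, BitsToFloat, CopySignScalar and CheckCopySignScalar are hypothetical helper names, not standard functions. The argument order follows std::copysign (magnitude first, sign second).

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helpers: reinterpret a float as its raw 32 bits and back.
// Assumes float is an IEEE 754 binary32 value.
inline std::uint32_t FloatBits(float f)
{
    std::uint32_t u;
    std::memcpy(&u, &f, sizeof u);
    return u;
}

inline float BitsToFloat(std::uint32_t u)
{
    float f;
    std::memcpy(&f, &u, sizeof f);
    return f;
}

// Same argument order as std::copysign(magnitude, sign).
inline float CopySignScalar(float magnitude, float sign)
{
    const std::uint32_t signMask = 0x80000000u;                      // bit pattern of -0.0f
    const std::uint32_t signBit = FloatBits(sign) & signMask;        // first AND: keep only the sign
    const std::uint32_t absBits = FloatBits(magnitude) & ~signMask;  // second AND: clear the sign
    return BitsToFloat(signBit | absBits);                           // OR: merge both parts
}

// Small usage check.
inline void CheckCopySignScalar()
{
    assert(CopySignScalar(5.0f, -1.0f) == -5.0f);
    assert(CopySignScalar(-3.0f, 2.0f) == 3.0f);
}
```

The vector version only has to apply the same three operations to every lane at once.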
EDIT 2:
Here is my working version:
__m128 CopySign(__m128 srcSign, __m128 srcValue)
{
    // Extract the sign bit from srcSign
    const __m128 mask0 = _mm_set1_ps(-0.0f);
    __m128 tmp0 = _mm_and_ps(srcSign, mask0);
    // Clear the sign bit of srcValue, which yields abs(srcValue)
    __m128 tmp1 = _mm_andnot_ps(mask0, srcValue);
    // Merge the sign bit with the magnitude and return
    return _mm_or_ps(tmp0, tmp1);
}
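In case double precision is needed as well, the same three steps might look like this for __m128d. This is a sketch that assumes SSE2 is available; CopySignPd, CopySignPdToArray and CheckCopySignPd are names made up here for illustration. An AVX variant would presumably have the same shape, using _mm256_and_ps, _mm256_andnot_ps and _mm256_or_ps on __m256.

```cpp
#include <cassert>
#include <emmintrin.h> // SSE2 intrinsics for __m128d

// Hypothetical double-precision counterpart of CopySign:
// takes the sign from srcSign and the magnitude from srcValue.
inline __m128d CopySignPd(__m128d srcSign, __m128d srcValue)
{
    const __m128d signMask = _mm_set1_pd(-0.0);                   // only the sign bit set
    const __m128d sign = _mm_and_pd(srcSign, signMask);           // sign bit of srcSign
    const __m128d magnitude = _mm_andnot_pd(signMask, srcValue);  // abs(srcValue)
    return _mm_or_pd(sign, magnitude);                            // merge
}

// Hypothetical convenience wrapper: works on two doubles in memory,
// so the lanes can be inspected without a custom accessor.
inline void CopySignPdToArray(const double* signs, const double* values, double* out)
{
    const __m128d result = CopySignPd(_mm_loadu_pd(signs), _mm_loadu_pd(values));
    _mm_storeu_pd(out, result);
}

// Small usage check.
inline void CheckCopySignPd()
{
    const double signs[2] = {1.0, -1.0};
    const double values[2] = {-5.0, -11.0};
    double out[2];
    CopySignPdToArray(signs, values, out);
    assert(out[0] == 5.0);
    assert(out[1] == -11.0);
}
```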
Tested it with:
__m128 a = _mm_setr_ps(1, -1, -1, 1);
__m128 b = _mm_setr_ps(-5, -11, 3, 4);
__m128 c = CopySign(a, b);
for (U32 i = 0; i < 4; ++i)
    std::cout << simd::GetValue(c, i) << std::endl;
The output is as expected:
5
-11
-3
4
However, I also tried the version from the disassembly where
__m128 tmp1 = _mm_andnot_ps(mask0, srcValue);
is replaced with:
const __m128 mask1 = _mm_set1_ps(NAN);
__m128 tmp1 = _mm_and_ps(srcValue, mask1);
The results are quite strange:
4
-8
-3
4
Depending on the chosen numbers, the number is sometimes okay and sometimes not. The sign is always correct. It seems like NaN is not !(-0.0) for some reason. I remember that I had some issues before when I tried to set register values to NaN or specific bit patterns. Maybe somebody has an idea about the origin of the problem?
EDIT 3:
As 'Maxim Egorushkin' clarified in the comments of his answer, my expectation about NaN being !(-0.0) is wrong. NaN seems not to be a unique bit pattern (see https://steve.hollasch.net/cgindex/coding/ieeefloat.html).
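The strange numbers above can be reproduced with plain integer arithmetic. The sketch below assumes that the NAN macro expands to the usual quiet NaN pattern 0x7FC00000 (this is implementation-specific, so treat it as an assumption) and that float is IEEE 754 binary32. ANDing with that pattern keeps the exponent and only the topmost mantissa bit, so 5 becomes 4 and 11 becomes 8, while 3 and 4 happen to survive. The mask that really clears only the sign bit is 0x7FFFFFFF, which is presumably the constant the disassembly displayed as NaN. All helper names (FloatBits, BitsToFloat, AndWithBits, AbsMaskPs, AbsPsToArray, CheckNaNMask) are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <emmintrin.h> // SSE2: _mm_set1_epi32, _mm_castsi128_ps

inline std::uint32_t FloatBits(float f)
{
    std::uint32_t u;
    std::memcpy(&u, &f, sizeof u);
    return u;
}

inline float BitsToFloat(std::uint32_t u)
{
    float f;
    std::memcpy(&f, &u, sizeof f);
    return f;
}

// Assumption: bit pattern that the NAN macro typically produces (default quiet NaN).
constexpr std::uint32_t kQuietNaNBits = 0x7FC00000u;
// Every bit except the sign bit, i.e. the bitwise complement of -0.0f.
constexpr std::uint32_t kAbsMaskBits = 0x7FFFFFFFu;

// Bitwise AND of a float's representation with an integer mask.
inline float AndWithBits(float value, std::uint32_t mask)
{
    return BitsToFloat(FloatBits(value) & mask);
}

// Builds the abs mask from an integer constant instead of going through NAN.
inline __m128 AbsMaskPs()
{
    return _mm_castsi128_ps(_mm_set1_epi32(0x7FFFFFFF));
}

// Computes abs() of four floats in memory using the mask above.
inline void AbsPsToArray(const float* in, float* out)
{
    _mm_storeu_ps(out, _mm_and_ps(_mm_loadu_ps(in), AbsMaskPs()));
}

// Small usage check.
inline void CheckNaNMask()
{
    assert(AndWithBits(-5.0f, kQuietNaNBits) == 4.0f);   // low mantissa bits are lost
    assert(AndWithBits(-11.0f, kQuietNaNBits) == 8.0f);
    assert(AndWithBits(-11.0f, kAbsMaskBits) == 11.0f);  // only the sign is cleared
}
```

With the integer-built mask, _mm_and_ps(srcValue, AbsMaskPs()) should behave the same as the _mm_andnot_ps(mask0, srcValue) line in the working version.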
Thank you very much to all of you!
andnps, andps and orps each. If you know that your x is always positive, you can save the andnps operation. If you already have a register with abs(x) and one with -abs(x), you could do a single blendvps (but I think this is only worth it for Zen) – Katy
P015, so you should get a throughput of about 1 cycle (latency is another question). – Katy
__restrict the output (or the compiler knows itself that input and output do not overlap), clang also happily auto-vectorizes this for you (gcc apparently does not in this case): godbolt.org/z/NH7oMF – Katy
NaN is any sign bit, followed by the maximum exponent (all ones), followed by non-zero mantissa (infinity is zero mantissa). I.e. (non-existing bitwise) ~-0. is a NaN, however a NaN is not necessarily (non-existing bitwise) ~-0. – Schroder
-march=skylake-512 it generates just 1 instruction: vpternlogd – Crop