Writing a portable SSE/AVX version of std::copysign
I am currently writing a vectorized version of the QR decomposition (linear system solver) using SSE and AVX intrinsics. One of the substeps requires selecting the sign of a value to be opposite/equal to the sign of another value. In the serial version, I used std::copysign for this. Now I want to create a similar function for SSE/AVX registers. Unfortunately, the STL uses a built-in function for that, so I can't just copy the code and turn it into SSE/AVX instructions.

I have not tried it yet (so I have no code to show for now), but my simple approach would be to create a register with all values set to -0.0 so that only the sign bit is set. Then I would use an AND operation on the source to find out if its sign bit is set or not. The result of this operation would be either 0.0 or -0.0, depending on the sign of the source. With the result, I would create a bitmask (using logic operations) which I can combine with the target register (using another logic operation) to set the sign accordingly.
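
In scalar integer terms, the idea would look roughly like this (just a sketch to illustrate the plan, assuming IEEE-754 single-precision floats; the names are placeholders):

#include <cstdint>
#include <cstring>

float CopySignScalar(float srcSign, float srcValue)
{
    std::uint32_t s, v;
    std::memcpy(&s, &srcSign, sizeof(s));  // bit-cast without violating strict aliasing
    std::memcpy(&v, &srcValue, sizeof(v));
    std::uint32_t r = (s & 0x80000000u)    // sign bit of srcSign
                    | (v & 0x7FFFFFFFu);   // magnitude of srcValue
    float result;
    std::memcpy(&result, &r, sizeof(result));
    return result;
}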

However, I am not sure if there isn't a smarter way to solve this. If there is a built-in function for fundamental data types like floats and doubles, maybe there is also an intrinsic that I missed. Any suggestions?

Thanks in advance

EDIT:

Thanks to "chtz" for this useful link:

https://godbolt.org/z/oY0f7c

So basically std::copysign compiles to a sequence of 2 AND operations and a subsequent OR. I will reproduce this for SSE/AVX and post the result here in case somebody else needs it some day :)

EDIT 2:

Here is my working version:

#include <immintrin.h>

__m128 CopySign(__m128 srcSign, __m128 srcValue)
{
    // Extract the sign bit from srcSign
    const __m128 mask0 = _mm_set1_ps(-0.f);
    __m128 tmp0 = _mm_and_ps(srcSign, mask0);

    // Extract the number without its sign bit (abs(srcValue))
    __m128 tmp1 = _mm_andnot_ps(mask0, srcValue);

    // Merge the sign bit with the number and return
    return _mm_or_ps(tmp0, tmp1);
}

Tested it with:

__m128 a = _mm_setr_ps(1, -1, -1, 1);
__m128 b = _mm_setr_ps(-5, -11, 3, 4);

__m128 c = CopySign(a, b);

// U32 and simd::GetValue are helpers from my own library; GetValue(c, i) reads the i-th float of c
for (U32 i = 0; i < 4; ++i)
    std::cout << simd::GetValue(c, i) << std::endl;

The output is as expected:

5
-11
-3
4

However, I also tried the version from the disassembly where

__m128 tmp1 = _mm_andnot_ps(mask0, srcValue);

is replaced with:

const __m128 mask1 = _mm_set1_ps(NAN);
__m128 tmp1 = _mm_and_ps(srcValue, mask1);

The results are quite strange:

4
-8
-3
4

Depending on the chosen numbers, the magnitude is sometimes okay and sometimes not. The sign is always correct. It seems like NaN is not the bitwise NOT of -0.0 for some reason. I remember that I had some issues before when I tried to set register values to NaN or specific bit patterns. Maybe somebody has an idea about the origin of the problem?

EDIT 3:

As 'Maxim Egorushkin' clarified in the comments of his answer, my expectation that NaN is the bitwise NOT of -0.0 is wrong. NaN is not a unique bit pattern (see https://steve.hollasch.net/cgindex/coding/ieeefloat.html).
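
To illustrate (a quick check of my own): the bitwise NOT of -0.0f is the specific pattern 0x7FFFFFFF, which is only one of many possible NaNs, while the NAN macro typically expands to a quiet NaN with pattern 0x7FC00000. ANDing a value with that pattern clears most mantissa bits, e.g. -5 (0xC0A00000) AND 0x7FC00000 gives 0x40800000 = 4.0, which matches the garbled output above:

#include <cmath>
#include <cstdint>
#include <cstdio>
#include <cstring>

int main()
{
    float neg_zero = -0.0f;
    std::uint32_t neg_zero_bits;
    std::memcpy(&neg_zero_bits, &neg_zero, sizeof(neg_zero_bits));

    float nan_macro = NAN;
    std::uint32_t nan_bits;
    std::memcpy(&nan_bits, &nan_macro, sizeof(nan_bits));

    std::printf("~(-0.0f) bits: 0x%08X\n", static_cast<unsigned>(~neg_zero_bits)); // 0x7FFFFFFF
    std::printf("NAN bits:      0x%08X\n", static_cast<unsigned>(nan_bits));       // typically 0x7FC00000
}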

Thank you very much to all of you!

Basinet answered 10/9, 2019 at 12:32 Comment(15)
Welcome to Stack Overflow; you should try things first. As a hint for future questions, please read stackoverflow.com/help/how-to-ask. And please do not assume someone will write code for you.Behling
@Behling The OP was asking about the wisdom of their proposed approach, not how to code it or whether it would work or not.Tucker
Unless you have any a priori knowledge about your inputs, you won't get better than one andnps, one andps, and one orps. If you know that your x is always positive, you can save the andnps operation. If you already have a register with abs(x) and one with -abs(x), you could do a single blendvps (but I think this is only worth it for Zen)Katy
I do not expect somebody to write code for me. Sorry if it sounds that way. It is more that I think there might be a much simpler way (an existing intrinsic, a single logic operation) than what I described. I will start writing the code in a couple of hours and edit it into my post. However, it doesn't make sense to write code if somebody tells me 'just use _mm_<copysign>_ps for that' or 'just use XOR with xyz to get what you want'Basinet
Btw: You should also always check, what your compiler already does for you: godbolt.org/z/oY0f7cKaty
@chtz: Unfortunately, there is no a priori knowledge about the inputs. So, as I feared, I have to use multiple logic operations. I will write the function and edit it into the original post as soon as it is done. Maybe you guys see some potential for optimizations. However, if it is really just 3 operations, it would probably take only about 3 cycles, which isn't too bad.Basinet
Can't use the blend instructions for this; they're not bit-granular, they select whole floats at once (or bytes at the narrowest). So it's back to the old and/and(n)/or sequence.Hauptmann
On all Intel CPUs (that I know of), bit-operations can run on any of ports 0, 1, and 5, so you should get a throughput of about 1 cycle (latency is another question).Katy
@chtz: Wow, thanks for the link. Of course, makes total sense to take a look at what the compiler does.Basinet
If you __restrict the output (or the compiler knows itself that input and output do not overlap), clang also happily auto-vectorizes this for you (gcc apparently does not in this case): godbolt.org/z/NH7oMFKaty
NaN is any sign bit, followed by the maximum exponent (all ones), followed by a non-zero mantissa (infinity has a zero mantissa). I.e. the (non-existing bitwise) ~-0. is a NaN, but a NaN is not necessarily (non-existing bitwise) ~-0..Schroder
en.m.wikipedia.org/wiki/NaN says: "a bit-wise IEEE floating-point standard single precision (32-bit) NaN would be: s111 1111 1xxx xxxx xxxx xxxx xxxx xxxx where s is the sign (most often ignored in applications) and the x sequence represents a non-zero number (the value zero encodes infinities). The first bit from x is used to determine the type of NaN: "quiet NaN" or "signaling NaN". The remaining bits encode a payload (most often ignored in applications)."Schroder
@Maxim Egorushkin: Very interesting. Thanks again for the additional information.Basinet
@chtz: Actually, I was wrong about the a priori information. Since the value of interest is the square root of a sum of squares, my x is always positive and I can drop the _mm_andnot_ps. Maybe I will use a template parameter to choose whether the _mm_andnot_ps should be used or not. I wouldn't have seen this possible micro-optimization if you hadn't mentioned it.Basinet
with -march=skylake-avx512 it generates just 1 instruction: vpternlogdCrop
AVX versions for float and double:

#include <immintrin.h>

__m256 copysign_ps(__m256 from, __m256 to) {
    constexpr float signbit = -0.f;
    auto const avx_signbit = _mm256_broadcast_ss(&signbit);
    return _mm256_or_ps(_mm256_and_ps(avx_signbit, from), _mm256_andnot_ps(avx_signbit, to)); // (avx_signbit & from) | (~avx_signbit & to)
}

__m256d copysign_pd(__m256d from, __m256d to) {
    constexpr double signbit = -0.;
    auto const avx_signbit = _mm256_broadcast_sd(&signbit);
    return _mm256_or_pd(_mm256_and_pd(avx_signbit, from), _mm256_andnot_pd(avx_signbit, to)); // (avx_signbit & from) | (~avx_signbit & to)
}
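
A quick usage sketch (example values of my own):

__m256 from = _mm256_setr_ps(-1, 1, -1, 1, -1, 1, -1, 1);
__m256 to   = _mm256_setr_ps( 2, 3,  4, 5,  6, 7,  8, 9);

float out[8];
_mm256_storeu_ps(out, copysign_ps(from, to));
// out now holds {-2, 3, -4, 5, -6, 7, -8, 9}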

See the generated assembly and the Intel Intrinsics Guide.

With AVX2 avx_signbit can be generated with no constants:

__m256 copysign2_ps(__m256 from, __m256 to) {
    auto a = _mm256_castps_si256(from);
    // cmpeq(a, a) produces all-ones; shifting left by 31 leaves only the sign bit of each lane set
    auto avx_signbit = _mm256_castsi256_ps(_mm256_slli_epi32(_mm256_cmpeq_epi32(a, a), 31));
    return _mm256_or_ps(_mm256_and_ps(avx_signbit, from), _mm256_andnot_ps(avx_signbit, to)); // (avx_signbit & from) | (~avx_signbit & to)
}

__m256d copysign2_pd(__m256d from, __m256d to) {
    auto a = _mm256_castpd_si256(from);
    // same trick with 64-bit lanes: all-ones shifted left by 63
    auto avx_signbit = _mm256_castsi256_pd(_mm256_slli_epi64(_mm256_cmpeq_epi64(a, a), 63));
    return _mm256_or_pd(_mm256_and_pd(avx_signbit, from), _mm256_andnot_pd(avx_signbit, to)); // (avx_signbit & from) | (~avx_signbit & to)
}

Still though, both clang and gcc calculate avx_signbit at compile time and replace it with constants loaded from the .rodata section, which is, IMO, sub-optimal.

Schroder answered 10/9, 2019 at 14:11 Comment(11)
Is it bad or unreliable to use -0.0 instead of that integer constant and the memcpy?Hauptmann
@harold You are right, -0. generates exactly the same code. I was hoping for the compiler to generate movss and movsd instructions without using constants in the data section; however, the compiler generated those avx_signbit constants in .data anyway.Schroder
Thanks for your answer. That is basically what I added to my original post. I just use _mm_set1_ps instead of broadcast. Is there any reason to favour one over the other? And maybe you could have a look at the problem with 'NaN' that I added to my question.Basinet
@Maxim Egorushkin - Thank you for the clarification. That explains why I get incorrect results with the second version.Basinet
With AVX512 you can use vpternlogd as a bit-blend to copy just the sign bits in one instruction. (Still need a set1(-0.0) mask though).Abhorrence
@wychmaster: _mm_broadcast_ss() gets GCC to compress the constant, vs. foolishly repeating the same thing for 32 bytes. godbolt.org/z/8HN7kl. Clang already does this optimization for set1_ps(). But clang shoots itself in the foot by transforming andnot into and and inventing a 2nd constant, and compressing them to scalar means neither of them are used as a memory-source operand, instead loaded separately with broadcast loads. godbolt.org/z/nqS7HuAbhorrence
@PeterCordes Yep, IMO, those extra constants in .data are a pessimization.Schroder
They're in .rodata of course, but yes two loads are clearly worse than one in this case, especially with AVX for non-destructive operations.Abhorrence
@PeterCordes Added AVX2 versions, the compilers are still being too smart.Schroder
Usually you'd use this inside a loop where the compiler could hoist the loads anyway. One broadcast load from .rodata is probably good, and very likely to hit in at least L3 cache, or L2 if it's used often enough to make much perf difference.Abhorrence
@PeterCordes Thank you for the explanation. However, I have to admit that I am not so familiar with assembly. Even though I understand that an additional instruction (clang - broadcast) reduces performance, I am not really sure what the problem with GCC is. I see that the constant for _mm_set1_ps has 4 values instead of 1. Does that mean that all values are loaded one after another, or what is the problem here? And where do I find, or what are, .data and .rodata?Basinet
Here's a version that I think is slightly better than the accepted answer if you target icc:

__m256d copysign_pd(__m256d from, __m256d to) {
    __m256d const avx_sigbit = _mm256_set1_pd(-0.);
    return _mm256_or_pd(_mm256_and_pd(avx_sigbit, from), _mm256_andnot_pd(avx_sigbit, to));
}

It uses _mm256_set1_pd rather than a broadcast intrinsic. On clang and gcc this is mostly a wash, but on icc the broadcast version actually writes a constant to the stack and then broadcasts from it, which is ... terrible.

Godbolt showing the AVX-512 code; adjust -march= to -march=skylake to see the AVX2 code.

Here's an untested AVX-512 version which uses the ternary-logic intrinsic directly and compiles down to a single vpternlogq instruction on icc and clang (gcc includes a separate broadcast):

__m512d copysign_pd_alt(__m512d from, __m512d to) {
    const __m512i sigbit = _mm512_castpd_si512(_mm512_set1_pd(-0.));
    // imm8 0xE4 selects, for each bit position: sigbit ? from : to
    return _mm512_castsi512_pd(_mm512_ternarylogic_epi64(_mm512_castpd_si512(from), _mm512_castpd_si512(to), sigbit, 0xE4));
}

You could make a 256-bit version of this for when AVX-512 is enabled but you're dealing with __m256* vectors.
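
A sketch of that 256-bit variant (untested, my own naming; requires AVX-512VL and uses the same 0xE4 truth table):

__m256d copysign_pd_vl(__m256d from, __m256d to) {
    const __m256i sigbit = _mm256_castpd_si256(_mm256_set1_pd(-0.));
    // for each bit position: result = sigbit ? from : to
    return _mm256_castsi256_pd(_mm256_ternarylogic_epi64(
            _mm256_castpd_si256(from), _mm256_castpd_si256(to), sigbit, 0xE4));
}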

Tasker answered 12/7, 2020 at 21:11 Comment(4)
With AVX-512, I think vpternlogd can do this in one instruction, using the same mask constant. Oh, some compilers already optimize your code to that, but not all. Perhaps worth doing that explicitly for an AVX512 version.Abhorrence
@PeterCordes - yes, clang gets it down to 1 ternlog, icc gets close (but only takes the simple approach of compressing 2 bitops to 1 ternlog, leaving one bit op still there), gcc fails.Tasker
@PeterCordes - added the AVX-512 version, but it took me 15 frigging minutes to look up the magic constants to use to calculate the vpternlogq constants. Oddly, clang needed the epi64 version of ternlog, not the epi32 one, to generate a broadcast. Without a kmask, those are identical operations, no? Neither gcc nor icc required that (gcc seems to create a 128-bit constant but then only broadcasts from 64 bits of it).Tasker
Yes, they're identical except for the broadcast element size. Compilers are dumb :/ (or their internals were designed before broadcast memory constants were possible, and haven't yet gotten good mechanisms for decisions). At least they're doing better than they used to, not making full 256 or 512-bit memory constants when the only uses are with broadcast ops.Abhorrence
