Add saturate 32-bit signed ints intrinsics?

Can someone recommend a fast way to add 32-bit signed integers with saturation, using Intel intrinsics (AVX, SSE4, ...)?

I looked at the intrinsics guide and found _mm256_adds_epi16, but that only adds 16-bit ints. I don't see anything similar for 32 bits; the other add intrinsics seem to wrap around.

Ostiole asked 7/4, 2015 at 18:41 Comment(3)
As mentioned in Is there a way to subtract packed unsigned doublewords, saturated, on x86, using MMX/SSE?, use subus(a, b) == max(a, b) - b with SSE4.1's pmaxud.Chesterfieldian
@phuclv: This question is about signed saturation, which is a harder problem. That link is useful for unsigned saturation, which looks similar but requires a different implementation.Blackfellow
See Signed saturated add of 64-bit ints? for scalar signed saturation with a GNU C builtin to detect signed overflow efficiently. (Using integer flags; if it did auto-vectorize it would need more instructions.)Blackfellow

A signed overflow will happen if (and only if):

  • the signs of both inputs are the same, and
  • the sign of the sum (computed with wrap-around) is different from that of the inputs.

Using C operators: overflow = ~(a^b) & (a^(a+b)) (only the sign bit of overflow matters).
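
For illustration, a minimal scalar sketch of that test (names are just for this example; the add is done in unsigned arithmetic to avoid signed-overflow UB in plain C):

#include <stdint.h>

int32_t adds_i32_scalar( int32_t a, int32_t b )
{
    uint32_t sum = (uint32_t)a + (uint32_t)b;                      // wrap-around add
    uint32_t overflow = ~((uint32_t)a ^ (uint32_t)b) & ((uint32_t)a ^ sum);

    if ( (int32_t)overflow < 0 )                                   // sign bit set => overflow
        return (a < 0) ? INT32_MIN : INT32_MAX;                    // saturate towards the shared sign
    return (int32_t)sum;
}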

Also, if an overflow happens, the saturated result will have the same sign as both inputs. Using the int_min = int_max+1 trick suggested by @PeterCordes, and assuming you have at least SSE4.1 (for blendvps), this can be implemented as:

#include <immintrin.h>

__m128i __mm_adds_epi32( __m128i a, __m128i b )
{
    const __m128i int_max = _mm_set1_epi32( 0x7FFFFFFF );

    // normal result (possibly wraps around)
    __m128i res      = _mm_add_epi32( a, b );

    // If result saturates, it has the same sign as both a and b
    __m128i sign_bit = _mm_srli_epi32(a, 31); // shift sign to lowest bit
    __m128i saturated = _mm_add_epi32(int_max, sign_bit);

    // overflow happened if the inputs have the same sign,
    // but the sign of the result differs:
    __m128i sign_xor  = _mm_xor_si128( a, b );
    __m128i overflow = _mm_andnot_si128(sign_xor, _mm_xor_si128(a,res));

    return _mm_castps_si128(_mm_blendv_ps( _mm_castsi128_ps( res ),
                                          _mm_castsi128_ps(saturated),
                                          _mm_castsi128_ps( overflow ) ) );
}

If blendvps is as fast as (or faster than) a shift plus an addition on your target (also considering port usage), you can of course just blend int_min and int_max directly, using the sign bits of a as the blend mask (see the sketch below). Also, if you have only SSE2 or SSE3, you can replace the last blend by an arithmetic right shift of overflow by 31 bits and manual blending (using and/andnot/or).
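
A minimal sketch of that blend-the-limits variant (function name is just for this sketch; blendvps keys on the sign bit of its mask operand, so a can be used as the mask directly):

__m128i mm_adds_epi32_blend_limits( __m128i a, __m128i b )
{
    const __m128i int_max = _mm_set1_epi32( 0x7FFFFFFF );
    const __m128i int_min = _mm_set1_epi32( 0x80000000 );

    __m128i res = _mm_add_epi32( a, b );

    // select int_min where a is negative, int_max where it is non-negative
    __m128i saturated = _mm_castps_si128( _mm_blendv_ps( _mm_castsi128_ps( int_max ),
                                                         _mm_castsi128_ps( int_min ),
                                                         _mm_castsi128_ps( a ) ) );

    // same overflow test as above
    __m128i sign_xor = _mm_xor_si128( a, b );
    __m128i overflow = _mm_andnot_si128( sign_xor, _mm_xor_si128( a, res ) );

    return _mm_castps_si128( _mm_blendv_ps( _mm_castsi128_ps( res ),
                                            _mm_castsi128_ps( saturated ),
                                            _mm_castsi128_ps( overflow ) ) );
}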

And naturally, with AVX2 this can take __m256i variables instead of __m128i (should be very easy to rewrite).
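
For example, a straightforward 256-bit port could look like this (just a sketch under the same assumptions; the _mm256_* integer intrinsics require AVX2, while _mm256_blendv_ps itself is AVX):

__m256i mm256_adds_epi32( __m256i a, __m256i b )
{
    const __m256i int_max = _mm256_set1_epi32( 0x7FFFFFFF );

    // normal result (possibly wraps around)
    __m256i res       = _mm256_add_epi32( a, b );

    // saturation limit with the same sign as a (int_max, or int_max+1 = int_min)
    __m256i sign_bit  = _mm256_srli_epi32( a, 31 );
    __m256i saturated = _mm256_add_epi32( int_max, sign_bit );

    // overflow: inputs have the same sign, result has a different one
    __m256i sign_xor  = _mm256_xor_si256( a, b );
    __m256i overflow  = _mm256_andnot_si256( sign_xor, _mm256_xor_si256( a, res ) );

    return _mm256_castps_si256( _mm256_blendv_ps( _mm256_castsi256_ps( res ),
                                                  _mm256_castsi256_ps( saturated ),
                                                  _mm256_castsi256_ps( overflow ) ) );
}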

Addendum: If you know the sign of either a or b at compile time, you can directly set saturated accordingly and save both _mm_xor_si128 calculations, i.e., overflow would be _mm_andnot_si128(b, res) for positive a and _mm_andnot_si128(res, b) for negative a (with res = a+b).
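
For illustration, a minimal sketch of the symmetric case where b is known to be non-negative at compile time (function name is hypothetical):

__m128i mm_adds_epi32_b_nonneg( __m128i a, __m128i b )
{
    const __m128i int_max = _mm_set1_epi32( 0x7FFFFFFF );

    __m128i res = _mm_add_epi32( a, b );

    // with b >= 0, overflow can only happen towards int_max, and only when
    // a was non-negative but the result came out negative (only the sign
    // bits of overflow matter for the blend):
    __m128i overflow = _mm_andnot_si128( a, res );

    return _mm_castps_si128( _mm_blendv_ps( _mm_castsi128_ps( res ),
                                            _mm_castsi128_ps( int_max ),
                                            _mm_castsi128_ps( overflow ) ) );
}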

Test case / demo: https://godbolt.org/z/v1bsc85nG

Dianoia answered 11/6, 2019 at 13:17 Comment(6)
The other trick is that 2's complement int_min = ~int_max, so you can XOR with a compare result or with a _mm_srai_epi32 result (to broadcast the sign bit) to flip max to min based on the sign of one of the inputs. See Signed saturated add of 64-bit ints? for a non-vectorized C version using GNU C __builtin_saddll_overflow to get an overflow flag result from an add instruction, for a branch or cmov.Blackfellow
blendvps is great on Ryzen (single uop per lane even for the VEX encoding). On Intel it's pretty bad (2p5) until Skylake, when the SSE version is 1 uop for any vector ALU (p015), and the VEX version is 2 uops (2p015) regardless of 128 vs. 256-bit. But unlike FP shuffles, FP blends have extra bypass latency between integer vector operations like add, on SnB-family. Still probably worth it vs. broadcasting the sign bit for vpblendvb integer byte-blend though.Blackfellow
You might want to use b as the input that determines the saturation value. People are more likely to write x = sadd(x, 123) than x = sadd(123, x), and a compile-time constant input lets the srli / add optimize away after inlining.Blackfellow
@PeterCordes Regarding the operand order: True, compilers (both clang and gcc) did figure out the value of saturated, but noticing that both xor-operations are redundant apparently is too difficult. I added a sentence about using possible compile-time knowledge (also works, if you know only the sign, but not the magnitude of one argument).Dianoia
This doesn't pass my tests… 405785285 + 833772085 shouldn't saturate, but does.Arboreal
@nemequ: Indeed, godbolt.org/z/cTWcK6GP8 has a test case. Inputs that should saturate don't. Reversing first 2 inputs to _mm_blendv_ps fixes the problem: godbolt.org/z/v1bsc85nG. (I edited the answer to fix the bug.)Blackfellow

Here is a version which works on SSE2, with improvements for SSE4.1 (_mm_blendv_ps), AVX-512VL (_mm_ternarylogic_epi32), and AVX-512DQ (_mm_movepi32_mask, on Peter Cordes' suggestion).

#include <immintrin.h>
#include <stdint.h>

__m128i __mm_adds_epi32( __m128i a, __m128i b) {
  const __m128i int_max = _mm_set1_epi32(INT32_MAX);

  /* normal result (possibly wraps around) */
  const __m128i res = _mm_add_epi32(a, b);

  /* If result saturates, it has the same sign as both a and b */
  const __m128i sign_bit = _mm_srli_epi32(a, 31); /* shift sign to lowest bit */

  #if defined(__AVX512VL__)
    const __m128i overflow = _mm_ternarylogic_epi32(a, b, res, 0x42);
  #else
    const __m128i sign_xor = _mm_xor_si128(a, b);
    const __m128i overflow = _mm_andnot_si128(sign_xor, _mm_xor_si128(a, res));
  #endif

  #if defined(__AVX512DQ__) && defined(__AVX512VL__)
    return _mm_mask_add_epi32(res, _mm_movepi32_mask(overflow), int_max, sign_bit);
  #else
    const __m128i saturated = _mm_add_epi32(int_max, sign_bit);

    #if defined(__SSE4_1__)
      return
        _mm_castps_si128(
          _mm_blendv_ps(
            _mm_castsi128_ps(res),
            _mm_castsi128_ps(saturated),
            _mm_castsi128_ps(overflow)
          )
        );
    #else
      const __m128i overflow_mask = _mm_srai_epi32(overflow, 31);
      return
        _mm_or_si128(
          _mm_and_si128(overflow_mask, saturated),
          _mm_andnot_si128(overflow_mask, res)
        );
    #endif
  #endif
}

I did this for SIMDe's implementation of the NEON vqaddq_s32 (and the MSA __msa_adds_s_b); if you need other versions you should be able to adapt them from simde/arm/neon/qadd.h. For 128-bit vectors, in addition to what SSE supports (8/16-bit, both signed and unsigned) there are:

  • vqaddq_s32 (think _mm_adds_epi32)
  • vqaddq_s64 (think _mm_adds_epi64)
  • vqaddq_u32 (think _mm_adds_epu32)

vqaddq_u64 (think _mm_adds_epu64) is also present, but it currently relies on compiler vector extensions rather than intrinsics. I could (and probably should) just port the generated code to intrinsics, but TBH I'm not sure how to improve on it, so I haven't bothered.
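
As an aside, the unsigned 32-bit case is much simpler. A minimal sketch (not SIMDe's actual code) using SSE4.1's _mm_min_epu32 and the identity adds_epu32(a, b) == a + min(b, ~a):

__m128i mm_adds_epu32( __m128i a, __m128i b )
{
    /* clamp b to ~a == UINT32_MAX - a first, so the add can never wrap */
    const __m128i not_a = _mm_xor_si128( a, _mm_set1_epi32( -1 ) );
    return _mm_add_epi32( a, _mm_min_epu32( b, not_a ) );
}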

Arboreal answered 15/7, 2021 at 17:37 Comment(6)
With AVX-512, it might be a win to replace the blend with a merge-masked add of INT32_MAX + a>>31, merging into r, using a mask from ternlog -> _mm_movepi32_mask (VPMOVD2M). I guess that just trades a blend for a vec->mask instruction for throughput, but VEX-coded blendvps costs 2 uops. As far as critical-path latency, a merge-masking vpaddd xmm{k}, xmm, xmm is 1 cycle latency with no extra bypass latency, but vpmovd2m has 3c latency. (uops.info). With AVX1 vblendvps, there's ILP for the add.Blackfellow
I'm not sure what you're thinking about with the add; the result of the vpternlogd already has the data necessary for a vpmovd2m. llvm-mca doesn't like substituting the blend with a vpmovd2m+vmovdqa32: godbolt.org/z/dq5YhhrGh. It passes my tests, but throughput is a bit worse (and requires AVX-512DQ instead of just AVX-512VL). Or are you thinking about something else? Thanks for fixing the other version; I'll update my code to incorporate that soon since it's a little faster than what I have now.Arboreal
I mean merge-masking r = _mm_mask_add_epi32(r, d2m_saturated_mask, _mm_srli_epi32(a, 31), _mm_set1_epi32(INT32_MAX)) to only ever generate the saturated value in the elements that did saturate. (That might need the mask inverted, which could be accomplished by VPTESTNMD with a mask if necessary.)Blackfellow
Ah, you're talking about in the original version. In what I had posted both values were used when calculating the mask so that wouldn't work. You're right, with that change it's slightly faster; I've updated my post with an updated version.Arboreal
Ah, I hadn't looked at the details of how you computed the overflow condition, and hadn't noticed you just updated a instead of defining a new var. Yeah, separating that is good for instruction-level parallelism, as well as saving uops by using the more efficient overflow detection.Blackfellow
Perhaps even better to use INT_MAX ^ (a>>31) with an arithmetic right shift, instead of the logical right shift + add; some older CPUs (e.g. Intel before Skylake) can run pxor on more ports than paddd, so the surrounding port pressure can maybe be balanced out better. godbolt.org/z/Me8xdsbjj Only downside is when AVX-512 is available, compilers fail to use EVEX vpxord with a dword broadcast memory source the way they do for _mm_add_epi32. GCC uses qword, clang uses VEX-coded with a full 16-byte memory operand.Blackfellow

This link answers this very question:

https://software.intel.com/en-us/forums/topic/285219

Here's an example implementation:

#include <immintrin.h>
#include <stdint.h>

__m128i __inline __mm_adds_epi32( __m128i a, __m128i b )
{
    const __m128i int_min = _mm_set1_epi32( 0x80000000 );
    const __m128i int_max = _mm_set1_epi32( 0x7FFFFFFF );

    __m128i res      = _mm_add_epi32( a, b );
    __m128i sign_and = _mm_and_si128( a, b );
    __m128i sign_or  = _mm_or_si128( a, b );

    __m128i min_sat_mask = _mm_andnot_si128( res, sign_and );
    __m128i max_sat_mask = _mm_andnot_si128( sign_or, res );

    __m128 res_temp = _mm_blendv_ps(_mm_castsi128_ps( res ),
                                    _mm_castsi128_ps( int_min ),
                                    _mm_castsi128_ps( min_sat_mask ) );

    return _mm_castps_si128(_mm_blendv_ps( res_temp,
                                          _mm_castsi128_ps( int_max ),
                                          _mm_castsi128_ps( max_sat_mask ) ) );
}

void addSaturate(int32_t* bufferA, int32_t* bufferB, size_t numSamples)
{
    //
    // Load and add
    //
    __m128i* pSrc1 = (__m128i*)bufferA;
    __m128i* pSrc2 = (__m128i*)bufferB;

    for(size_t i=0; i<numSamples/4; ++i)
    {
        // note: assumes 16-byte-aligned buffers and numSamples divisible by 4;
        // use _mm_loadu_si128/_mm_storeu_si128 for unaligned data
        __m128i res = __mm_adds_epi32(*pSrc1, *pSrc2);
        _mm_store_si128(pSrc1, res);

        pSrc1++;
        pSrc2++;
    }
}
Ostiole answered 8/4, 2015 at 18:43 Comment(3)
You should probably put a real example here, since link-only answers are not recommended (if the page disappears or the URL changes, the answer becomes useless...).Ojibwa
Sure. Will do after lunch ;)Ostiole
int_min = int_max + 1, so this would be more efficient with tmp = _mm_srli_epi32(a, 31) and _mm_add_epi32(int_max, tmp) to select the saturation limit. (A non-negative a can only overflow at the high end; even the most-negative b can't overflow that way.)Blackfellow
