A signed overflow will happen if (and only if):
- the signs of both inputs are the same, and
- the sign of the sum (when added with wrap-around) is different from the sign of the inputs.
Using C operators: overflow = ~(a^b) & (a^(a+b)), where only the sign bit of the result is relevant.
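As a rough scalar illustration of that condition, here is a minimal sketch in plain C. The function name add_overflows_i32 is made up for this example, and the sum is computed in unsigned arithmetic, because signed overflow is undefined behavior in C while unsigned addition wraps.

```c
#include <stdint.h>

/* Hypothetical helper: returns 1 if a + b would overflow as a signed
   32-bit addition, 0 otherwise. */
static inline int add_overflows_i32(int32_t a, int32_t b)
{
    uint32_t ua = (uint32_t)a;
    uint32_t ub = (uint32_t)b;
    uint32_t sum = ua + ub;                       /* wraps modulo 2^32 */
    /* sign bit set iff a and b have equal signs and the sum's sign differs */
    uint32_t overflow = ~(ua ^ ub) & (ua ^ sum);
    return (int)(overflow >> 31);                 /* keep only the sign bit */
}
```

All bits below the sign bit of overflow are meaningless here, which is why the result is shifted down before being returned.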
Also, if an overflow happens, both inputs have the same sign, so the saturated result will have the same sign as either input. Using the int_min = int_max+1 trick suggested by @PeterCordes, and assuming you have at least SSE4.1 (for blendvps), this can be implemented as:
#include <smmintrin.h> // SSE4.1, needed for _mm_blendv_ps

__m128i __mm_adds_epi32( __m128i a, __m128i b )
{
    const __m128i int_max = _mm_set1_epi32( 0x7FFFFFFF );

    // normal result (possibly wraps around)
    __m128i res = _mm_add_epi32( a, b );

    // If result saturates, it has the same sign as both a and b
    __m128i sign_bit  = _mm_srli_epi32( a, 31 ); // shift sign to lowest bit
    __m128i saturated = _mm_add_epi32( int_max, sign_bit ); // int_max, or int_min for negative a

    // saturation happened if inputs do not have different signs,
    // but sign of result is different:
    __m128i sign_xor = _mm_xor_si128( a, b );
    __m128i overflow = _mm_andnot_si128( sign_xor, _mm_xor_si128( a, res ) );

    // blendvps selects per lane on the sign bit of its third operand
    return _mm_castps_si128( _mm_blendv_ps( _mm_castsi128_ps( res ),
                                            _mm_castsi128_ps( saturated ),
                                            _mm_castsi128_ps( overflow ) ) );
}
If your blendvps is as fast as (or faster than) a shift and an addition (also considering port usage), you can of course just blend int_min and int_max, using the sign bits of a as the selector.
Also, if you have only SSE2 or SSE3, you can replace the last blend with an arithmetic shift of overflow by 31 bits to the right, followed by manual blending (using and/andnot/or).
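A minimal sketch of that SSE2-only variant might look as follows. The names adds_epi32_sse2 and adds_epi32_sse2_scalar are chosen for this example only; the scalar wrapper is just a convenience for trying single values and is not part of the technique.

```c
#include <emmintrin.h> /* SSE2 */
#include <stdint.h>

/* Saturating signed 32-bit add using only SSE2 (hypothetical name). */
static inline __m128i adds_epi32_sse2(__m128i a, __m128i b)
{
    const __m128i int_max = _mm_set1_epi32(0x7FFFFFFF);

    /* normal result (possibly wraps around) */
    __m128i res = _mm_add_epi32(a, b);

    /* int_max for non-negative a, int_min (= int_max + 1) for negative a */
    __m128i sign_bit  = _mm_srli_epi32(a, 31);
    __m128i saturated = _mm_add_epi32(int_max, sign_bit);

    /* sign bit set where inputs share a sign but the result does not */
    __m128i sign_xor = _mm_xor_si128(a, b);
    __m128i overflow = _mm_andnot_si128(sign_xor, _mm_xor_si128(a, res));

    /* arithmetic shift broadcasts the sign bit across each 32-bit lane */
    __m128i mask = _mm_srai_epi32(overflow, 31);

    /* manual blend: saturated where mask is all ones, res elsewhere */
    return _mm_or_si128(_mm_and_si128(mask, saturated),
                        _mm_andnot_si128(mask, res));
}

/* Convenience wrapper for testing one pair of values (lane 0 only). */
static inline int32_t adds_epi32_sse2_scalar(int32_t a, int32_t b)
{
    return _mm_cvtsi128_si32(
        adds_epi32_sse2(_mm_set1_epi32(a), _mm_set1_epi32(b)));
}
```

Compared with the blendvps version this costs one shift and three logical operations instead of a single blend, so whether it is worthwhile on SSE4.1-capable hardware would need measuring.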
And naturally, with AVX2 this can take __m256i variables instead of __m128i (should be very easy to rewrite).
Addendum: If you know the sign of either a or b at compile time, you can directly set saturated accordingly, and you can save both _mm_xor_si128 calculations, i.e., overflow would be _mm_andnot_si128(b, res) for positive a and _mm_andnot_si128(res, b) for negative a (with res = a+b).
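The addendum could be sketched roughly like this. The function names are invented for this example, and the final blend is done manually with SSE2 operations so the sketch does not depend on SSE4.1; with blendvps available, overflow could be passed to the blend directly. Each function assumes, without checking, that every lane of a has the stated sign.

```c
#include <emmintrin.h> /* SSE2 */
#include <stdint.h>

/* Assumes every lane of a is >= 0 (hypothetical name). */
static inline __m128i adds_epi32_nonneg_a(__m128i a, __m128i b)
{
    const __m128i int_max = _mm_set1_epi32(INT32_MAX);
    __m128i res = _mm_add_epi32(a, b);
    /* overflow iff b is non-negative and res came out negative */
    __m128i overflow = _mm_andnot_si128(b, res);
    __m128i mask = _mm_srai_epi32(overflow, 31);
    return _mm_or_si128(_mm_and_si128(mask, int_max),
                        _mm_andnot_si128(mask, res));
}

/* Assumes every lane of a is < 0 (hypothetical name). */
static inline __m128i adds_epi32_neg_a(__m128i a, __m128i b)
{
    const __m128i int_min = _mm_set1_epi32(INT32_MIN);
    __m128i res = _mm_add_epi32(a, b);
    /* overflow iff b is negative and res came out non-negative */
    __m128i overflow = _mm_andnot_si128(res, b);
    __m128i mask = _mm_srai_epi32(overflow, 31);
    return _mm_or_si128(_mm_and_si128(mask, int_min),
                        _mm_andnot_si128(mask, res));
}

/* Convenience wrappers for testing one pair of values (lane 0 only). */
static inline int32_t adds_nonneg_a_scalar(int32_t a, int32_t b)
{
    return _mm_cvtsi128_si32(
        adds_epi32_nonneg_a(_mm_set1_epi32(a), _mm_set1_epi32(b)));
}

static inline int32_t adds_neg_a_scalar(int32_t a, int32_t b)
{
    return _mm_cvtsi128_si32(
        adds_epi32_neg_a(_mm_set1_epi32(a), _mm_set1_epi32(b)));
}
```

Note that a lane with a equal to zero is handled correctly by the non-negative version, since res then equals b and the overflow mask is zero.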
Test case / demo: https://godbolt.org/z/v1bsc85nG
subus(a, b) == max(a, b) - b with SSE4.1's pmaxud – Chesterfieldian