Is there a way to subtract packed unsigned doublewords, saturated, on x86, using MMX/SSE?

I've been looking at MMX/SSE and I am wondering. There are instructions for packed, saturated subtraction of unsigned bytes and words, but not doublewords.

Is there a way of doing what I want, or if not, why is there none?

Antonioantonius answered 10/6, 2019 at 12:6 Comment(10)
You can use compare and mask. As to why it doesn't exist as a single instruction, it's anybody's guess.Largescale
I don't understand. How would I do that?Nawrocki
Are your source values signed or unsigned ? It’s fairly easy if the inputs are unsigned, slightly trickier if they are signed.Trinatte
Check out how LLVM auto-vectorizes Rust u32.saturating_sub(): godbolt.org/z/huP4PX - range-shift to signed with PXOR, then PCMPGTD signed-compare, then AND/ANDN/OR to apply saturation to a PSUBD result. I'm not sure this is optimal; it should just need PAND because the only saturation case for unsigned subtraction is saturation to 0.Rheometer
You can use subus(a, b) == max(a, b) - b - E: well that's good with SSE4.1, does MMX/SSE mean literally only MMX and SSE?Chordate
@harold: oh yes, that's very good with SSE4.1 for pmaxud.Rheometer
Wow, these blow up my budget. I asked my question because I wanted to avoid cmp altogether. @PaulR: unsigned, as stated in the Q. I guess SSE4 would be fine? I'd have to check. Peter, @harold, thank you for your suggestions, I will look into the performance of these.Nawrocki
I don't know which of your comments would be the best answer...Nawrocki
@Antonioantonius The best solution depends on some context, e.g., do you care about throughput, latency, portability, etc.Treviso
@Treviso throughput only. i haven't yet had the time to sit down with the suggestions.Nawrocki

If you have SSE4.1 available, I don't think you can do better than the pmaxud+psubd approach suggested by @harold. With AVX2, you can of course also use the corresponding 256-bit variants.

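#include <smmintrin.h> // SSE4.1: _mm_max_epu32 (pmaxud)

__m128i subs_epu32_sse4(__m128i a, __m128i b){
    __m128i mx = _mm_max_epu32(a,b); // max(a,b) >= b, so the subtraction below cannot wrap
    return _mm_sub_epi32(mx, b);     // a-b if a > b, otherwise b-b == 0
}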
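For intuition, the identity behind this can be checked with a scalar model in plain C. This is only a sketch: the function names `subs_u32_ref` and `subs_u32_max` are made up for illustration, and each call models a single 32-bit lane of what pmaxud followed by psubd computes.

```c
#include <stdint.h>

/* Reference: unsigned saturating subtraction, clamps at 0. */
static uint32_t subs_u32_ref(uint32_t a, uint32_t b) {
    return a > b ? a - b : 0u;
}

/* Same result via the identity subus(a, b) == max(a, b) - b.
 * If a >= b the maximum is a and we get a - b;
 * otherwise the maximum is b and we get b - b == 0. */
static uint32_t subs_u32_max(uint32_t a, uint32_t b) {
    uint32_t mx = a > b ? a : b;
    return mx - b;
}
```

Since max(a, b) is never smaller than b, the subtraction cannot wrap, which is why no separate compare or mask is needed in the SSE4.1 version.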

Without SSE4.1, you need to compare both arguments in some way. Unfortunately, there is no epu32 comparison (not before AVX512), but you can simulate one by first adding 0x80000000 (which is equivalent to xor-ing in this case) to both arguments:

__m128i cmpgt_epu32(__m128i a, __m128i b) {
    const __m128i highest = _mm_set1_epi32(0x80000000);
    return _mm_cmpgt_epi32(_mm_xor_si128(a,highest),_mm_xor_si128(b,highest));
}

__m128i subs_epu32(__m128i a, __m128i b){
    __m128i not_saturated = cmpgt_epu32(a,b);
    return _mm_and_si128(not_saturated, _mm_sub_epi32(a,b));
}
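Since throughput over whole buffers was mentioned in the comments, here is a rough sketch of how the SSE2-only version might be applied to arrays. `subs_u32_array` is a hypothetical helper name, and the unaligned loads/stores and the scalar tail loop are assumptions made for simplicity, not something that was measured.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stddef.h>
#include <stdint.h>

/* Unsigned 32-bit "greater than": flipping the sign bit of both inputs
 * makes the signed compare (pcmpgtd) order them as unsigned values. */
static __m128i cmpgt_epu32(__m128i a, __m128i b) {
    const __m128i highest = _mm_set1_epi32((int)0x80000000u);
    return _mm_cmpgt_epi32(_mm_xor_si128(a, highest),
                           _mm_xor_si128(b, highest));
}

/* Saturating a - b per lane: keep the difference only where a > b. */
static __m128i subs_epu32(__m128i a, __m128i b) {
    return _mm_and_si128(cmpgt_epu32(a, b), _mm_sub_epi32(a, b));
}

/* Hypothetical driver: dst[i] = saturating(a[i] - b[i]) for i < n. */
static void subs_u32_array(uint32_t *dst, const uint32_t *a,
                           const uint32_t *b, size_t n) {
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {  /* four lanes per iteration */
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        _mm_storeu_si128((__m128i *)(dst + i), subs_epu32(va, vb));
    }
    for (; i < n; ++i)            /* scalar tail for the last n % 4 elements */
        dst[i] = a[i] > b[i] ? a[i] - b[i] : 0u;
}
```

In a real loop the 0x80000000 constant would normally be hoisted out by the compiler, so its load cost is paid once rather than per iteration.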

In some cases, it might be better to replace the comparison with some bit-twiddling on the highest bit, which is then broadcast to every bit of its lane with an arithmetic shift. Compared to the version above, this replaces a pcmpgtd and three bit-logic operations (plus having to load 0x80000000 at least once) with a psrad and five bit-logic operations:

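__m128i subs_epu32_(__m128i a, __m128i b) {
    __m128i r = _mm_sub_epi32(a,b);
    __m128i c = (~a & b) | (r & ~(a^b)); // sign bit of c = borrow out of a-b (set iff a < b). Works with gcc/clang; replace by corresponding intrinsics, if necessary (note that `andnot` is a single instruction)
    return _mm_andnot_si128(_mm_srai_epi32(c,31), r); // zero the lanes where the subtraction borrowed, keep r elsewhere
}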
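The borrow logic is easier to follow in scalar form. A minimal sketch for one 32-bit lane (the name `subs_u32_borrow` is made up here): the top bit of c is the borrow out of a - b, so it is set exactly when a < b, and the result has to be forced to zero in that case.

```c
#include <stdint.h>

/* Scalar model of the shift-based variant for a single 32-bit lane. */
static uint32_t subs_u32_borrow(uint32_t a, uint32_t b) {
    uint32_t r = a - b;                      /* wraps modulo 2^32 */
    uint32_t c = (~a & b) | (r & ~(a ^ b));  /* bit 31 = borrow out of a - b */
    uint32_t borrow_mask = 0u - (c >> 31);   /* all ones iff a < b (psrad by 31 in SIMD) */
    return r & ~borrow_mask;                 /* 0 where the subtraction wrapped */
}
```

The case analysis on the top bits: if a has its top bit clear and b has it set, a < b for sure; if the top bits are equal, the borrow out equals the borrow coming in from the lower 31 bits, which shows up as the top bit of r.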

Godbolt-Link, also including adds_epu32 variants: https://godbolt.org/z/n4qaW1 Strangely, clang needs more register copies than gcc for the non-SSE4.1 variants. On the other hand, clang finds the pmaxud optimization for the cmpgt_epu32 variant when compiled with SSE4.1: https://godbolt.org/z/3o5KCm
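The adds_epu32 variants from the Godbolt link are not reproduced here. As a rough idea of what an SSE2-only saturating add could look like, here is one possible sketch. It is an assumption for illustration, with the made-up name `adds_epu32_sketch`, and not necessarily the code behind the link. It relies on the fact that an unsigned sum wraps exactly when it ends up smaller than one of its inputs, and that the saturated value is all ones.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* Unsigned 32-bit "greater than" via sign-bit flip and signed compare. */
static __m128i cmpgt_epu32(__m128i a, __m128i b) {
    const __m128i highest = _mm_set1_epi32((int)0x80000000u);
    return _mm_cmpgt_epi32(_mm_xor_si128(a, highest),
                           _mm_xor_si128(b, highest));
}

/* Saturating a + b per lane (illustrative, see the assumptions above). */
static __m128i adds_epu32_sketch(__m128i a, __m128i b) {
    __m128i r = _mm_add_epi32(a, b);       /* wraps modulo 2^32 */
    __m128i overflow = cmpgt_epu32(a, r);  /* all ones where the sum wrapped */
    return _mm_or_si128(r, overflow);      /* wrapped lanes become 0xFFFFFFFF */
}
```

Unlike the subtraction, the saturation value here is all ones, so the mask is applied with OR instead of AND.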

Treviso answered 17/6, 2019 at 14:24 Comment(0)
