What is the most efficient way to support CMGT with 64-bit signed comparisons on ARMv7a with NEON?
This question was originally posed for SSE2 here. Since every algorithm in the answers there overlapped with ARMv7a+NEON's support for the same operations, the question was updated to include the ARMv7+NEON versions. At a commenter's request, it is asked here separately to show that it is indeed a distinct topic and to collect alternative solutions that may be more practical for ARMv7+NEON. The net purpose of these questions is to find ideal implementations for consideration in WebAssembly SIMD.

Yila answered 7/12, 2020 at 23:45 Comment(0)
Signed 64-bit saturating subtract.

Assuming my tests using _mm_subs_epi16 are correct and translate 1:1 to NEON...

uint64x2_t pcmpgtq_armv7 (int64x2_t a, int64x2_t b) {
    return vreinterpretq_u64_s64(vshrq_n_s64(vqsubq_s64(b, a), 63));
}

Would certainly seem to be the fastest achievable way to emulate pcmpgtq.


The free chapter of Hacker's Delight gives the following formulas:

// return (a > b) ? -1LL : 0LL; 
int64_t cmpgt(int64_t a, int64_t b) {
    return ((b & ~a) | ((b - a) & ~(b ^ a))) >> 63; 
}

int64_t cmpgt(int64_t a, int64_t b) {
    return ((b - a) ^ ((b ^ a) & ((b - a) ^ b))) >> 63;
}
Fung answered 9/12, 2020 at 19:4 Comment(4)
The question with respect to WebAssembly is whether either of these solutions will be more efficient than scalarization. Do you have a way of finding out?Yila
x64 doesn't appear to have a saturating subtract for 64 bits, nor a portable psraq. However, is there a solution like this that is faster or more viable than the best SSE2 solution?Yila
@DanWeber, the only thing I can think of for SSE2 would be to abuse floating-point instructions... _mm_cmpge_pd(), etc.Fung
this answer was cited in the official proposal for NEONv7 signed support. github.com/WebAssembly/simd/pull/412#issue-544657198Yila

From the original post, the best x64/SSE2 algorithm, implemented on ARMv7+NEON, works as follows:

(a[32:63] == b[32:63]) & (b[0:63] - a[0:63]) yields a mask whose upper 32 bits are all ones exactly when the top 32 bits are equal and a[0:31] > b[0:31]: the 64-bit difference b - a is then negative, so its sign propagates into the upper half. In every other case, such as when the top 32 bits differ or a[0:31] < b[0:31], those bits are 0. This has the effect of taking the bottom 32 bits of each integer and propagating them into the upper 32 bits as a mask whenever the top 32 bits are inconsequential and the lower 32 bits are significant. For the remaining cases, it ORs in the 32-bit signed comparison of the top halves: for example, if a[32:63] > b[32:63], then a is definitely greater than b, regardless of the least significant bits. Finally, it swizzles/shuffles/transposes the upper 32 bits of each 64-bit mask into the lower 32 bits to produce a full 64-bit mask.

An illustrative example implementation is in this Godbolt.

Yila answered 7/12, 2020 at 23:45 Comment(3)
Consider replacing vtrn.32 with vshr.s64.Fung
What is the cost of vbsl and should it replace {and/or} ?Fung
I'm not sure. I can't seem to get llvm mca to work properly with ARMv7.Yila
