How to simulate pcmpgtq on sse2?

PCMPGTQ was introduced in SSE4.2. It performs a signed greater-than comparison on packed 64-bit integers, yielding an all-ones or all-zeros mask per element.

How does one support this functionality on instruction sets predating SSE4.2?

Update: The same question applies to ARMv7 with NEON, which also lacks a 64-bit comparator. The sister question to this is found here: What is the most efficient way to support CMGT with 64bit signed comparisons on ARMv7a with Neon?
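
For reference, a scalar model of the operation being asked about (my addition for clarity, not part of the original question): each 64-bit lane of the result is all ones if the corresponding lane of a is greater than that of b as signed integers, otherwise all zeros.

#include <stdint.h>

/* scalar reference for a 64-bit signed greater-than mask over two lanes */
static void cmpgtq_scalar (const int64_t a[2], const int64_t b[2], uint64_t out[2])
{
    for (int i = 0; i < 2; i++)
        out[i] = (a[i] > b[i]) ? UINT64_MAX : 0; /* all-ones mask on "greater than" */
}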

Smalltime asked 6/12, 2020 at 8:36 Comment(5)
You could try something with comparing 32-bit components, and cleverly combining the results -- or doing a 64-bit subtraction and checking the signs of the operands and the result. But I doubt that this is worth the effort, compared to doing two scalar comparisons (you'd need to show more context)Essy
We're evaluating whether or not to support this in WebAssembly SIMD when we can reasonably expect the data to already be in vectors.Smalltime
I'm not sure it's a good idea to add NEON to the question after the fact when there are a bunch of x86-only answers, but if NEON has immediate 32-bit-granularity shuffles and 64-bit SIMD integer subtract then aqrit's strategy / algorithm should work there, too. If you really want an actual NEON asm version of the same question, ask it again with a link to this as the x86 version.Gimcrackery
Generally high-level languages with a perf focus should provide a wide set of primitives for anything that's efficient in HW on any machine, not just all or most machines. C failed to do that, and it's a huge pain in the ass (or impossible) to get anything done efficiently in pure C beyond the basics (popcount, bit-scan, add-with-carry-in, even rotates require care to avoid UB and still compile efficiently). Rust chose more wisely for integer operations. It does make sense to document that some operations are extra slow on some machines.Gimcrackery
But I don't think it makes sense to consider gimping a JITed language like webasm for the majority of machines just because you'd need a clunky SIMD or scalar fallback on some machines. Especially not when you already showed that existing compilers can implement something for native vectors.Gimcrackery
#include <emmintrin.h> /* SSE2 intrinsics */

__m128i pcmpgtq_sse2 (__m128i a, __m128i b) {
    /* where the high dwords are equal, the 64-bit subtract puts the low-dword borrow into the high dword */
    __m128i r = _mm_and_si128(_mm_cmpeq_epi32(a, b), _mm_sub_epi64(b, a));
    /* where the high dwords differ, the signed 32-bit compare of the high dwords decides */
    r = _mm_or_si128(r, _mm_cmpgt_epi32(a, b));
    /* broadcast each qword's high dword to both of its halves to form the 64-bit mask */
    return _mm_shuffle_epi32(r, _MM_SHUFFLE(3,3,1,1));
}

We have 32-bit signed comparison intrinsics, so split the packed qwords into dword pairs.

If the high dword in a is greater than the high dword in b then there is no need to compare the low dwords.

if (a.hi > b.hi) { r.hi = 0xFFFFFFFF; }
if (a.hi <= b.hi) { r.hi = 0x00000000; }

If the high dword in a is equal to the high dword in b, then a 64-bit subtract will either clear or set all 32 high bits of the result: the equal high dwords "cancel" each other out, so the subtraction is effectively an unsigned compare of the low dwords, with the borrow landing in the high dword (all ones exactly when the low dword of a is greater, unsigned, than the low dword of b).

if (a.hi == b.hi) { r = (b - a) & 0xFFFFFFFF00000000; }

Copy the comparison mask from the high 32 bits of each qword to the low 32 bits.

r.lo = r.hi
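
A quick sanity check (my addition, not part of the original answer; it assumes the pcmpgtq_sse2 function above and an SSE2-capable compiler):

#include <emmintrin.h>
#include <stdint.h>
#include <stdio.h>

int main (void)
{
    int64_t av[2] = { -1, 5 }, bv[2] = { 0, 5 };
    __m128i a = _mm_set_epi64x (av[1], av[0]); /* element 0 is the low qword */
    __m128i b = _mm_set_epi64x (bv[1], bv[0]);
    int64_t r[2];
    _mm_storeu_si128 ((__m128i *)r, pcmpgtq_sse2 (a, b));
    for (int i = 0; i < 2; i++)
        printf ("%lld > %lld : got %016llx, expected %016llx\n",
                (long long)av[i], (long long)bv[i], (unsigned long long)r[i],
                (unsigned long long)(av[i] > bv[i] ? -1LL : 0LL));
    return 0;
}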

Updated: Here's the Godbolt

Turner answered 7/12, 2020 at 3:28 Comment(3)
That's quite nice -- avoiding any arithmetic shifts to broadcast the sign-bit to all elements. I'd recommend adding a bit more explanation, though.Essy
Yup, compiles nicely on godbolt.org/z/nMzx6T. psubq is 2 uops and 2 cycle latency on Conroe/Wolfdale (the last Intel CPU before Nehalem introduced SSE4.2 pcmpgtq) vs. 1 uop / 1 cycle for most SIMD integer math uops.info/table.html, but is still only 9 total uops (with clang) so this looks like a win vs. other answers. (But maybe similar to bouncing to scalar and back, maybe worth using between other SIMD code but probably not if data is scalar or in memory to start with.)Gimcrackery
This looks really good guys. I'm going to profile this later today on ARMv7+Neon to see if it'll be effective there too.Smalltime

The code below demonstrates that the desired functionality can be emulated using pre-existing SSE instructions by working from first principles: signed "less than" predicate being true is equivalent to the overflow flag differing from the sign flag after subtraction. Obviously we do not have flags, so the flags need to be emulated as well, leaving the resulting predicate in the most significant bit (MSB) of each quadword. In a second step we can then expand the MSB into a quadword-sized mask.

The resulting fairly lengthy code is a pretty good indication that this is probably not the best way of emulating the desired functionality from a performance perspective, even though it is functionally correct. Compiler Explorer (godbolt.org) shows twelve instructions when compiling with Clang 11 (please refer to the assembly listing below).
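
As background, the same "OF != SF" identity can be written in scalar form. The sketch below is only an illustration of the principle, not the answer's code; the intrinsic version further down has to synthesize the carry differently because SSE2 offers neither 64-bit compares nor flags.

#include <stdint.h>

/* bit 63 of the result is the predicate "a < b" (signed), derived from a - b:
   if the signs of a and b agree there is no overflow and the sign of a - b decides;
   if they differ, a < b exactly when a is negative */
static uint64_t lt_msb (uint64_t a, uint64_t b)
{
    uint64_t d = a - b;             /* wraps modulo 2^64, like the hardware subtraction */
    return ((a ^ b) & (a ^ d)) ^ d; /* predicate lands in the most significant bit */
}

Sign-extending that bit then produces the mask, which is what q_sign_to_mask does in the code below.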

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include "nmmintrin.h"

#define  USE_SSE42_REF (0)
#define  NBR_OF_TESTS  (1000000000)

#if USE_SSE42_REF
__m128i pcmpgtq_ref (__m128i a, __m128i b)
{
    return _mm_cmpgt_epi64 (a, b);
}
#else // USE_SSE42_REF
__m128i pcmpgtq_ref (__m128i a, __m128i b)
{
    __m128i r;
    struct {
        int64_t lo;
        int64_t hi;
    } hilo_a, hilo_b, hilo_r;
    memcpy (&hilo_a, &a, sizeof hilo_a);
    memcpy (&hilo_b, &b, sizeof hilo_b);
    hilo_r.lo = hilo_a.lo > hilo_b.lo ? (-1LL) : 0LL;
    hilo_r.hi = hilo_a.hi > hilo_b.hi ? (-1LL) : 0LL;
    memcpy (&r, &hilo_r, sizeof r);
    return r;
}
#endif // USE_SSE42_REF

/* "signed less than" == (OF != SF); compute predicate in MSB of each byte */
__m128i ltq_core (__m128i a, __m128i b)
{
    __m128i m = _mm_set1_epi64x (0x7fffffffffffffffULL);
    __m128i c = _mm_and_si128 (b, m);
    __m128i d = _mm_andnot_si128 (a, m);
    __m128i t = _mm_add_epi64 (c, d);
    __m128i s = _mm_xor_si128 (a, b);
    __m128i x = _mm_xor_si128 (a, t);
    __m128i y = _mm_and_si128 (x, s);
    __m128i r = _mm_xor_si128 (y, t);
    return r;
}

/* extend sign bits into mask, quadword-wise */
__m128i q_sign_to_mask (__m128i a)
{
    __m128i q = _mm_set1_epi64x (0);
    __m128i s = _mm_srli_epi64 (a, 63); /* isolate the sign bit: 0 or 1 per qword */
    __m128i r = _mm_sub_epi64 (q, s);   /* 0 - {0,1} = all zeros or all ones */
    return r;
}

__m128i pcmpltq (__m128i a, __m128i b)
{
    return q_sign_to_mask (ltq_core (a, b));
}

__m128i pcmpgtq (__m128i a, __m128i b)
{
    return pcmpltq (b, a);
}

/*
  https://groups.google.com/forum/#!original/comp.lang.c/qFv18ql_WlU/IK8KGZZFJx4J
  From: geo <[email protected]>
  Newsgroups: sci.math,comp.lang.c,comp.lang.fortran
  Subject: 64-bit KISS RNGs
  Date: Sat, 28 Feb 2009 04:30:48 -0800 (PST)

  This 64-bit KISS RNG has three components, each nearly
  good enough to serve alone.    The components are:
  Multiply-With-Carry (MWC), period (2^121+2^63-1)
  Xorshift (XSH), period 2^64-1
  Congruential (CNG), period 2^64
*/

static uint64_t kiss64_x = 1234567890987654321ULL;
static uint64_t kiss64_c = 123456123456123456ULL;
static uint64_t kiss64_y = 362436362436362436ULL;
static uint64_t kiss64_z = 1066149217761810ULL;
static uint64_t kiss64_t;

#define MWC64  (kiss64_t = (kiss64_x << 58) + kiss64_c, \
                kiss64_c = (kiss64_x >> 6), kiss64_x += kiss64_t, \
                kiss64_c += (kiss64_x < kiss64_t), kiss64_x)
#define XSH64  (kiss64_y ^= (kiss64_y << 13), kiss64_y ^= (kiss64_y >> 17), \
                kiss64_y ^= (kiss64_y << 43))
#define CNG64  (kiss64_z = 6906969069ULL * kiss64_z + 1234567ULL)
#define KISS64 (MWC64 + XSH64 + CNG64)

int main (void)
{
    struct {
        uint64_t lo;
        uint64_t hi;
    } hilo_a, hilo_b, hilo_res, hilo_ref;

    for (int i = 0; i < NBR_OF_TESTS; i++) {
        uint64_t al = KISS64;
        uint64_t ah = KISS64;
        uint64_t bl = KISS64;
        uint64_t bh = KISS64;

        if ((i & 0xff) == 0x00) bl = al; // increase chance of equality
        if ((i & 0xff) == 0xff) bh = ah; // increase chance of equality

        __m128i a = _mm_set_epi64x (ah, al);
        __m128i b = _mm_set_epi64x (bh, bl);

        __m128i res = pcmpgtq (a, b);
        __m128i ref = pcmpgtq_ref (a, b);
        
        memcpy (&hilo_res, &res, sizeof hilo_res);
        memcpy (&hilo_ref, &ref, sizeof hilo_ref);

        if ((hilo_res.hi != hilo_ref.hi) || (hilo_res.lo != hilo_ref.lo)) {
            memcpy (&hilo_a, &a, sizeof hilo_a);
            memcpy (&hilo_b, &b, sizeof hilo_b); 
            printf ("error: a=%016llx_%016llx  b=%016llx_%016llx  res=%016llx_%016llx  ref=%016llx_%016llx\n",
                    hilo_a.hi, hilo_a.lo, hilo_b.hi, hilo_b.lo,
                    hilo_res.hi, hilo_res.lo, hilo_ref.hi, hilo_ref.lo);
            return EXIT_FAILURE;
        }
    }
    return EXIT_SUCCESS;
}

When compiling with -msse2, the output from Clang 11 looks as follows:

.LCPI4_0:
        .quad   9223372036854775807             # 0x7fffffffffffffff
        .quad   9223372036854775807             # 0x7fffffffffffffff
pcmpgtq:                                # @pcmpgtq
        movdqa  xmm2, xmmword ptr [rip + .LCPI4_0] # xmm2 = [9223372036854775807,9223372036854775807]
        movdqa  xmm3, xmm1
        pxor    xmm3, xmm0
        pand    xmm0, xmm2
        movdqa  xmm4, xmm1
        pandn   xmm4, xmm2
        paddq   xmm4, xmm0
        pand    xmm1, xmm3
        pandn   xmm3, xmm4
        por     xmm3, xmm1
        psrad   xmm3, 31
        pshufd  xmm0, xmm3, 245                 # xmm0 = xmm3[1,1,3,3]
        ret
Burtburta answered 6/12, 2020 at 12:58 Comment(12)
In q_sign_to_mask, _mm_srli_epi64 (p, 63); can take its input from a directly, no need to mask away the bits you're shifting out anyway. And won't _mm_add_epi64 (p, p); always produce zero? You're effectively shifting left by 1, after clearing all but the high bit, right? Also, a more efficient implementation might use pshufd to replicate the high dword to both halves, then pcmpgtd to do the same 32-bit compare like _mm_cmpgt_epi32(0, _mm_shuffle_epi32(a, _MM_SHUFFLE(3,3,1,1))) producing 0 or -1 according to the sign bit of each input qword, emulating srai(a, 63)Gimcrackery
Oh, or _mm_srai_epi32(_mm_shuffle_epi32(a, _MM_SHUFFLE(3,3,1,1)), 31) to use 32-bit arithmetic right shift in each half of the qword.Gimcrackery
The OP's compiler-generated code using pcmpgtd as a building block looks pretty good, compiled from GNU C native vectors of int64_t.Gimcrackery
@PeterCordes is that specific to this answer or the other one?Smalltime
@DanWeber: Is what specific to this answer? Are you talking about one of my earlier comments about the details of this answer?Gimcrackery
@PeterCordes Thanks for pointing out various sub-optimalities. I wrote the code between 3am and 5am just before heading for bed so my mental acuity was low. I have now simplified q_sign_to_mask based on the deficiencies you pointed out; the compiler seems to do the rest.Burtburta
@njuffa: Yup, I've had the same experience, over-complicating something the first time through. Please don't encourage people to write intrinsics like _mm_xor_si128 (a, a). Use _mm_setzero_si128() or _mm_set1_epi32(0) without any implied source value and let the compiler use xor-zeroing. (GCC and clang optimize away the dependency, but it seems unwise). Also note that clang optimizes the code into almost what I suggested: psrad / pshufd, instead of pxor / psrlq / psubq, so a completely different strategy for sign-extension. GCC would likely be more faithful.Gimcrackery
Too bad clang doesn't see that it can reuse the same constant with pandn, although that might actually cost an extra instruction to copy it vs. using a memory source.Gimcrackery
@PeterCordes I was asking whether or not the compiler optimization was referring to this answer or the other one which definitely has compiler optimization. Err. I see both do..Smalltime
@PeterCordes clang seems not to like using pandn at all (except for very basic expressions): E.g.: godbolt.org/z/h3dorWEssy
@chtz: Wow, yeah, even with -O3 instead of -O1. That anti-optimization looks like a regression from clang6 (pandn) to clang7 (pcmpeqd/pxor/pand).Gimcrackery
@DanWeber: My initial comments on this question were before the edit, just assuming that intrinsics would get mapped directly to asm instructions (with extra movdqa where necessary to deal with destructive non-VEX instructions).Gimcrackery

I'm not sure this is the optimal output, but this is what Clang generates for x64. I've also taken the same implementation and converted it to support ARMv7 with NEON.
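
In intrinsics, the approach behind the generated code below looks roughly like this (my reconstruction for readability, since the answer only links the Godbolt; the trick is to flip the sign bit of each low dword so a signed 32-bit compare acts as an unsigned one, then combine the high-dword and low-dword results):

#include <emmintrin.h>

__m128i cmpgtq_sim_sketch (__m128i a, __m128i b)
{
    const __m128i bias = _mm_set1_epi64x (0x0000000080000000LL); /* flips bit 31 of each qword, i.e. the low dword's sign */
    __m128i ab = _mm_xor_si128 (a, bias);
    __m128i bb = _mm_xor_si128 (b, bias);
    __m128i gt = _mm_cmpgt_epi32 (ab, bb); /* lanes 1,3: signed hi compare; lanes 0,2: unsigned lo compare */
    __m128i eq = _mm_cmpeq_epi32 (ab, bb);
    __m128i gt_lo = _mm_shuffle_epi32 (gt, _MM_SHUFFLE(2,2,0,0));
    __m128i eq_hi = _mm_shuffle_epi32 (eq, _MM_SHUFFLE(3,3,1,1));
    __m128i gt_hi = _mm_shuffle_epi32 (gt, _MM_SHUFFLE(3,3,1,1));
    return _mm_or_si128 (gt_hi, _mm_and_si128 (eq_hi, gt_lo)); /* a > b  ==  hi > hi  ||  (hi == hi && lo > lo) */
}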

See the Godbolt and the assembly below:

.LCPI0_0:
        .quad   2147483648                      # 0x80000000
        .quad   2147483648                      # 0x80000000
cmpgtq_sim(long __vector(2), long __vector(2)):                  # @cmpgtq_sim(long __vector(2), long __vector(2))
        movdqa  xmm2, xmmword ptr [rip + .LCPI0_0] # xmm2 = [2147483648,2147483648]
        pxor    xmm1, xmm2
        pxor    xmm0, xmm2
        movdqa  xmm2, xmm0
        pcmpgtd xmm2, xmm1
        pshufd  xmm3, xmm2, 160                 # xmm3 = xmm2[0,0,2,2]
        pcmpeqd xmm0, xmm1
        pshufd  xmm1, xmm0, 245                 # xmm1 = xmm0[1,1,3,3]
        pand    xmm1, xmm3
        pshufd  xmm0, xmm2, 245                 # xmm0 = xmm2[1,1,3,3]
        por     xmm0, xmm1

ARMv7+Neon

cmpgtq_sim(__simd128_int64_t, __simd128_int64_t):
        vldr    d16, .L3
        vldr    d17, .L3+8
        veor    q0, q0, q8
        veor    q1, q1, q8
        vcgt.s32        q8, q0, q1
        vceq.i32        q1, q0, q1
        vmov    q9, q8  @ v4si
        vmov    q10, q1  @ v4si
        vtrn.32 q9, q8
        vmov    q0, q8  @ v4si
        vmov    q8, q1  @ v4si
        vtrn.32 q10, q8
        vand    q8, q8, q9
        vorr    q0, q8, q0
        bx      lr
.L3:
        .word   -2147483648
        .word   0
        .word   -2147483648
        .word   0
Smalltime answered 6/12, 2020 at 18:16 Comment(0)
