A64 Neon SIMD - 256-bit comparison

Asked 20/4, 2015 at 8:34 Answered 28/3, 2022 at 3:8

I would like to compare two little-endian 256-bit values with A64 Neon instructions (asm) efficiently.

Equality (=)

For equality, I already got a solution:

bool eq256(const UInt256 *lhs, const UInt256 *rhs) {
    bool result;

First, load the two values into SIMD registers.

    __asm__("ld1.2d    { v0, v1 }, %1    \n\t"
            "ld1.2d    { v2, v3 }, %2    \n\t"

Compare each 64-bit limb of the values with each other. This results in -1 (all bits set) for those limbs that are equal, and 0 (all bits clear) if a bit differs.

            "cmeq.2d   v0, v0, v2        \n\t"
            "cmeq.2d   v1, v1, v3        \n\t"

Reduce the result from 2 vectors to 1 vector, keeping only the one that contains "0 (all bits clear)" if there is any.

            "uminp.16b v0, v0, v1        \n\t"

Reduce the result from 1 vector to 1 byte, keeping only a byte with zeroes if there is any.

            "uminv.16b b0, v0            \n\t"

Move to ARM register, then compare with 0xFF. This is the result.

            "umov      %w0, v0.b[0]      \n\t"
            "cmp       %w0, 0xFF         \n\t"
            "cset      %w0, eq               "
            : "=r" (result)
            : "m" (*lhs->value), "m" (*rhs->value)
            : "v0", "v1", "v2", "v3", "cc");
    return result;
}

Questions

Is this more efficient than doing the 4 comparisons with plain old ARM registers?
- e.g. is there a source that quotes timings for the different operations? I'm doing this on iPhone 5s.
Is there a way to optimize this even further? I think that I waste many cycles just to reduce the whole vector to a single scalar boolean.

Less Than comparison (<)

Let's represent the two ints as tuples of 64-bit limbs (little-endian):

lhs = (l0, l1, l2, l3)
rhs = (r0, r1, r2, r3)

Then, lhs < rhs, if this evaluates to true:

(l3 < r3) &     1     &     1     &     1     |
(l3 = r3) & (l2 < r2) &     1     &     1     |
(l3 = r3) & (l2 = r2) & (l1 < r1) &     1     |
(l3 = r3) & (l2 = r2) & (l1 = r1) & (l0 < r0)

SIMD instructions can now be used to evaluate multiple operands at a time. Assuming (l1, l2), (l3, l4), (r1, r2), (r3, r4) is the way that the two 256-bit numbers are stored, we can easily get all of the required values (useful values in bold):

cmlo.2d => (l1 < r1), (l2 < r2)
cmlo.2d => (l3 < r3), (l4 < r4)
cmeq.2d => (l1 = r1), (l2 = r2)
cmeq.2d => (l3 = r3), (l4 = r4)

Questions

With these values in four SIMD registers, I now wonder what the best strategy is to apply the & and | operators, and then reducing it to a single boolean.

Update

I just punched together a working implementation for "less than".

Basically, I replaced the 1s above with a duplicate condition, because A & A == A & 1.

Then, I lay out the three 2x2 squares in my matrix, and bitwise AND them. Now, I reduce with bitwise ORs - first from two vectors to one vector, then to one byte, then copy to ARM register, and test for 0xFF. Same pattern as for equality above.

Question above is still valid. I'm not sure if the code is optimal yet, and wonder if I missed some general SIMD pattern to do such stuff more efficiently. Also: Is NEON worth it for such cases, when the input operands come from memory?

bool lt256(const UInt256 *lhs, const UInt256 *rhs) {
    bool result;
    __asm__(// (l3 < r3) & (l3 < r3) |
            // (l3 = r3) & (l2 < r2) |
            // (l3 = r3) & (l2 = r2) & (l1 < r1) & (l1 < r1) |
            // (l3 = r3) & (l2 = r2) & (l1 = r1) & (l0 < r0)

            "ld1.2d    { v0, v1 }, %1    \n\t"
            "ld1.2d    { v2, v3 }, %2    \n\t"

            // v0: [ l3 = r3 ] [ l2 = r2 ]
            // v1: [ l0 < r0 ] [ l1 < r1 ]
            // v2: [ l0 = r0 ] [ l1 = r1 ]
            // v3: [ l2 < r2 ] [ l3 < r3 ]
            // v4: [ l2 = r2 ] [ l3 = r3 ]
            "cmeq.2d   v4, v1, v3        \n\t"
            "cmlo.2d   v3, v1, v3        \n\t"
            "cmlo.2d   v1, v0, v2        \n\t"
            "cmeq.2d   v2, v0, v2        \n\t"
            "ext.16b   v0, v4, v4, 8     \n\t"

            // v2: [ l1 < r1 ] [ l1 = r1 ]
            // v1: [ l1 < r1 ] [ l0 < r0 ]
            "trn2.2d   v2, v1, v2        \n\t"
            "ext.16b   v1, v1, v1, 8     \n\t"

            // v1: [ l1 < r1  &  l1 < r1 ] [ l1 = r1  &  l0 < r0 ]
            "and.16b   v1, v2, v1        \n\t"

            // v2: [ l3 < r3 ] [ l3 = r3 ]
            // v3: [ l3 < r3 ] [ l2 < r2 ]
            "ext.16b   v2, v3, v0, 8     \n\t"
            "ext.16b   v3, v3, v3, 8     \n\t"

            // v3: [ l3 < r3  &  l3 < r3 ] [ l3 = r3  &  l2 < r2 ]
            "and.16b   v3, v2, v3        \n\t"

            // v2: [ l3 = r3 ] [ l3 = r3 ]
            // v4: [ l2 = r2 ] [ l2 = r2 ]
            "ext.16b   v2, v4, v0, 8     \n\t"
            "ext.16b   v4, v0, v4, 8     \n\t"

            // v2: [ l3 = r3  &  l2 = r2 ] [ l3 = r3  &  l2 = r2 ]
            "and.16b   v2, v2, v4        \n\t"

            // v1: [ l3 = r3 & l2 = r2 & l1 < r1 & l1 < r1 ]
            //     [ lr = r3 & l2 = r2 & l1 = r1 & l0 < r0 ]
            "and.16b   v1, v2, v1        \n\t"

            // v1: [ l3 < r3 & l3 < r3 |
            //       l3 = r3 & l2 = r2 & l1 < r1 & l1 < r1 ]
            //     [ l3 = r3 & l2 < r2 |
            //       lr = r3 & l2 = r2 & l1 = r1 & l0 < r0 ]
            "orr.16b   v1, v3, v1        \n\t"

            // b1: [ l3 < r3 & l3 < r3 |
            //       l3 = r3 & l2 = r2 & l1 < r1 & l1 < r1 |
            //       l3 = r3 & l2 < r2 |
            //       lr = r3 & l2 = r2 & l1 = r1 & l0 < r0 ]
            "umaxv.16b b1, v1            \n\t"
            "umov      %w0, v1.b[0]      \n\t"
            "cmp       %w0, 0xFF         \n\t"
            "cset      %w0, eq"
            : "=r" (result)
            : "m" (*lhs->value), "m" (*rhs->value)
            : "v0", "v1", "v2", "v3", "v4", "cc");
    return result;
}

Conspicuous answered 20/4, 2015 at 8:34 Comment(2)

How is UInt256 used elsewhere, i.e. are the values more likely to be in SIMD registers, General-purpose registers, or memory beforehand? I'd imagine cmp and 3 ccmps might have less overhead than a bunch of SIMD register juggling, but having to spill a bunch of GP registers and load the values is liable to tip the balance the other way. I suspect the overall efficiency question is best answered by benchmarking, as it's the kind of thing that will also be impacted by the rest of your code (register pressure, cache usage, etc.) – Homey 20/4, 2015 at 9:22

They are in memory before, and loaded with "ld1". – Conspicuous 20/4, 2015 at 9:26

Benchmark with XCTest measureMetrics with a Swift-based test runner. Two 256-bit Ints are allocated. Then, an operation is repeated 100 million times on the same two ints, measurement is stopped, and each limb of the two ints is assigned a new random value with arc4random. A second run is performed with Instruments attached, and the CPU time distribution is noted for each instruction as a comment right next to it.

Equality (==)

For equality, SIMD seems to lose when the result is transferred from the SIMD registers back to the ARM register. SIMD is probably only worth it when the result is used in further SIMD calculations, or if longer ints than 256-bit are used (ld1 seems to be faster than ldp).

SIMD

bool result;
__asm__("ld1.2d    { v0, v1 }, %1    \n\t"    //  5.1%
        "ld1.2d    { v2, v3 }, %2    \n\t"    // 26.4%
        "cmeq.2d   v0, v0, v2        \n\t"
        "cmeq.2d   v1, v1, v3        \n\t"
        "uminp.16b v0, v0, v1        \n\t"    //  4.0%
        "uminv.16b b0, v0            \n\t"    // 26.7%
        "umov      %w0, v0.b[0]      \n\t"    // 32.9%
        "cmp       %w0, 0xFF         \n\t"    //  0.0%
        "cset      %w0, eq               "
        : "=r" (result)
        : "m" (*lhs->value), "m" (*rhs->value)
        : "v0", "v1", "v2", "v3", "cc");
return result;                                //  4.9% ("ret")

measured [Time, seconds] average: 11.558, relative standard deviation: 0.065%, values: [11.572626, 11.560558, 11.549322, 11.568718, 11.558530, 11.550490, 11.557086, 11.551803, 11.557529, 11.549782]

Standard

The winner here. The ccmp instruction really shines here :-) It is clear, though, that the problem is memory bound.

bool result;
__asm__("ldp       x8, x9, %1          \n\t"  // 33.4%
        "ldp       x10, x11, %2        \n\t"
        "cmp       x8, x10             \n\t"
        "ccmp      x9, x11, 0, eq      \n\t"
        "ldp       x8, x9, %1, 16      \n\t"  // 34.1%
        "ldp       x10, x11, %2, 16    \n\t"
        "ccmp      x8, x10, 0, eq      \n\t"  // 32.6%
        "ccmp      x9, x11, 0, eq      \n\t"
        "cset      %w0, eq             \n\t"
        : "=r" (result)
        : "m" (*lhs->value), "m" (*rhs->value)
        : "x8", "x9", "x10", "x11", "cc");
return result;

measured [Time, seconds] average: 11.146, relative standard deviation: 0.034%, values: [11.149754, 11.142854, 11.146840, 11.149392, 11.141254, 11.148708, 11.142293, 11.150491, 11.139593, 11.145873]

LLVM fails to detect that "ccmp" is a good instruction to use here, and is slower than the asm version above.

return
    lhs->value[0] == rhs->value[0] &
    lhs->value[1] == rhs->value[1] &
    lhs->value[2] == rhs->value[2] &
    lhs->value[3] == rhs->value[3];

Compiled to

ldp    x8, x9, [x0]                           // 24.1%
ldp    x10, x11, [x1]                         //  0.1%
cmp    x8, x10                                //  0.4%
cset   w8, eq                                 //  1.0%
cmp    x9, x11                                // 23.7%
cset   w9, eq
and    w8, w8, w9                             //  0.1%
ldp    x9, x10, [x0, #16]
ldp    x11, x12, [x1, #16]                    // 24.8%
cmp    x9, x11
cset   w9, eq                                 //  0.2%
and    w8, w8, w9
cmp    x10, x12                               //  0.3%
cset   w9, eq                                 // 25.2%
and    w0, w8, w9
ret                                           //  0.1%

measured [Time, seconds] average: 11.531, relative standard deviation: 0.040%, values: [11.525511, 11.529820, 11.541940, 11.531776, 11.533287, 11.526628, 11.531392, 11.526037, 11.531784, 11.533786]

Less Than (<)

(to be determined - will update later)

Conspicuous answered 23/4, 2015 at 11:53 Comment(3)

Nice work! In the ccmp variant, you'd potentially save some precious memory bandwidth by loading and testing the most significant pairs first, and skipping over the second set of loads entirely if that comparison fails. – Homey 25/4, 2015 at 11:49

I want constant run time and no data-dependent memory access. – Conspicuous 25/4, 2015 at 14:3

A little golfing for the SIMD equality test: uminv.4s is a bit faster than 16b, and then we can simply mov %w0, v0.s[0] or fmov %w0, s0. Also, since the result at this point is guaranteed to be either 0 or 0xffffffff, we can replace the cmp/cset with simply and %w0, %w0, #1 if we still want to produce a boolean in a register. – Aposiopesis 28/3, 2022 at 3:20

Since the simple scalar ccmp implementation was the winner for the equality test, here's an equally simple scalar solution for less-than.

The approach for less-than above was based on lexicographic comparison, starting with the most significant limbs. I didn't see a good way to do that with ccmp. The problem is that in a branchless lexicographic compare, there are three possible states at each step: a previous limb compared less, a previous limb compared greater, all previous limbs compared equal. ccmp can't really keep track of three states. We could do it if ccmp's behavior when its condition is false were "do nothing", as with ARM32 conditionals, instead of "load flags with immediate".

So instead, here's an even more basic approach: do a multi-precision subtract, and check the carry flag at the end.

inline bool lt256(const uint64_t *a, const uint64_t *b) {
    const int limbs = 4;  // number of 64-bit chunks in a full number
    uint64_t a0,a1,b0,b1; // for scratch registers
    bool ret;
    asm(R"(
        ldp %[a0], %[a1], [%[a]]
        ldp %[b0], %[b1], [%[b]]
        subs xzr, %[a0], %[b0]
        sbcs xzr, %[a1], %[b1]
        ldp %[a0], %[a1], [%[a], #16]
        ldp %[b0], %[b1], [%[b], #16]
        sbcs xzr, %[a0], %[b0]
        sbcs xzr, %[a1], %[b1]
        )"
        : "=@cclo" (ret),
          [a0] "=&r" (a0), [a1] "=&r" (a1), [b0] "=&r" (b0), [b1] "=&r" (b1)
        : [a] "r" (a), [b] "r" (b),
          "m" (*(const uint64_t (*)[limbs])a),
          "m" (*(const uint64_t (*)[limbs])b)
        );
    return ret;
}

I chose to use a flag output operand for the result instead of explicitly writing a boolean to a register. (This feature didn't exist in a stable GCC release when the previous answer was written, and is still not supported by Clang for AArch64.) This could save an instruction, and a register, if the result will be branched on.

I also chose to do the loads within the asm. We could also use eight input operands and have the compiler do the loads, but then we would need eight registers instead of 4-6 as it stands. Worth trying if there is reason to think the limbs are already in general-purpose registers. Alternatively, you could reduce the register usage further by loading one pair of limbs at a time, instead of two, but at the cost of larger and probably slower code.

The zero register provides a convenient way to discard the numerical results of the subtractions, since we don't need them.

Performance should be pretty similar to the ccmp-based eq256, as both of them are essentially four subtracts in a dependency chain. Taking Cortex-A72 as an example, cmp/ccmp and subs/sbcs are all single-uop instructions that can execute on either of two integer pipelines. They don't say whether the flags are renamed, but if they are then you should be able to write two of these chains in series and have them execute in parallel.

Aposiopesis answered 28/3, 2022 at 3:8 Comment(3)

Yes, GCC6 was first released as a stable version in August 2016, so "=@ccxx" flag/condition-code output operands weren't available before then. And it took a while longer for Clang to support them, IIRC, but current Clang has _ExtInt(256) so inline asm isn't even needed if using that. BTW, the first subs could be a cmp, if that isn't already an alias for sub into the zero register. – Bevel 28/3, 2022 at 3:16

Apparently clang still doesn't support "=@cclo" for ARM64 in the latest trunk version. godbolt.org/z/1Yz1dMrbf . Clang 11 does support _ExtInt(256) for this, but latest trunk version has deprecated _ExtInt for _BitInt, and doesn't support sizes > 128. (Perhaps because of code-bloat with very large numbers, or because of unimplemented division?) – Bevel 28/3, 2022 at 3:26

@PeterCordes: Interesting notes. Too bad about _ExtInt(256) going away, though as you probably noticed, its < was worse than this naive approach. By the way, yes, cmp is an alias for subs xzr; I chose the latter here just to make it more symmetric with the sbcs xzr which doesn't have an alias. – Aposiopesis 28/3, 2022 at 3:48

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Recommended topics

Hot tags