I would like to efficiently compare two little-endian 256-bit values with A64 NEON instructions (inline asm).
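Throughout, assume a UInt256 that stores the value as four little-endian 64-bit limbs in an array; this exact layout is my assumption, and the snippets below only rely on value being uint64_t[4]:

#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint64_t value[4]; // value[0] is the least significant limb
} UInt256;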
Equality (=)
For equality, I already have a solution:
bool eq256(const UInt256 *lhs, const UInt256 *rhs) {
    bool result;
First, load the two values into SIMD registers.
__asm__("ld1.2d { v0, v1 }, %1 \n\t"
"ld1.2d { v2, v3 }, %2 \n\t"
Compare each 64-bit limb of the values with each other. This results in -1 (all bits set) for those limbs that are equal, and 0 (all bits clear) if a bit differs.
"cmeq.2d v0, v0, v2 \n\t"
"cmeq.2d v1, v1, v3 \n\t"
Reduce the result from two vectors to one vector: the pairwise unsigned minimum keeps a "0 (all bits clear)" byte if either input contains one.
            "uminp.16b v0, v0, v1 \n\t"
Reduce the result from one vector to one byte: the unsigned minimum across all lanes keeps a zero byte if there is any.
            "uminv.16b b0, v0 \n\t"
Move it to a general-purpose register and compare against 0xFF; the values are equal iff the minimum byte is still 0xFF. This is the result.
            "umov %w0, v0.b[0] \n\t"
            "cmp %w0, 0xFF \n\t"
            "cset %w0, eq"
            : "=r" (result)
            : "m" (*lhs->value), "m" (*rhs->value)
            : "v0", "v1", "v2", "v3", "cc");
    return result;
}
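For what it's worth, the same routine can also be written with NEON intrinsics, which keeps the algorithm identical but lets the compiler allocate registers and schedule the loads (a sketch; the function name is mine):

#include <arm_neon.h>

bool eq256_intrinsics(const UInt256 *lhs, const UInt256 *rhs) {
    uint64x2_t l01 = vld1q_u64(&lhs->value[0]); // (l0, l1)
    uint64x2_t l23 = vld1q_u64(&lhs->value[2]); // (l2, l3)
    uint64x2_t r01 = vld1q_u64(&rhs->value[0]);
    uint64x2_t r23 = vld1q_u64(&rhs->value[2]);
    // Lanes become all-ones where limbs match, all-zeros where they differ.
    uint64x2_t eq01 = vceqq_u64(l01, r01);
    uint64x2_t eq23 = vceqq_u64(l23, r23);
    // AND instead of uminp: a cleared lane survives either way.
    uint8x16_t all = vandq_u8(vreinterpretq_u8_u64(eq01),
                              vreinterpretq_u8_u64(eq23));
    // The byte-wise minimum is 0xFF iff every lane was all-ones.
    return vminvq_u8(all) == 0xFF;
}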
Questions
Is this more efficient than doing the 4 comparisons with plain old ARM registers?
- E.g., is there a source that quotes timings for the different operations? I'm doing this on an iPhone 5s.
Is there a way to optimize this even further? I think that I waste many cycles just to reduce the whole vector to a single scalar boolean.
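For comparison, the "plain old ARM registers" variant doesn't even need asm; the compiler turns this portable C into four XORs, three ORs, and one compare against zero (a sketch; the name eq256_scalar is mine):

bool eq256_scalar(const UInt256 *lhs, const UInt256 *rhs) {
    // XOR matching limbs, then OR the differences together:
    // the values are equal iff no difference bit survives.
    uint64_t diff = (lhs->value[0] ^ rhs->value[0])
                  | (lhs->value[1] ^ rhs->value[1])
                  | (lhs->value[2] ^ rhs->value[2])
                  | (lhs->value[3] ^ rhs->value[3]);
    return diff == 0;
}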
Less-than comparison (<)
Let's represent the two ints as tuples of 64-bit limbs (little-endian):
- lhs = (l0, l1, l2, l3)
- rhs = (r0, r1, r2, r3)
Then, lhs < rhs, if this evaluates to true:
(l3 < r3) & 1 & 1 & 1 |
(l3 = r3) & (l2 < r2) & 1 & 1 |
(l3 = r3) & (l2 = r2) & (l1 < r1) & 1 |
(l3 = r3) & (l2 = r2) & (l1 = r1) & (l0 < r0)
SIMD instructions can now be used to evaluate multiple operands at a time. Assuming (l0, l1), (l2, l3), (r0, r1), (r2, r3) is the way the two 256-bit numbers are stored, we can easily get all of the required values (every result except (l0 = r0) is used):
- cmlo.2d => (l0 < r0), (l1 < r1)
- cmlo.2d => (l2 < r2), (l3 < r3)
- cmeq.2d => (l0 = r0), (l1 = r1)
- cmeq.2d => (l2 = r2), (l3 = r3)
Questions
- With these values in four SIMD registers, I now wonder what the best strategy is to apply the & and | operators, and then reduce the result to a single boolean.
Update
I just put together a working implementation for "less than".
Basically, I replaced the 1s above with a duplicated condition, because A & A == A & 1.
Then I lay out the three 2x2 squares of my matrix and bitwise-AND them. Now I reduce with bitwise ORs: first from two vectors to one vector, then to one byte, then copy to a general-purpose register and test against 0xFF. Same pattern as for equality above.
The questions above are still valid. I'm not sure the code is optimal yet, and I wonder whether I missed some general SIMD pattern for doing this kind of thing more efficiently. Also: is NEON worth it in such cases, when the input operands come from memory?
bool lt256(const UInt256 *lhs, const UInt256 *rhs) {
    bool result;
    __asm__(// (l3 < r3) & (l3 < r3) |
            // (l3 = r3) & (l2 < r2) |
            // (l3 = r3) & (l2 = r2) & (l1 < r1) & (l1 < r1) |
            // (l3 = r3) & (l2 = r2) & (l1 = r1) & (l0 < r0)
            "ld1.2d { v0, v1 }, %1 \n\t"
            "ld1.2d { v2, v3 }, %2 \n\t"
            // After the next five instructions the registers hold
            // (low lane first):
            // v0: [ l3 = r3 ] [ l2 = r2 ]
            // v1: [ l0 < r0 ] [ l1 < r1 ]
            // v2: [ l0 = r0 ] [ l1 = r1 ]
            // v3: [ l2 < r2 ] [ l3 < r3 ]
            // v4: [ l2 = r2 ] [ l3 = r3 ]
"cmeq.2d v4, v1, v3 \n\t"
"cmlo.2d v3, v1, v3 \n\t"
"cmlo.2d v1, v0, v2 \n\t"
"cmeq.2d v2, v0, v2 \n\t"
"ext.16b v0, v4, v4, 8 \n\t"
// v2: [ l1 < r1 ] [ l1 = r1 ]
// v1: [ l1 < r1 ] [ l0 < r0 ]
"trn2.2d v2, v1, v2 \n\t"
"ext.16b v1, v1, v1, 8 \n\t"
// v1: [ l1 < r1 & l1 < r1 ] [ l1 = r1 & l0 < r0 ]
"and.16b v1, v2, v1 \n\t"
// v2: [ l3 < r3 ] [ l3 = r3 ]
// v3: [ l3 < r3 ] [ l2 < r2 ]
"ext.16b v2, v3, v0, 8 \n\t"
"ext.16b v3, v3, v3, 8 \n\t"
// v3: [ l3 < r3 & l3 < r3 ] [ l3 = r3 & l2 < r2 ]
"and.16b v3, v2, v3 \n\t"
// v2: [ l3 = r3 ] [ l3 = r3 ]
// v4: [ l2 = r2 ] [ l2 = r2 ]
"ext.16b v2, v4, v0, 8 \n\t"
"ext.16b v4, v0, v4, 8 \n\t"
// v2: [ l3 = r3 & l2 = r2 ] [ l3 = r3 & l2 = r2 ]
"and.16b v2, v2, v4 \n\t"
            // v1: [ l3 = r3 & l2 = r2 & l1 < r1 & l1 < r1 ]
            //     [ l3 = r3 & l2 = r2 & l1 = r1 & l0 < r0 ]
            "and.16b v1, v2, v1 \n\t"
            // v1: [ l3 < r3 & l3 < r3 |
            //       l3 = r3 & l2 = r2 & l1 < r1 & l1 < r1 ]
            //     [ l3 = r3 & l2 < r2 |
            //       l3 = r3 & l2 = r2 & l1 = r1 & l0 < r0 ]
            "orr.16b v1, v3, v1 \n\t"
            // b1: [ l3 < r3 & l3 < r3 |
            //       l3 = r3 & l2 = r2 & l1 < r1 & l1 < r1 |
            //       l3 = r3 & l2 < r2 |
            //       l3 = r3 & l2 = r2 & l1 = r1 & l0 < r0 ]
"umaxv.16b b1, v1 \n\t"
"umov %w0, v1.b[0] \n\t"
"cmp %w0, 0xFF \n\t"
"cset %w0, eq"
: "=r" (result)
: "m" (*lhs->value), "m" (*rhs->value)
: "v0", "v1", "v2", "v3", "v4", "cc");
return result;
}
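A quick sanity check against the scalar reference from earlier (a sketch; it assumes eq256, lt256, and lt256_scalar as defined above):

#include <stdio.h>

int main(void) {
    UInt256 a = {{ 1, 0, 0, 5 }}; // a = 5 * 2^192 + 1 (limbs little-endian)
    UInt256 b = {{ 0, 0, 0, 6 }}; // b = 6 * 2^192
    printf("lt256(a, b) = %d (expected %d)\n",
           lt256(&a, &b), lt256_scalar(&a, &b));
    printf("eq256(a, a) = %d (expected 1)\n", eq256(&a, &a));
    return 0;
}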
A lot depends on how UInt256 is used elsewhere, i.e. are the values more likely to be in SIMD registers, general-purpose registers, or memory beforehand? I'd imagine a cmp and 3 ccmps might have less overhead than a bunch of SIMD register juggling, but having to spill a bunch of GP registers and load the values is liable to tip the balance the other way. I suspect the overall efficiency question is best answered by benchmarking, as it's the kind of thing that will also be impacted by the rest of your code (register pressure, cache usage, etc.) – Homey
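For what it's worth, here is a sketch of the GPR route that comment describes: cmp plus 3 ccmps for equality, and a subs/sbcs borrow chain for less-than (function names and register choices are my own):

bool eq256_gpr(const UInt256 *lhs, const UInt256 *rhs) {
    bool result;
    __asm__("ldp x4, x5, [%1] \n\t"
            "ldp x6, x7, [%1, 16] \n\t"
            "ldp x8, x9, [%2] \n\t"
            "ldp x10, x11, [%2, 16] \n\t"
            "cmp x4, x8 \n\t"          // compare limb 0
            "ccmp x5, x9, 0, eq \n\t"  // if still equal, compare limb 1
            "ccmp x6, x10, 0, eq \n\t" // ... limb 2
            "ccmp x7, x11, 0, eq \n\t" // ... limb 3
            "cset %w0, eq"
            : "=r" (result)
            : "r" (lhs->value), "r" (rhs->value)
            : "x4", "x5", "x6", "x7", "x8", "x9", "x10", "x11", "cc", "memory");
    return result;
}

bool lt256_gpr(const UInt256 *lhs, const UInt256 *rhs) {
    bool result;
    __asm__("ldp x4, x5, [%1] \n\t"
            "ldp x6, x7, [%1, 16] \n\t"
            "ldp x8, x9, [%2] \n\t"
            "ldp x10, x11, [%2, 16] \n\t"
            "cmp x4, x8 \n\t"       // lhs - rhs, limb 0
            "sbcs xzr, x5, x9 \n\t" // propagate the borrow upward
            "sbcs xzr, x6, x10 \n\t"
            "sbcs xzr, x7, x11 \n\t"
            "cset %w0, lo"          // borrow out means lhs < rhs
            : "=r" (result)
            : "r" (lhs->value), "r" (rhs->value)
            : "x4", "x5", "x6", "x7", "x8", "x9", "x10", "x11", "cc", "memory");
    return result;
}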