ARM Neon: Store n-th position(s) of non-zero byte(s) in an 8-byte vector

I want to convert a Neon 64-bit vector to get the n-th position(s) of the non-zero (i.e. 0xFF) 8-bit value(s), and then fill the rest of the vector with zeros. Here are some examples:

    0  1  2  3  4  5  6  7

d0: 00 FF 00 FF 00 00 00 FF
d1: 1  3  7  0  0  0  0  0

d0: 00 FF FF FF 00 00 FF 00
d1: 1  2  3  6  0  0  0  0

d0: FF FF FF FF FF FF FF FF
d1: 0  1  2  3  4  5  6  7

d0: FF 00 00 00 00 00 00 00
d1: 0  0  0  0  0  0  0  0

d0: 00 00 00 00 00 00 00 00
d1: 0  0  0  0  0  0  0  0

I have the feeling that it's probably one or two bit-shift Neon instructions with another "good" vector. How can I do that?
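For reference, this is the transform I'm after in plain scalar C (names are just for illustration):

#include <stdint.h>

// Collect the positions of the 0xFF bytes at the start of the output, zero-fill the rest.
void pack_positions(const uint8_t in[8], uint8_t out[8]) {
    int n = 0;
    for (int i = 0; i < 8; i++)
        if (in[i] == 0xFF)
            out[n++] = (uint8_t)i;
    while (n < 8)
        out[n++] = 0;
}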

Dispersant answered 15/9, 2016 at 8:34 Comment(1)
So you want to turn a boolean 0/-1 vector into a left-packed vector of indices of the non-0 elements, it looks like. – Socialite

This turns out to be not at all simple.

The naïve efficient approach starts with trivially getting the indices (just load a static vector of 0 1 2 3 4 5 6 7 and vand it with the bitmask). However, in order to then collect them at one end of the output vector - in different lanes to the input lanes they represent - you then need some arbitrary permutation operation. There's only one instruction capable of arbitrarily permuting vectors, vtbl (or vtbx, which is essentially the same thing). However, vtbl takes a vector of source indices in destination order, which turns out to be the exact same thing you're trying to produce. Thus in order to produce your final result, you need to use your final result, therefore the naïve efficient solution is impossible; QED.
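To make the first half of that concrete, the index-gathering step really is trivial in intrinsics (a sketch, assuming the 0x00/0xFF compare result is in a uint8x8_t called mask):

uint8x8_t idx  = vcreate_u8(0x0706050403020100ull); // static vector 0 1 2 3 4 5 6 7
uint8x8_t kept = vand_u8(idx, mask);                // lane indices, but still in their original lanes
// Left-packing 'kept' would now need vtbl1_u8(kept, perm), where 'perm' lists the source
// lanes in destination order - i.e. exactly the packed index vector we're trying to produce.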

The fundamental problem is that what you're effectively doing is sorting a vector, which is inherently not a parallel SIMD operation. NEON is a parallel SIMD instruction set designed for media processing, and is really not cut out for any of the data-dependent/horizontal/scatter-gather operations of more general vector processing.

To prove the point, I did manage to do this in pure NEON, without any scalar code at all, and it's horrible; the best "one or two bit-shift NEON instructions" I could come up with is some conditional-select-based rotating bitmask accumulation trickery. If it's not clear, I'd suggest stepping through in a debugger or simulator to follow what it does (example):

// d0 contains the input vector
vmov.u8 d1, #0          // d1 = output/index accumulator
vmov.u8 d2, #0          // d2 = constant zero (low half for vext)
vmvn.u8 d3, #0          // d3 = running mask, initially all 0xFF
vdup.u8 d4, d0[0]       // broadcast input lane 0 (all-0x00 or all-0xFF)
vext.u8 d5, d2, d3, #7  // d3 moved up one lane, with 0 shifted in at lane 0
vbit.u8 d3, d5, d4      // if the input lane was 0xFF, advance the mask
vsub.u8 d1, d1, d3      // add 1 to every lane the mask still covers
vdup.u8 d4, d0[1]       // the same pattern repeats for input lanes 1-7
vext.u8 d5, d2, d3, #7
vbit.u8 d3, d5, d4
vsub.u8 d1, d1, d3
vdup.u8 d4, d0[2]
vext.u8 d5, d2, d3, #7
vbit.u8 d3, d5, d4
vsub.u8 d1, d1, d3
vdup.u8 d4, d0[3]
vext.u8 d5, d2, d3, #7
vbit.u8 d3, d5, d4
vsub.u8 d1, d1, d3
vdup.u8 d4, d0[4]
vext.u8 d5, d2, d3, #7
vbit.u8 d3, d5, d4
vsub.u8 d1, d1, d3
vdup.u8 d4, d0[5]
vext.u8 d5, d2, d3, #7
vbit.u8 d3, d5, d4
vsub.u8 d1, d1, d3
vdup.u8 d4, d0[6]
vext.u8 d5, d2, d3, #7
vbit.u8 d3, d5, d4
vsub.u8 d1, d1, d3
vdup.u8 d4, d0[7]
vext.u8 d5, d2, d3, #7
vbit.u8 d3, d5, d4
vbic.u8 d1, d1, d3      // finally, zero the lanes beyond the number of 0xFF input bytes
// d1 contains the output vector

Cheating and using a loop (which necessitates rotating d0 in the opposite direction such that we can access each original lane through d0[0]) makes it smaller, but not really any less awful:

vmov.u8 d1, #0
vmov.u8 d2, #0
vmvn.u8 d3, #0
mov r0, #8
1:
vdup.u8 d4, d0[0]
vext.u8 d5, d2, d3, #7
vbit.u8 d3, d5, d4
subs r0, r0, #1
vext.u8 d0, d0, d0, #1  // rotate so the next input byte becomes d0[0]
vsub.u8 d1, d1, d3
bne 1b
vbic.u8 d1, d1, d3
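For anyone who'd rather experiment from C, a rough NEON-intrinsics transcription of the loop variant might look like the following (an untested sketch with placeholder names):

#include <arm_neon.h>

uint8x8_t pack_indices(uint8x8_t in) {              // 'in' holds 0x00/0xFF bytes
    uint8x8_t out  = vdup_n_u8(0);                  // index accumulator (d1)
    uint8x8_t zero = vdup_n_u8(0);                  // constant zero (d2)
    uint8x8_t mask = vdup_n_u8(0xff);               // running mask (d3)
    for (int i = 0; i < 8; i++) {
        uint8x8_t sel     = vdup_lane_u8(in, 0);    // broadcast the current lane
        uint8x8_t shifted = vext_u8(zero, mask, 7); // mask moved up one lane, 0 in at the bottom
        mask = vbsl_u8(sel, shifted, mask);         // advance the mask if this byte was 0xFF
        in   = vext_u8(in, in, 1);                  // rotate so the next byte becomes lane 0
        out  = vsub_u8(out, mask);                  // +1 in every lane the mask still covers
    }
    return vbic_u8(out, mask);                      // zero the lanes beyond the number of 0xFF bytes
}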

Ideally, if it's at all possible to rework other parts of the algorithm to avoid needing non-constant permutations of vectors, do that instead.

Hegumen answered 16/9, 2016 at 1:21 Comment(7)
This problem reminds me of left-packing based on a compare result. If there are any existing techniques to generate a left-packing shuffle mask based on a compare result, you're all set. My AVX2 + BMI2 answer on that Q uses x86's pext bitfield-extract instruction with an integer bitmask (from MOVMSKPS: one bit from each vector element, typically useful on compare results). – Socialite
Without PEXT (which I assume ARM has no equivalent for), there are other SIMD left-packing methods that might be usable. Oh, but reading more carefully, I think those slides just load shuffle masks from a LUT. With 8 bytes per 64-bit lane, that's a 256-entry LUT lookup for each 64 bits of input (assuming ARM can efficiently do a vector compare / test and turn the compare result into a bitmask for use as an integer array index). – Socialite
@PeterCordes - "If there are any existing techniques to generate a left-packing shuffle mask based on a compare result, you're all set" - er, that's precisely what we're trying to achieve here ;) That's why the sole available arbitrary-shuffling instruction can't help: it simply makes the problem recursive. – Hegumen
The LUT approach would actually be even more painful, as the "turn the compare result into a bitmask" step would pretty much have to be a similar-looking chain of ANDs (or shifts), ORs and vector rotations, which wouldn't be much smaller than what I have here, and you'd still then have to bounce the data to a core register; at which point you've still got a ton of clunky SIMD instructions pretending to be a horizontal operation, but now you've also got a load of memory spent on a table and a big ol' pipeline bubble. – Hegumen
I should have just asked if ARM has an instruction like PMOVMSKB, and then the answer would just be a simple "no", apparently. xD I'm surprised; it's very useful and doesn't seem like it should be expensive in silicon. – Socialite
Right now my code extracts every d0[i] and compares it individually, so unfortunately this answer makes clear that it's not the right direction. I will rearrange the algorithm very differently. – Dispersant
For future reference: an SSE _mm_movemask_epi8 equivalent method for ARM NEON: mask to select the bits, then horizontal vpadd_u8 / u16 / u32 to get halves of the bitmap. Not great, and the trip to an integer register is a problem. – Socialite
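For completeness, the movemask-style emulation mentioned in the last comment looks roughly like this (an untested sketch using the pairwise-add-long variant; the function name is made up):

#include <arm_neon.h>

// One bit per 0x00/0xFF input byte, collected into an integer bitmap.
uint32_t neon_movemask_u8(uint8x8_t in) {
    const uint8x8_t bits = vcreate_u8(0x8040201008040201ull); // 0x01, 0x02, ... 0x80 per lane
    uint8x8_t masked = vand_u8(in, bits);                     // keep one distinct bit per set lane
    uint16x4_t sum16 = vpaddl_u8(masked);                     // horizontal add: 8 bytes down to
    uint32x2_t sum32 = vpaddl_u16(sum16);                     //   a single byte-sized bitmap
    uint64x1_t sum64 = vpaddl_u32(sum32);
    return (uint32_t)vget_lane_u64(sum64, 0);                 // the trip to an integer register
}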

I planned to implement this with a divide-and-conquer technique using variable shifts. In each step the input is treated as 'high' and 'low' parts, where the 'high' part needs to be shifted right (towards the least significant bits) first by 0 or 1 bytes, then by 0-2 bytes, then by 0-4 bytes.

The solution allows the 'q' variants of all instructions to be used, so that two independent compactions can be carried out in parallel.

///  F F 0 F F 0 0 F   mask
///  7 6 - 4 3 - - 0   mask & idx
///  7 6 - 4 - 3 - 0   pairwise shift.16 right by 1-count_left bytes
///    2   1   1   1   count_left + count_right = vpaddl[q]_s8 -> s16
///  - 7 6 4 - - 3 0   pairwise shift.32 right by 2-count_left bytes
///        3       2   count_left + count_right = vpaddl[q]_s16 -> s32
///  - - - 7 6 4 3 0   shift.64 right by 4 - count_left bytes
///                5   number of elements to write = vpaddl[q]_s32 -> s64

The first step can be done without actual shifts by

uint8x8_t step1(uint8x8_t mask) {
    // keep the lane index only where the mask byte is 0xFF
    uint8x8_t data = vand_u8(mask, vcreate_u8(0x0706050403020100ull));
    // the same indices with the two bytes of each pair swapped
    uint8x8_t swapped = vrev16_u8(data);
    // keep 'data' where the low byte of the pair is set, otherwise take the swapped pair
    uint16x4_t low_set = vtst_u16(vreinterpret_u16_u8(mask), vdup_n_u16(1));
    return vbsl_u8(vreinterpret_u8_u16(low_set), data, swapped);
}

The next step needs to isolate the top 16 and bottom 16 bits of each uint32_t lane, shift the top part by -16, -8 or 0 bits, then combine it with the isolated bottom bits.

uint8x8_t step2(uint8x8_t mask, uint16x4_t counts) {
    // the top 16 bits of each 32-bit lane, already shifted right by 16
    uint32x2_t top = vshr_n_u32(vreinterpret_u32_u8(mask), 16);
    // bit count of the bottom half of each 32-bit lane (0, 8 or 16)
    uint32x2_t cnt = vbic_u32(vreinterpret_u32_u16(counts), vcreate_u32(0xffff0000ffff0000ull));
    // the bottom 16 bits of each 32-bit lane
    uint32x2_t bot = vbic_u32(vreinterpret_u32_u8(mask), vcreate_u32(0xffff0000ffff0000ull));
    // net shift of the top half: right by 16 - count bits
    top = vshl_u32(top, vreinterpret_s32_u32(cnt));
    return vreinterpret_u8_u32(vorr_u32(top, bot));
}

The third step needs to shift 64-bit elements.

uint8x8_t step3(uint8x8_t mask, uint32x2_t counts) {
    // the top 32 bits of the 64-bit lane, already shifted right by 32
    uint64x1_t top = vshr_n_u64(vreinterpret_u64_u8(mask), 32);
    // bit count of the bottom 32-bit half (0, 8, 16, 24 or 32)
    uint64x1_t cnt = vbic_u64(vreinterpret_u64_u32(counts), vcreate_u64(0xffffffff00000000ull));
    // the bottom 32 bits
    uint64x1_t bot = vbic_u64(vreinterpret_u64_u8(mask), vcreate_u64(0xffffffff00000000ull));
    // net shift of the top half: right by 32 - count bits
    top = vshl_u64(top, vreinterpret_s64_u64(cnt));
    return vreinterpret_u8_u64(vorr_u64(top, bot));
}

Full solution:

auto cnt8 = vcnt_u8(mask);              // 8 per 0xFF byte, so all counts are in bits
mask = step1(mask);
auto counts16 = vpaddl_u8(cnt8);        // bit count per byte pair
mask = step2(mask, counts16);
auto counts32 = vpaddl_u16(counts16);   // bit count per 32-bit half
mask = step3(mask, counts32);
auto counts64 = vpaddl_u32(counts32);   // total bit count of the whole vector

The final 'counts64' should actually be calculated in advance, since the count has to be transferred to a general-purpose register, where it's used to advance a streaming write pointer (in bytes):

vst1_u8(ptr, mask); ptr += vget_lane_u64(counts64, 0) >> 3;  // counts are in bits, so >> 3 gives bytes
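Putting the pieces together, a compaction loop over a buffer of 0x00/0xFF bytes could look roughly like this (an untested sketch using the step functions above; note that a full 8-byte vector is stored each time, so the destination needs a little slack, and the indices produced are positions within each 8-byte group):

#include <arm_neon.h>
#include <stddef.h>
#include <stdint.h>

size_t compact_indices(const uint8_t *src, size_t n, uint8_t *dst) {
    uint8_t *out = dst;
    for (size_t i = 0; i + 8 <= n; i += 8) {
        uint8x8_t mask = vld1_u8(src + i);          // bytes are 0x00 or 0xFF
        uint8x8_t cnt8 = vcnt_u8(mask);             // 8 per set byte, i.e. counts in bits
        uint16x4_t counts16 = vpaddl_u8(cnt8);
        uint32x2_t counts32 = vpaddl_u16(counts16);
        uint64x1_t counts64 = vpaddl_u32(counts32);
        uint8x8_t packed = step3(step2(step1(mask), counts16), counts32);
        vst1_u8(out, packed);                       // streaming write; the tail is overwritten next time
        out += vget_lane_u64(counts64, 0) >> 3;     // bit count -> number of valid indices
    }
    return (size_t)(out - dst);                     // total number of indices written
}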

An asymptotically better version would try to obtain 64 (+64) bytes' worth of masks, compact those bytes into bit patterns (as with Intel's pmovmskb), then use 8 iterations to either find leading zeros + clear the leading one, or find + toggle the least significant set bit.

This can be achieved with 5 instructions per iteration:

// find and toggle the least significant set bit of each byte
auto prev = mask;
mask &= mask - 1u;
auto index0 = vclz[q]_u8(prev ^ mask);
// assuming a reversed bit ordering, so that clz yields the index directly

These 8 index vectors (index0..index7) would need to be transposed as an 8x8 matrix, then written sequentially; the total would be somewhere near 64 instructions per 64 + 64 bytes of input, or 0.5 instructions per output byte.

Eloiseelon answered 10/8, 2020 at 5:46 Comment(0)

You can do this by sorting the vector. That's a more complex operation than you'd expect for an operation like this, but I haven't come up with anything better.

Given a list of 00/ff bytes in d0, and the constants 0, 1, 2, ..., 7 in d1, you can make a sortable list of active columns using vorn.

vorn.u8 d0, d1, d0
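Or the same step in intrinsics (a sketch, assuming the 0x00/0xFF compare result is in a uint8x8_t called mask):

uint8x8_t idx      = vcreate_u8(0x0706050403020100ull); // 0, 1, ..., 7
uint8x8_t sortable = vorn_u8(idx, mask);                // idx | ~mask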

Now all the unwanted lanes of d0 have been replaced with 0xff and the rest with their lane index. From there you can sort that list to cluster all the unwanted lanes at the end.

To do that, you have to extend the list to 16 bytes:

vmov.u8 d1, #255

And then split them into odd/even vectors:

vuzp.u8 d0, d1

The sort operation consists of vmin/vmax between those vectors, then a stagger operation, then another vmin/vmax pair to sort between different pairs, so that values can bubble towards their proper place. Like so:

vmin.u8 d2, d0, d1
vmax.u8 d3, d0, d1
vsri.u64 d2, d2, #8   ; stagger the even lanes (discards d0[0])
vmax.u8 d4, d2, d3    ; dst reg would be d0, but we want d0[0] back...
vmin.u8 d1, d2, d3
vsli.u64 d0, d4, #8   ; revert the stagger, and restore d0[0]

This implements two stages of the full network, and the whole block has to be repeated four times (eight stages) to make it possible for something in d0[7] to bubble all the way down to d0[0] in the extreme case of the last byte being the only nonzero input, or for d0[0] to get to d0[7] if the first byte was the only zero input.

Once you're done sorting them, stitch the result back together:

vzip.u8 d0, d1

And because you wanted zeroes in the leftover lanes:

vmov.u8 d1, #0
vmax.s8 d0, d0, d1

And now d0 should contain the result.

If you check out Wikipedia's sorting network page you'll see that the theoretical minimum depth for eight lanes is only six stages (six pairs of vmin/vmax), so it may be possible to find a set of permutations (replacing my vsli and vsri operations) which implement a six-stage sorting network rather than the eight-stage insertion/selection/bubble sort that I've implemented. If such a network does exist, and is compatible with NEON's permute operations, then it'd certainly be worth finding, but I don't have time to look.

Also note that the sort is working on a total of 16 bytes, which is more than you need, and if you used q registers you could have it work on 32 bytes... so this is a long way from maximum throughput.

Oh, and even in this configuration I think you don't need the last stage of the sort. An exercise left for the reader.

Spermatozoon answered 19/9, 2016 at 0:24 Comment(4)
The Wiki page you linked isn't considering SIMD; it's more like 8 scalars in separate registers (or 8 vertical sorts in parallel, instead of one horizontal sort of an 8-element vector). Since shuffling is not free compared to min/max comparators, the minimum-depth network might be more expensive. – Socialite
Yes, a lot of the operations are wasted in SIMD, but the depth metric for networks still represents the shortest sequence possible with infinite-width SIMD. My code is already paying the permutation tax (vsli/vsri), so the hope is that there could be alternative operations that need fewer rounds. – Spermatozoon
Yeah, fair point: you're sorting one register horizontally, so that's always the worst case for SIMD. I was thinking of the case where you're sorting something like 16 floats in four 128b registers, so some of the comparators are vertical without shuffling, and moving data between vectors is an issue (limited by x86 SSE2 shufps semantics). ARM has arbitrary byte permutes (with a shuffle-mask in a register), right? (Like SSSE3 pshufb.) In that case you could implement any network you want (but it might take some extra vector constants). – Socialite
vtbl with a constant index table is a bit of a last resort because it can take a couple of cycles to complete. There's a healthy collection of static permutes available, but sorting networks are already hard to optimise even without those constraints. It's easier to just shrug and throw it all into a bitonic sort network beyond a certain point, or a dynamic sort somewhere beyond that. In fact, Wikipedia shows a six-stage eight-lane bitonic sorter which warrants investigation here... – Spermatozoon
