This turns out to be not at all simple.
The naïve efficient approach starts with trivially getting the indices: just load a static vector of `{ 0, 1, 2, 3, 4, 5, 6, 7 }` and `vand` it with the bitmask. However, in order to then collect them at one end of the output vector - in different lanes from the input lanes they represent - you need some arbitrary permutation operation. There's only one instruction capable of arbitrarily permuting vectors: `vtbl` (or `vtbx`, which is essentially the same thing). However, `vtbl` takes a vector of source indices in destination order, which turns out to be the exact same thing you're trying to produce. Thus, in order to produce your final result, you need to use your final result, so the naïve efficient solution is impossible; QED.
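In scalar terms the dead end looks like this (function and array names are mine, purely for illustration); the trailing comment marks exactly where the circularity bites:

```c
#include <stdint.h>

/* The trivial first step: mask a constant index vector.
   mask lanes are 0x00 or 0xff, as a NEON compare produces. */
static void masked_indices(const uint8_t mask[8], uint8_t out[8])
{
    static const uint8_t idx[8] = {0, 1, 2, 3, 4, 5, 6, 7};
    for (int i = 0; i < 8; i++)   /* models: vand with {0..7} */
        out[i] = (uint8_t)(idx[i] & mask[i]);
    /* For mask lanes {00,ff,00,ff,00,00,00,00} out is now
       {0,1,0,3,0,0,0,0}. Collecting the 1 and 3 at the front would
       take vtbl with the index table {1,3,...} - which is the packed
       result we were trying to compute in the first place. */
}
```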
The fundamental problem is that what you're effectively doing is sorting a vector, which is inherently not a parallel SIMD operation. NEON is a parallel SIMD instruction set designed for media processing, and is really not cut out for any of the data-dependent/horizontal/scatter-gather operations of more general vector processing.
To prove the point, I did manage to do this in pure NEON, without any scalar code at all, and it's horrible; the best "one or two bit-shift NEON instructions" I could come up with is some conditional-select-based rotating-bitmask accumulation trickery. If it's not clear what it does, I'd suggest stepping through it in a debugger or simulator:
// d0 contains input vector
vmov.u8 d1, #0          // d1: output accumulator
vmov.u8 d2, #0          // d2: zeroes to shift in
vmvn.u8 d3, #0          // d3: rotating bitmask, initially all-ones

// lane 0:
vdup.u8 d4, d0[0]       // broadcast this lane's mask
vext.u8 d5, d2, d3, #7  // d3 shifted up one lane, zero shifted in
vbit.u8 d3, d5, d4      // advance the bitmask if this lane was set
vsub.u8 d1, d1, d3      // lanes still at 0xff gain 1 (0 - 0xff = +1 mod 256)

// lane 1:
vdup.u8 d4, d0[1]
vext.u8 d5, d2, d3, #7
vbit.u8 d3, d5, d4
vsub.u8 d1, d1, d3

// lane 2:
vdup.u8 d4, d0[2]
vext.u8 d5, d2, d3, #7
vbit.u8 d3, d5, d4
vsub.u8 d1, d1, d3

// lane 3:
vdup.u8 d4, d0[3]
vext.u8 d5, d2, d3, #7
vbit.u8 d3, d5, d4
vsub.u8 d1, d1, d3

// lane 4:
vdup.u8 d4, d0[4]
vext.u8 d5, d2, d3, #7
vbit.u8 d3, d5, d4
vsub.u8 d1, d1, d3

// lane 5:
vdup.u8 d4, d0[5]
vext.u8 d5, d2, d3, #7
vbit.u8 d3, d5, d4
vsub.u8 d1, d1, d3

// lane 6:
vdup.u8 d4, d0[6]
vext.u8 d5, d2, d3, #7
vbit.u8 d3, d5, d4
vsub.u8 d1, d1, d3

// lane 7:
vdup.u8 d4, d0[7]
vext.u8 d5, d2, d3, #7
vbit.u8 d3, d5, d4
vbic.u8 d1, d1, d3      // finally, clear the lanes past the last packed index

// d1 contains output vector
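To check one's understanding of what the trickery computes, here's a scalar C model of the sequence above (the function name is mine; these are not NEON intrinsics), treating each D register as an array of 8 bytes:

```c
#include <stdint.h>
#include <string.h>

/* Scalar model of the NEON sequence above; each dN is a uint8_t[8].
   Input lanes are 0x00 or 0xff, as produced by a NEON compare. */
static void left_pack_indices(const uint8_t d0[8], uint8_t d1[8])
{
    uint8_t d2[8] = {0}, d3[8], d4[8], d5[8];
    memset(d1, 0x00, 8);                      /* vmov.u8 d1, #0 */
    memset(d3, 0xff, 8);                      /* vmvn.u8 d3, #0 */
    for (int i = 0; i < 8; i++) {
        memset(d4, d0[i], 8);                 /* vdup.u8 d4, d0[i] */
        d5[0] = d2[7];                        /* vext.u8 d5, d2, d3, #7: */
        memcpy(d5 + 1, d3, 7);                /* d3 shifted up one lane   */
        for (int j = 0; j < 8; j++)           /* vbit.u8 d3, d5, d4 */
            d3[j] = (uint8_t)((d3[j] & ~d4[j]) | (d5[j] & d4[j]));
        if (i < 7)
            for (int j = 0; j < 8; j++)       /* vsub: 0xff lanes gain 1 */
                d1[j] = (uint8_t)(d1[j] - d3[j]);
        else
            for (int j = 0; j < 8; j++)       /* vbic: clear unused lanes */
                d1[j] = (uint8_t)(d1[j] & ~d3[j]);
    }
}
```

For an input with lanes 1 and 3 set this produces `{1, 3, 0, 0, 0, 0, 0, 0}`: each output lane j ends up holding the index of the (j+1)th set lane, with the remaining lanes zeroed.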
Cheating and using a loop (which necessitates rotating `d0` in the opposite direction so that we can access each original lane through `d0[0]`) makes it smaller, but not really any less awful:
vmov.u8 d1, #0          // d1: output accumulator
vmov.u8 d2, #0          // d2: zeroes to shift in
vmvn.u8 d3, #0          // d3: rotating bitmask, initially all-ones
mov r0, #8              // loop counter
1:
vdup.u8 d4, d0[0]
vext.u8 d5, d2, d3, #7
vbit.u8 d3, d5, d4
subs r0, r0, #1
vext.u8 d0, d0, d0, #1  // rotate d0 so the next lane becomes d0[0]
vsub.u8 d1, d1, d3
bne 1b
vbic.u8 d1, d1, d3      // clear the lanes past the last packed index
Ideally, if it's at all possible to rework other parts of the algorithm to avoid needing non-constant permutations of vectors, do that instead.
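If that isn't possible and the packed indices really are needed, the pragmatic alternative the above avoids on principle is simply dropping to scalar code for this one step - a minimal sketch (function name is mine) worth benchmarking against the all-NEON version:

```c
#include <stdint.h>

/* Hypothetical scalar fallback: pack the indices of set lanes to the
   front, zero-filling the rest; mask lanes are 0x00 or 0xff. */
static int pack_indices_scalar(const uint8_t mask[8], uint8_t out[8])
{
    int n = 0;
    for (int i = 0; i < 8; i++)
        if (mask[i])
            out[n++] = (uint8_t)i;
    for (int i = n; i < 8; i++)
        out[i] = 0;
    return n;             /* how many lanes were set */
}
```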