The simple test,
unsigned f(unsigned long long x) {
    return __builtin_popcountll(x);
}
when compiled with clang --target=arm-none-linux-eabi -mfpu=neon -mfloat-abi=softfp -mcpu=cortex-a15 -Os,⁎ results in the compiler emitting the numerous instructions required to implement the classic popcount for the low and high words of x in parallel and then add the results.
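For reference, the "classic popcount" here is presumably the usual branch-free bit-twiddling sequence. The following is a minimal sketch in portable C, assuming that this is roughly what the builtin gets expanded into for each 32-bit half; the helper names popcount32_swar and popcount64_split are made up for illustration and do not come from any library or from the compiler output itself.

```c
#include <stdint.h>

/* Classic SWAR population count of one 32-bit word (illustrative helper). */
static uint32_t popcount32_swar(uint32_t v) {
    v = v - ((v >> 1) & 0x55555555u);                 /* sums of bit pairs   */
    v = (v & 0x33333333u) + ((v >> 2) & 0x33333333u); /* sums per nibble     */
    v = (v + (v >> 4)) & 0x0F0F0F0Fu;                 /* sums per byte       */
    return (v * 0x01010101u) >> 24;                   /* add the four bytes  */
}

/* 64-bit count as two independent 32-bit counts, then one final add
   (illustrative helper; assumed to mirror the low/high word split). */
static unsigned popcount64_split(unsigned long long x) {
    uint32_t lo = (uint32_t)x;
    uint32_t hi = (uint32_t)(x >> 32);
    return popcount32_swar(lo) + popcount32_swar(hi);
}
```

Each half needs about a dozen shift, mask, and add operations, which is why the expansion looks so long next to a handful of NEON instructions.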
It seems to me from skimming the architecture manuals that NEON code similar to that generated for
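#include <arm_neon.h>
unsigned f(unsigned long long x) {
    /* Count the set bits in each of the eight bytes. */
    uint8x8_t v = vcnt_u8(vcreate_u8(x));
    /* Widening pairwise adds: 8x8 -> 4x16 -> 2x32 -> 1x64. */
    return vget_lane_u64(vpaddl_u32(vpaddl_u16(vpaddl_u8(v))), 0);
}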
should be beneficial at least in terms of size, even if it is not necessarily a performance improvement.
Why doesn’t Clang† do that? Am I just giving it the wrong options? Are the ARM-to-NEON-to-ARM transitions so spectacularly slow, even on the A15, that it wouldn’t be worth it? (This is what a comment on a related question seems to suggest, but only very briefly.) Is Clang’s codegen for AArch32 lacking in care and attention, seeing as almost every modern mobile device uses AArch64? (That seems far-fetched, but GCC, for example, is known to occasionally have bad codegen on less prominent architectures such as PowerPC or MIPS.)
⁎ The Clang options could be wrong or redundant; adjust as necessary.
† GCC doesn’t seem to do that in my experiments either, just emitting a call to __popcountdi2, but that suggests I might simply be calling it wrong.
-O3 isn't a magic option. – Cash