Why doesn’t Clang use vcnt for __builtin_popcountll on AArch32?
Asked Answered
A

1

3

The simple test,

unsigned f(unsigned long long x) {
    return __builtin_popcountll(x);
}

when compiled with clang --target=arm-none-linux-eabi -mfpu=neon -mfloat-abi=softfp -mcpu=cortex-a15 -Os, results in the compiler emitting the numerous instructions required to implement the classic popcount for the low and high words in x in parallel, then add the results.

It seems to me from skimming the architecture manuals that NEON code similar to that generated for

#include <arm_neon.h>
unsigned f(unsigned long long x) {
    uint8x8_t v = vcnt_u8(vcreate_u8(x));
    return vget_lane_u64(vpaddl_u32(vpaddl_u16(vpaddl_u8(v))), 0);
}

should have been beneficial in terms of size at least, even if not necessarily a performance improvement.

Why doesn’t Clang do that? Am I just giving it the wrong options? Are the ARM-to-NEON-to-ARM transitions so spectacularly slow, even on the A15, that it wouldn’t be worth it? (This is what a comment on a related question seems to suggest, but very briefly.) Is Clang codegen for AArch32 lacking for care and attention, seeing as almost every modern mobile device uses AArch64? (That seems farfetched, but GCC, for example, is known to occasionally have bad codegen on non-prominent architectures such as PowerPC or MIPS.)

Clang options could be wrong or redundant, adjust as necessary.
GCC doesn’t seem to do that in my experiments, either, just emitting a call to __popcountdi2, but that suggests I might simply be calling it wrong.
Audiovisual answered 17/11, 2021 at 16:57 Comment(7)
Your scope is way too small. Your function is like requesting a door-to-door ride for a single passenger, hence the compiler sends you a taxi. If you want to request a bus, you should tell them how many passengers first. Like the VFP, NEON is a separate unit that doesn't share the registers with the integer core. Many (or most) people overestimate compilers. -O3 isn't a magic option.Cash
I can't even imagine a compiler that could make proper use of NEON from your code. Cherrypicking some powerful NEON instructions here and there doesn't work. You should write the whole loop in intrinsics, from memory load to memory store.Cash
You may find it fun to benchmark both versions...Daimyo
* ARM (integer core) doesn't have any popcnt instruction while Intel does. Don't expect the target to have all the instructions equivalent to intrinsics of the toolchain.Cash
@NateEldredge Maybe, if I can find the relevant hardware in my “shoddy SBCs” bin and figure out how one measures cycles on ARM... But my original question wasn’t really “how do I make this particular function run as fast as possible?”, it was “how much suck am I in for on older archs if I bake the need for popcount into my data structures?”, and I was just surprised at the amount of suck I discovered by offhandedly calling the 32-bit ARM compiler. Actually optimizing for ARM will come much later, if ever.Audiovisual
@Jake'Alquimista'LEE The actual function I am looking at (so far) isn’t much larger, it’s just a fancy trie lookup (for Unicode properties) that doesn’t have much parallelism available: fetch-shift-popcount-add-fetch... While in reality it’s mostly going to work on strings with several lookups simultaneously, so I could perhaps do a bit of vectorization, I wouldn’t ordinarily consider it at this stage if NEON weren’t the only place I could find a popcount instruction. (And I’m not used to vector-to-integer latencies being quite so bad.)Audiovisual
@AlexShpilkin It's acutally not a latency, but a pipeline halt. Cortex A series has a 16 stage pipeline, and you know the rest of the story.Cash
S
3

Are the ARM-to-NEON-to-ARM transitions so spectacularly slow, even on the A15, that it wouldn’t be worth it?

Well you asked very right question.
Shortly, yes, it's. It's slow and in most cases moving data between NEON and ARM CPU and vise-versa is a big performance penalty that over performance gain from using 'fast' NEON instructions.

In details, NEON is a optional co-processor in ARMv7 based chips. ARM CPU and NEON work in parallel and I might say 'independently' from each other.
Interaction between CPU and NEON co-processor is organised via FIFO. CPU places neon instruction in FIFO and NEON co-processor fetch and execute it. Delay comes at the point when CPU and NEON needs sync between each other. Sync is accessing same memory region or transfering data between registers.

So whole process of using vcnt would be something like:

  • ARM CPU placing vcnt into NEON FIFO
  • Moving data from CPU register into NEON register
  • NEON fetching vcnt from FIFO
  • NEON executing vcnt
  • Moving data from NEON register to CPU register

And all that time CPU is simply waiting while NEON is doing it's work.
Due to NEON pipelining, delay might be up to 20 cycles (if I remember this number correctly).

Note: "up to 20 cycles" is arbitrary, since if ARM CPU has other instructions that does not depend on result of NEON computations, CPU could execute them.

Conclusion: as a rule of thumb that's not worthy, unless you are manually optimise code to reduce/eliminate that sync delays.

PS: That's true for ARMv7. ARMv8 has NEON extension as part of a core, so it's not relevant.

Smock answered 17/11, 2021 at 23:42 Comment(5)
Unfortunately, gcc or clang -O3 -mcpu=cortex-a53 doesn't use NEON for this when making 32-bit code either, though. godbolt.org/z/fMGed8x6E So while this might be the right decision for -mcpu=cortex-a15, it's probably not for A53. Especially not for 64-bit integers which have to repeat a 32-bit count sequence in 32-bit mode.Ergotism
ARM 2 NEON is fast, it actually takes one single cycle.Cash
NEON -> ARM though is not fast at all... Still for ARM -> NEON that's an additional cycle just 'for preparation', and you need to wait a whole bunch any waySmock
There's also the sequence of three vpaddl to sum up the per-byte popcounts, which are a dependency chain, so that's quite a few more cycles of latency. AArch64 can do a little better with addv and that may be another part of why the compiler uses SIMD for popcount on AArch64 but not AArch32.Daimyo
@Smock The first NEON->ARM transfer takes 15 cycles on Cortex-A8, but the following ones take 1 cycle each. You should avoid this at all costs, but sometimes you don't have any other option, then you do it clustered. On the other hand, ARM->Neon is a very common way loading 16/32bit literals to vectors instead of memory load. You just don't see it unless you read the disassembly.Cash

© 2022 - 2024 — McMap. All rights reserved.