Fastest way to count number of 1s in a register, ARM assembly
A while back I had an interview question about bit manipulation at a well-known GPU company. I had very little background in assembly language (odd, despite being a PhD student in computer architecture), and as this narrative would indicate, I botched it. The question was simple:

"Write a fast code that will count the number of 1's in a 32-bit register."

Now I am in the process of studying ARM assembly, so naturally I revisited this problem and came up with the following code just by studying the ISA.

For you ARM experts out there, is this correct? Is there a faster way to do this? Being a beginner, I naturally think this is incomplete. Using a whole MOVS at "xx" just to shift feels redundant, but there is no standalone shift instruction in the ARM ISA; shifts go through the barrel shifter as part of a data-processing instruction...

R1 will contain the number of set bits at the end, while R2 is the register whose bits we want to count. Comments are enclosed in ()

    MOV   R1, #0                (initialize the count in R1 to zero)
xx: MOVS  R2, R2, LSR #1        (shift R2 right by 1; the bit shifted out lands in the carry flag)
    ADDCS R1, R1, #1            (add #1 to R1 if the carry flag is set)
    CMP   R2, #0                (update the status flags based on whether R2 == 0)
    BNE   xx                    (branch back to xx until R2 == 0)
Mystery answered 1/4, 2013 at 1:49 Comment(2)
What is fastest also depends on the number of bits you expect to be set. With very few 1's, Kernighan's bit counter will win as it runs one round of the loop for each bit set.Aeroplane
Your algorithm takes 4*32+2 instructions for all 1s. Dave Seal's algorithm takes six instructions plus a memory access; it will probably be as fast for all types of input unless memory is extremely slow. I doubt many people would come up with Dave Seal's solution. The reverse subtract is just weird to me.Enforce
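Kernighan's bit counter mentioned in the comment above, as a minimal C sketch (the function name is mine):

    #include <stdint.h>

    /* Runs one loop iteration per set bit: v & (v - 1) clears the
       lowest set bit of v. */
    static int popcount_kernighan(uint32_t v)
    {
        int count = 0;
        while (v) {
            v &= v - 1;   /* clear the lowest set bit */
            count++;
        }
        return count;
    }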
You could use a precomputed lookup table and reduce the number of iterations to 2 or 4 (one lookup per 16-bit or 8-bit chunk).

You could also use the logarithmic approach.

For more info see this Wikipedia article.
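A minimal C sketch of the byte-table variant (names are mine; the table costs 256 bytes, and a 32-bit word takes 4 lookups):

    #include <stdint.h>

    /* byte_counts[i] = number of 1 bits in i, filled once at startup. */
    static uint8_t byte_counts[256];

    static void init_byte_counts(void)
    {
        for (int i = 1; i < 256; i++)
            byte_counts[i] = byte_counts[i / 2] + (i & 1);
    }

    static int popcount_lut(uint32_t v)
    {
        return byte_counts[v & 0xff]
             + byte_counts[(v >> 8) & 0xff]
             + byte_counts[(v >> 16) & 0xff]
             + byte_counts[v >> 24];
    }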

Trinitarianism answered 1/4, 2013 at 2:1 Comment(2)
Looks like Hamming-weight is a pretty well-known problem. Thanks!Mystery
+1 for the hamming weight article. Note that, specifically for ARM, it mentions that Neon SIMD instructions provide vcnt, infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0489c/… to perform exactly this operation. There are also, for SIMD-type instruction sets, methods to speed it up using shuffling instructions as table lookups (every SSE2 register can be used as a 16-byte lookup table via pshufb, or in case of ARM Neon, vtbl). See stackoverflow.com/questions/3693981/…Dreadnought
Whether this code is fast depends on the processor. It will certainly not be very fast on a Cortex-A8, but it may run very fast on a Cortex-A9 and newer CPUs.

It is, however, a very short solution.

Expects input in r0, and returns output in r0

  vmov.32 d0[0], r0        @ move r0 into lane 0 of NEON register d0
  vcnt.8  d0, d0           @ per-byte bit counts
  vmov.32 r0, d0[0]        @ move the four byte counts back to r0

  add r0, r0, r0, lsr #16  @ add the counts from the upper halfword
  add r0, r0, r0, lsr #8   @ add the remaining byte counts
  and r0, r0, #63          @ keep 6 bits: the count can reach 32

The main work is done by the vcnt.8 instruction, which counts the bits of each byte in a NEON register and stores the per-byte counts back into the bytes of D0.

There is no vcnt.32 form, only .8, so you need to horizontally add the 4 bytes together, which is what the rest of the code is doing.
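The same sequence expressed with NEON intrinsics in C, if you would rather let the compiler schedule it (a sketch; the function name is mine, and on arm32 it assumes compiling with -mfpu=neon):

    #include <arm_neon.h>
    #include <stdint.h>

    static uint32_t popcount32_vcnt(uint32_t x)
    {
        uint8x8_t v = vcnt_u8(vreinterpret_u8_u32(vdup_n_u32(x))); /* per-byte counts */
        uint32_t r = vget_lane_u32(vreinterpret_u32_u8(v), 0);     /* back to a core register */
        r += r >> 16;   /* add the counts from the upper halfword */
        r += r >> 8;    /* add the remaining byte counts */
        return r & 63;  /* the count can reach 32, so keep 6 bits */
    }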

Enclave answered 1/4, 2013 at 9:52 Comment(2)
vmov.32 r0, d0[0] causes a VERY big sync delay between NEON and CoreConstitutionality
@user3124812: My understanding is that depends on the CPU. I think many ARMv8 CPUs have much less bad stalls, or if they have out-of-order exec, can do it around the data dependency (so there's some latency but not a stall). This applies even when running in 32-bit mode, AFAIK. But yes I think there are some CPUs where this is slow. It's probably less slow than a loop on most, unless only a couple bits are set and you're looping over set bits. I don't know how it compares to the branchless SWAR bithack from Hacker's Delight shown in auslen's answer; might well vary by CPU model.Obregon
The best references for bit hacks are Hacker's Delight and the Bit Twiddling Hacks page.

The Bit Twiddling Hacks page says:

The best method for counting bits in a 32-bit
integer v is the following:

v = v - ((v >> 1) & 0x55555555);                    // reuse input as temporary
v = (v & 0x33333333) + ((v >> 2) & 0x33333333);     // temp
c = ((v + (v >> 4) & 0xF0F0F0F) * 0x1010101) >> 24; // count
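Wrapped as a compilable function (only the declarations are mine; the arithmetic is verbatim from the snippet above):

    #include <stdint.h>

    static uint32_t popcount_swar(uint32_t v)
    {
        v = v - ((v >> 1) & 0x55555555);                    /* reuse input as temporary */
        v = (v & 0x33333333) + ((v >> 2) & 0x33333333);     /* temp */
        return ((v + (v >> 4) & 0xF0F0F0F) * 0x1010101) >> 24; /* count */
    }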

Then I would suggest using gcc and objdump (or this great online gcc tool) to see what this high-level snippet looks like as ARM instructions.

00000000 <popcount>:
 0: 1043        asrs    r3, r0, #1
 2: f003 3355   and.w   r3, r3, #1431655765 ; 0x55555555
 6: 1ac0        subs    r0, r0, r3
 8: 1083        asrs    r3, r0, #2
 a: f000 3033   and.w   r0, r0, #858993459  ; 0x33333333
 e: f003 3333   and.w   r3, r3, #858993459  ; 0x33333333
12: 18c0        adds    r0, r0, r3
14: eb00 1010   add.w   r0, r0, r0, lsr #4
18: f000 300f   and.w   r0, r0, #252645135  ; 0xf0f0f0f
1c: eb00 2000   add.w   r0, r0, r0, lsl #8
20: eb00 4000   add.w   r0, r0, r0, lsl #16
24: 1600        asrs    r0, r0, #24
26: 4770        bx  lr

So it looks like this gives you the result in 12 instructions, which roughly translates to the same number of cycles.

Comparing the integer twiddling above to the lookup-table approach used by libgcc below, the lookup table should be even slower considering the extra memory accesses.

00000028 <__popcountSI2>:
28: b410        push    {r4}
2a: 2200        movs    r2, #0
2c: 4c06        ldr r4, [pc, #24]   ; (48 <__popcountSI2+0x20>)
2e: 4613        mov r3, r2
30: fa40 f103   asr.w   r1, r0, r3
34: 3308        adds    r3, #8
36: 2b20        cmp r3, #32
38: b2c9        uxtb    r1, r1
3a: 5c61        ldrb    r1, [r4, r1]
3c: 440a        add r2, r1
3e: d1f7        bne.n   30 <__popcountSI2+0x8>
40: 4610        mov r0, r2
42: bc10        pop {r4}
44: 4770        bx  lr
46: bf00        nop
48: 00000000    andeq   r0, r0, r0
<.. snipped ..>
Marga answered 1/4, 2013 at 8:43 Comment(6)
How can you place a 32-bit constant into a 32-bit instruction?Histoid
@LưuVĩnhPhúc say what?Marga
like this: and.w r0, r0, #252645135 ; 0xf0f0f0f. The constant's size is 32 bits.Histoid
@LưuVĩnhPhúc well, that's compiler output so I assume it is very possible. The idea is that if a constant has low entropy you can stuff it inside an instruction. There should be an efficient way to load, say, 0x80000000 into a register; otherwise it would be a poor ISA.Marga
The and.w instruction isn't a vanilla AND; you can tell from the f prefix in the conditional field of the opcode. In this case, it is given an immediate of a single byte, 0x0f, which it applies to all 4 bytes of the register input, effectively 0x0f0f0f0f.Danidania
@Histoid see Operand2 in ARM. Flexible second operand supports complex packing of form 0xXYXYXYXY, for instance, where X and Y are 4-bit nibbles encoded in the instruction. Also supports variable shift on the constant value.Craiova
Since this is tagged ARM, the clz instruction is most helpful. The problem is also described as a population count. gcc has __builtin_popcount() for this, as do the ARM tools. There is this link (don't feel bad about your solution, someone made a web page with nearly the same), and there is also Dave Seal's version with six instructions for non-clz ARMs. The clz is advantageous and can be used to produce a faster algorithm, depending on the input.

As well as auselen's good reading suggestion, Hacker's Delight, this bit twiddling blog may be helpful; it talks about such things in a graphics context. At least I found it useful for understanding some of Qt's blitting code. It also has some usefulness in coding a population count routine.

The carry add unit is useful in a divide-and-conquer sense, making the problem O(log n). clz is more useful if the data has runs of ones or zeros.
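A sketch of the runs idea in C, assuming GCC's __builtin_clz (undefined for 0, hence the guards; the function name is mine):

    #include <stdint.h>

    /* Counts set bits one run of 1s at a time; fast when the bits
       cluster into long runs, slow when they alternate. */
    static int popcount_clz(uint32_t x)
    {
        int n = 0;
        while (x) {
            x <<= __builtin_clz(x);          /* discard the leading zeros */
            if (x == 0xFFFFFFFFu) {          /* avoid clz(~x) == clz(0) */
                n += 32;
                break;
            }
            int run = __builtin_clz(~x);     /* length of the leading run of 1s */
            n += run;
            x <<= run;                       /* run < 32 here, so the shift is defined */
        }
        return n;
    }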

The Hacker's Delight entry has more background on Dave Seal's ARM code.

Enforce answered 1/4, 2013 at 13:17 Comment(7)
I think __builtin_popcount is compiled from a lookup-table implementation; it should be good enough, but nothing specially crafted for ARM.Marga
@auselen: check what ARM architecture level you're targeting. clz is available only since ARMv5.Wagner
@IgorSkochinsky I don't think there is a clz implementation provided by gcc or llvm. __builtin_popcount() links to __popcountsi2.Marga
Well, popcount does more than clz so it needs a helper routine. __builtin_clz or __builtin_ctz emit clz directly for supported architectures.Wagner
@IgorSkochinsky may be I should clarify by saying I don't think there is any popcount implementation using clz - at least I couldn't see any.Marga
@Marga You are correct. I looked at libgcc too; only PowerPC is currently better, but at least you only pay for the table once. I already have 7 edits. I think actually the mul will give a better speed increase. clz is linear. If you can preserve the bit count with a hash, you can use a table to get the mapping 2^32 maps to 2^5; or maybe Dave Seal's is already the best pure ARM, which wouldn't be a surprise.Enforce
For instance, a population count on binary image/picture data may benefit from clz, especially if it has a lot of low frequency content. I think the question was asked in the context of graphics.Enforce
    LDR  r0, =0x000000FF    ; test value whose bits we count
    MOV  r1, #0             ; the bit count accumulates here
    MOV  r3, #0             ; this will always be zero
    MOV  r2, r0
rep MOVS r2, r2, LSR #1     ; the low bit goes into the carry flag
    ADC  r1, r1, r3         ; this adds r1 with zero plus the carry bit
    CMP  r2, #0
    BNE  rep

This will do it; r3 is just a dummy register holding 0 to make ADC work properly.

Dipsomania answered 26/4, 2016 at 21:4 Comment(2)
Shifting 1 bit at a time is not fast.Obregon
@Avsharian: "CMP" is redundant here (do you see why?). The same with r3: #0 would be enough "to make ADC work properly" (as to the initial "LDR", I'm just wondering...). Keep an eye out if you're writing in assembly, at least on StackOverflow :)Lazar
long count_bits_long (long);

    vmov.32 d0[0], r0       // R0 --> SIMD

    vcnt.8  d0, d0          // count bits in bytes
    vpaddl.u8 d0, d0        // add adjacent pairs of bytes and put into 16b words
    vpaddl.u16 d0, d0       // add adjacent pairs of 16b words and put into 32b word

    vmov.32 r0, d0[0]       // SIMD --> R0

    mov pc, lr              // return
Transmissible answered 8/10, 2017 at 17:50 Comment(1)
This is @Nils's answer, but NEON horizontal sum instead of scalar. Which CPUs is this faster on?Obregon
In AArch64, the CSSC extension introduces a scalar form of the popcount instruction:

 cnt  w0, w0

I don't think this is available in 32-bit mode, so this answer is slightly off-topic; 32-bit mode has only vcnt, which requires copying to a NEON register and back. That copy may stall the pipeline on CPUs that don't couple the integer and NEON units tightly, so it might be faster to use a scalar bithack to popcount in some cases, or even a loop if you expect only a couple of bits to be set. (I think it's more common for AArch64 CPUs not to stall, or not too badly, when moving data between integer and vector registers, but without CSSC they're in the same boat; see compiler output.)

GCC13 added support for CSSC, which you have to enable manually with -march=armv8-a+cssc. Even -march=armv9.3-a doesn't get GCC or clang to use it (Godbolt) for C++20 std::popcount (i.e. for __builtin_popcount()) without the +cssc part. -mcpu=cortex-x3 and -mcpu=cortex-a710 don't enable it either, so I assume they don't have it.
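For example, a snippet matching what is described above (flag and behavior per this answer; worth verifying on Godbolt):

    // gcc-13 -O2 -march=armv8-a+cssc should compile this to a single "cnt w0, w0"
    #include <stdint.h>

    int popcnt32(uint32_t x)
    {
        return __builtin_popcount(x);
    }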

Obregon answered 24/8, 2023 at 3:8 Comment(0)
If you don't have a popcnt instruction, this is typically done by using the vperm/pshufb/vtbl shuffle instruction to look up the number of bits of each nibble in a table. Roughly, in pseudocode:

ucharN someVector = …;
ucharN lowNibbles = someVector & 0xf;
ucharN highNibbles = someVector >> 4;

static const ucharN popcntTable = { 0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4…}; // vector[i] = popcnt(i)
return vtblq_u8(lowNibbles, popcntTable) + 
       vtblq_u8(highNibbles, popcntTable);

Depending on the vector arch you'll need to adjust the table size and table values to suit the quirks of the shuffle instruction, such as AVX2's barrier in the middle of the register, or arm32's 8-byte shuffle unit. Possibly I also have the vtbl arguments swapped. Hopefully you get the idea.
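As a concrete ARMv7 NEON rendering of the pseudocode (a sketch with names of my choosing; note the table is the first vtbl argument, and, as the comments below point out, you still need a horizontal sum such as vpaddl afterwards to get a single total):

    #include <arm_neon.h>
    #include <stdint.h>

    /* Returns per-byte popcounts of the 8 input bytes. */
    static uint8x8_t popcount_bytes_tbl(uint8x8_t v)
    {
        static const uint8_t counts[16] =
            { 0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4 };  /* counts[i] = popcnt(i) */
        uint8x8x2_t tbl;
        tbl.val[0] = vld1_u8(counts);
        tbl.val[1] = vld1_u8(counts + 8);

        uint8x8_t lo = vand_u8(v, vdup_n_u8(0x0f));  /* low nibbles  */
        uint8x8_t hi = vshr_n_u8(v, 4);              /* high nibbles */
        return vadd_u8(vtbl2_u8(tbl, lo), vtbl2_u8(tbl, hi));
    }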

Elytron answered 31/8, 2023 at 5:20 Comment(11)
Are there any ARM CPUs with NEON vtbl but not vcnt? (developer.arm.com/documentation/dui0489/i/…). (From there you need a horizontal sum of the per-byte results, whether you get them from vcnt or vtbl.)Obregon
This technique (of splitting into nibbles for a SIMD byte LUT) is useful on x86 as you mention, but AFAIK not on ARM unless there are such CPUs. If you want to post it somewhere, look for an x86 AVX2 question, or PowerPC Altivec. And make sure you link github.com/WojciechMula/sse-popcount for state-of-the-art popcount over whole arrays with SSE2, AVX2, or various AVX-512. And some NEON implementations. ARMv7 results: github.com/WojciechMula/sse-popcount/blob/master/results/arm/… from 2017 shows VCNT being so fast that Harley Seal to feed it was slowerObregon
You’d have to check in which arm arch each instruction was introduced. It certainly was a problem for SSE and AltiVec. Typically pairwise adds would be used to do the horizontal sum for vectors of >8 bit elements. As an interview question, I’d wager they are not looking for “just use vcnt” as the answer, but no harm in mentioning it.Elytron
NASM: Count how many bits in a 32 Bit number are set to 1 is an x86 asm version of this ARM question. My answer there shows multiple scalar ways, but only mentions pshufb without showing code.Obregon
Also, the vcnt instruction on a vector arch commonly will not include the full suite of 8, 16, and 32 bit types, so even if present you may find yourself still weighing whether to use the table lookup vs. expansion all the way out to 32-bits and back for a 8 bit type .Elytron
Yeah, different architects included different features in the first versions of their SIMD ISAs. x86 didn't get a popcnt until SSE4.2 in Nehalem (after pshufb in SSSE3 in Core 2), and that only for scalar. SIMD popcnt wasn't until Ice Lake over a decade later, with AVX-512VPOPCNT and AVX-512BITALG. But AFAICT (e.g. from an ARMv7 ISA manual from 2008) ARM didn't have NEON at all until ARMv7, which introduced all the NEON aka Advanced SIMD instructions at once, with NEON as an all or nothing optional extension to ARMv7.Obregon
There was some phasing in for fp16 in cortex-a9. It wasn’t entirely born in one fell swoop, for example. Certainly arm64 has had its share of evolution.Elytron
Yes, as I mentioned in my first comment, you do still need a horizontal sum of the 4 bytes after vcnt.8, because it's only ever available with 8-bit element size. Your answer doesn't mention this, returning a SIMD vector with some per-nibble sums. It doesn't even use separate d halves of a q register for the shift vs. mask to set up for one vtbl instead of two. (q-sized vtbl may be slower on some CPUs, or maybe the same speed as two d-sized vtbl, but you need the 4-bit indices.) Probably hard to express that with intrinsics; at least in a way that compiles to non-garbage.Obregon
Ok, so there was some evolution after the initial ARMv7 NEON. But ARMv7 NEON included vcnt, and no earlier ARM SIMD extension included vtbl or an equivalent byte-shuffle, AFAIK. And AFAIK, there's no way for a later CPU to omit vcnt but not vtbl. (at least not to advertize via the CPUID mechanism introduced in ARMv7, AFAIK. Of course a custom core could leave out whatever you want and code for it would just have to avoid certain instructions.) Also, vtbl is slow on some low-end ARM cores, and NEON->integer transfers are also slow on some cores, stalling the pipelines to sync.Obregon
Since you mention this being an interview question, yeah the original interview question might have been looking for general-purpose-integer bithacks or efficient loops using 1-bit shifts or Kernighan's clear-lowest-set-bit. IDK if they were hoping for a NEON answer, but if I was looking for ARM knowledge specifically and someone proposed a nibble-LUT instead of using vcnt, I wouldn't be very impressed. Especially if they forgot to horizontal-sum the result. The nibble-LUT is a useful technique in general, just not for this on ARM, only other ISAs.Obregon
I meant to link Why doesn’t Clang use vcnt for __builtin_popcountll on AArch32? earlier. Nate's comment on the accepted answer mentions using a sequence of vpaddl to hsum the per-byte counts, since unlike in AArch64, we don't have a single-instruction hsum over more than pairs. But anyway, it's interesting that a generally-good compiler would choose not to use an integer -> NEON -> integer round trip for a single scalar popcount. That's probably a reasonable tuning decision for some CPUs, not a missed-optimization. (But maybe missed-opt for others)Obregon
