NASM: Count how many bits in a 32 Bit number are set to 1
Asked Answered
M

9

5

I have a 32 Bit number and want to count know how many bits are 1.

I'm thinking of this pseudocode:

mov eax, [number]
while(eax != 0)
{
  div eax, 2
  if(edx == 1)
  {
   ecx++;
  } 
  shr eax, 1
}

Is there a more efficient way?

I'm using NASM on a x86 processor.

(I'm just beginning with assembler, so please do not tell me to use code from extern libraries, because I do not even know how to include them ;) )

(I just found How to count the number of set bits in a 32-bit integer? which also contains my solution. There are other solutions posted, but unfortunatly I can't seem to figure out, how I would write them in assembler)

Manhood answered 28/5, 2010 at 18:19 Comment(2)
Obviously you shouldn't actually use div, that's one of the slowest integer instructions. Just check the low bit of EAX with test al, 1. Or shr eax,1 / adc ecx, 0 would be an efficient way to implement that pseudo-code.Broadminded
Hamming weight ( number of 1 in a number) mixing C with assembly shows an example of efficiently looping, using both the ZF and CF results from one SHR. If you're going to use a simplistic bit-at-a-time loop, that's how to do it efficiently.Broadminded
S
8

The most efficient way (in terms of execution time, anyway) is to have a lookup table. Obviously you're not going to have a 4-billion entry table, but you could break the 32 bits down into 8-bit chunks and only need a 256-entry table, or further down into 4-bit chunks and only need 16 entries. Good luck!

Scatter answered 28/5, 2010 at 18:22 Comment(5)
If upfront cost is a problem, you could build the lookup table as you go. You know only ONE entry will have a value of 0 1's, and that's 0x00. Therefore, if an entry in the lookup table is 0, you know you need to count that one, but once you've calc'd it once, you can store it there. This way, you don't have to count all 256 when you start.Beverlybevers
@glowcoder, that's a good suggestion. This questions sounds like a homework problem, though, so I think it's a bit overkill. I'd say it's much less complicated to just pre-generate the table.Scatter
You can do a 32 bit population count in 15 - 20 instructions (see e.g. Hacker's Delight by Warren). Breaking up the word into 8 bit chunks, doing 4 table lookups and then summing the 4 results is probably not going to be as efficient as this, and it doesn't lend itself to optimisation, e.g. SIMD, GPGPU, etc.Froe
The table access could be much slower than a clever computation inside the CPU.Hage
With SSSE3, use pshufb to do sixteen 4bit LUT lookups in parallel. If the popcnt instruction isn't available, but pshufb is, it's the best option. Without either, IDK whether a 256B byte-LUT is better than the bithack way.Broadminded
S
8

In processors that have SSE4 support, you have the POPCNT instruction that does this for you.

The most naive algorithm is actually faster than what you thought up (DIV instructions are really slow).

mov eax, [number]
xor ecx,ecx
loop_start:
  test eax,1
  jnz next
  inc ecx
next:
  shr eax, 1
  mov eax,ecx

Regarding your comment about previous SO answers, I'm going to take an example answer from there and walk you through how I'll convert it.

long count_bits(long n) {     
  unsigned int c; // c accumulates the total bits set in v
  for (c = 0; n; c++) 
    n &= n - 1; // clear the least significant bit set
  return c;
}

(I'm going to assume you know how to define a function and fun stuff like that). What is needed is a very simple loop, a counter variable (traditionally, ecx is both the index and a counter), and bit testing instructions.

    mov edx,n
    xor ecx,ecx
loop_start:
    test edx,edx
    jz end
    mov ebx,edx
    dec ebx
    and edx,ebx
    inc ecx
    jmp loop_start
end:
    mov eax,ecx
    ret

Implementing something like the Hamming Weight algorithm in assembly isn't complicated, but is just complicated enough that you'd rather not do it as an initial homework problem.

Sumo answered 28/5, 2010 at 19:12 Comment(0)
H
5

My x86 assembler is a bit rusty, but this comes to mind:

clc            ; clear carry
xor ecx, ecx   ; clear ecx

shl eax, 1     ; shift off one bit into carry
adc ecx, 0     ; add carry flag to ecx
; ... repeat the last two opcodes 31 more times

ecx contains your bit count.

x86 shift instructions set CF to the last bit shifted out, where adc ecx, 0 reads it.

Heelpost answered 28/5, 2010 at 18:38 Comment(2)
You don't need clc because shl eax unconditionally sets CF to the bit shifted out. adc is probably the best way to implement the naive way, but you can exit the loop when eax becomes zero, rather than always doing 32 iterations. However, any kind of bit-at-a-time loop is significantly slower than the best bithack or LUT (pshufb) options.Broadminded
I added an answer to this question showing the bithack asm, and also a loop with adc/shr/jnz as the body. It would not be worth fully unrolling without an early-out, but could be worth unrolling by 2 if you still care more about small code size than speed, but want a bit more front-end throughput. The bithack version is certainly much better than fully unrolling, about 15 uops vs. 64.Broadminded
B
3

For the record, if you want good performance, you usually want to avoid looping / branching, with either an 8-bit table lookup or a multiply bithack (GCC's current scalar fallback for __builtin_popcnt without -mpopcnt). Looping can be barely ok if your numbers are usually small (right shift by 1), or if your numbers usually only have a few bits set (looping on clearing the lowest set bit with x & (x-1)). But those perform rather poorly for numbers with half or more of their bits set.


Most modern x86 CPUs support the popcnt instruction. It's implied by SSE4.2, but also has its own CPUID feature bit so a CPU could have it without SSE4.2. Intel Core 2 and older do not have this.

xor     eax,eax     ; avoid false dependency on Sandybridge-family before IceLake
popcnt  eax,  edi

If you don't mind overwriting the same register, popcnt edi, edi for example avoids the danger of an output false dependency: you already have a true dependency on the same register. (Why does breaking the "output dependency" of LZCNT matter?)


Without HW popcnt, another option is SSSE3 pshufb, which is actually great for counting large arrays, especially if you have AVX2. See


Fallbacks with baseline x86 instructions

An array lookup is possible, extracting each byte with movzx ecx, al / movzx edx, ah / shr eax, 16 etc. Then movzx ecx, [table + rcx] / add cl, [table + rdx]. Note that the total result will be at most 64, so won't overflow an 8-bit register. That would need a 256-byte table to stay hot in cache for good performance. It may be a good choice if you do a lot of popcnt but can't use SIMD; benchmark it against the bithack for your use-case.

A bithack from https://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetParallel / How to count the number of set bits in a 32-bit integer? is what GCC currently uses if HW popcnt isn't enabled at compile time. (i.e. in the libgcc helper function). See that answer for an explanation of how/why the bithack sums bits to 2-bit accumulators, then horizontally again to 4-bit, etc. (Fun fact: GCC and clang actually recognize that C logic as a popcnt idiom and compile it to a popcnt instruction with -mpopcnt. The following asm is GCC -O3 output without -mpopcnt; I don't see any way to improve it by hand. It's using EAX as the destination as much as possible for AND to allow the and eax, imm32 short-form without a modrm byte.)

This non-branching code and doesn't need any data lookups, so it can't cache miss (except for I-cache), and is probably good if you care about popcount performance (especially latency) but don't do it often enough to keep a lookup table hot in cache. (Or for 64-bit integers, a 64-bit version of this is probably even better than 8x byte lookups.)

; x86-64 System V calling convention
; but also of course works for 32-bit mode with the arg in a register
numberOfSetBits:     ; 32-bit unsigned int x    in EDI
    mov    eax, edi
    shr    eax, 1
    and    eax, 0x55555555          ; (x>>1) & 0x55555555
    sub    edi, eax                 ; x -= ((x>>1) & 0x55555555)   2-bit sums

    mov    eax, edi
    shr    edi, 0x2
    and    eax, 0x33333333
    and    edi, 0x33333333
    add    edi, eax                 ; pairs of 2-bit accumulators -> 4

    mov    eax, edi
    shr    eax, 0x4
    add    eax, edi                 ; we can add before masking this time without overflow risk
    and    eax, 0x0f0f0f0f

    imul   eax, eax, 0x01010101       ; sum the 4 bytes into the high byte (because their values are small enough)
    shr    eax, 24
    ret    

For 64-bit integers, it's the same sequence, ending with a 64-bit multiply. (But you need mov reg, imm64 to materialize 64-bit mask and multiplier constants; they won't work as immediates to AND or IMUL).

Instructions like RORX could be useful to more efficiently copy-and-shift instead of mov/shr, but any CPU with RORX would also have POPCNT so you should just use that! LEA to copy-and-left-shift doesn't help: addition propagates carry from low to high, so to avoid losing bits at the top in the first step, you need to be right-shifting. The >>2 step couldn't add into the higher of each pair of 2-bit accumulators either: the max sum at that point is 4, and that requires 3 bits to represent, so the highest accumulator (at the top of the register) would possibly lose a count if you did lea eax, [rdi + rdi] / 2x and / add, because instead of 4 bits misaligned, it only has 2. And you'd eventually need a right shift to put counters back at the bottom of their bytes at some point before imul, so you'd lengthen the critical path latency even if it was possible to use left-shift/add in earlier steps.

Looping: smaller code-size, much slower worst-case

There are three main choices:

  • Lookup table of 8-bit chunks, used 4 times
  • shift by 1 (left with add same,same or right with shr) and add the bit shifted out. Less bad if the set bits are usually clustered towards the high or low end so the register becomes zero after much fewer than 32 iterations, but that's still the worst case.
  • clear the lowest set bit with x &= x-1 and count how many iterations to become zero. Less bad if there are few set bits total. (Or if you NOT the input first, if there are few cleared bits. Or maybe there's a bithack for setting the lowest zeroed bit, like x |= x+1 maybe?). Worst case is still 32 iterations, with a longer dep chain than just shifting.

For small code-size (but not speed), the loop shown in Hamming weight ( number of 1 in a number) mixing C with assembly is quite good. A NASM version of that looks like:

;;;   Good for small inputs (all set bits near the bottom)
;; input: EDI  (zeroed when we're done)
;; output: EAX = popcnt(EDI)
popcount_shr_loop:
    xor   eax, eax
  ; optional: make the first adc non-redundant by peeling the first iteration.  Otherwise just fall into the loop (with CF=0 from xor)
    shr   edi, 1         ; shift low bit into CF
                 ;; jz .done   ; not worth running an extra instruction for every case to skip the loop body only for the input == 0 or 1 case
 .loop:
    adc   eax, 0         ; add CF (0 or 1) to result
    shr   edi, 1
    jnz   .loop          ; leave the loop after shifting out the last bit
 ;.done:
    adc   eax, 0         ; and add that last bit
    ret

If the set bits in your input are likely to be near the top, use add edi, edi instead of shr, since it sets FLAGS we care about the same as shl would. add can macro-fuse with jcc on Sandybridge-family, so that's actually slightly better than shr; more hyperthreading-friendly, and fewer uops in the ROB so OoO exec can see farther past it, if the loop-exit branch predicts correctly. Or into the loop sooner if an earlier cache miss or something is still stalling retirement.

For even smaller code-size, you could skip the shr before falling into the loop, so the first adc is redundant. (xor-zeroing clears CF).

@spoulson's answer suggests unrolling the loop 32 times (without jz .done). The bithack shift/and/add ending with multiply is better when you want one big straight-line block of code for maximum speed with arbitrary bit-patterns. adc reg,0 is 1 uop on most CPUs, except Intel P6-family (PPro to Nehalem) (0 was a special case on Intel SnB-family before Broadwell). Anyway, 64 uops and 32-cycle latency is still bad vs. the 15-uop bithack, so a full unroll of this would be worse than other strategies.

However, unrolling this by 2 or 4 could make sense as a middle-ground. That would make different inputs branch the same way, e.g. every input with its set bits in the low 4 would run through the loop once, with the branch not-taken.

popcount_shr_loop_unroll2:
    xor   eax, eax
    shr   edi, 1         ; shift low bit into CF
          ;; jz .done     ; still optional, but saves more work in the input <= 1 case.  Still not worth it unless you expect that to be very common.
 .loop:
%rep 2            ;; Unroll
    adc   eax, 0         ; add CF (0 or 1) to result
    shr   edi, 1
%endrep           ;; still ending with ZF and CF set from a shift
    jnz   .loop          ; leave the loop on EDI == 0
 ;.done:
    adc   eax, 0         ; there may still be a bit we haven't added yet
    ret

You could try to let out-of-order exec see the loop-exit condition sooner by doing shr edi, 4 / jnz as the loop branch, and have the loop body copy EDI to another register and shift out the low 4 bits 1 at a time. But at that point you probably just want the bithack version; x86 CPUs with OoO exec also have fast imul r32, like 4 cycle latency on Pentium II/III, 3-cycle on AMD K8 and later, and Intel since Core 2. And their code fetch / decode capability should handle the larger instructions involving 32-bit mask constants well enough.

(Since we're considering old CPUs: On P5 Pentium, shr and adc can both only run in the U-pipe, so unrolling doesn't let them pair with each other to exploit the ILP. It would if you used add to shift the high bit into CR, though, since add can run in either the U or V pipe.)

Another unroll option is to split into two halves, high half going out the top, low half out the bottom. (Accumulate into separate counters, too, if you care about latency, otherwise it could still help OoO exec find the loop exit sooner. But then testing for both halves being zero gets clunky; maybe mov ecx, ebx/add ecx, edx/jnz. ADD can macro-fuse with jnz on SnB-family, unlike OR. Or use LEA / TEST+JNZ, 2 front-end uops on AMD Zen as well as Intel.)


Another option is looping on lea edx, [rdi-1] / and edi, edx (clear the lowest set bit, set ZF if it became zero). This can be ok for numbers with only a couple set bits.

  ;; could be good if very few bits are set, even if they're scattered around
;; Input: EDI  (zeroed when done)
;; output: EAX = popcount(EDI)
;; clobbers: EDX
popcount_loop_lsr:
    xor  eax,eax
    test edi,edi
    jz   .done            ; if(!x) return 0;
 .loop:                   ; do{
    inc  eax                 ; ++count
    lea  edx, [rdi-1]
    and  edi, edx            ; x &= x-1  clear lowest set bit
    jnz  .loop            ; }while(x)

 .done:
    ret

For more bithacks like x & (x-1), see https://catonmat.net/low-level-bit-hacks. Also note that the BMI1 instruction blsr does this, so that's a handy place to check as a reminder of the formula when you already have an x86 instruction reference open. But of course if you had BMI1, you'd have popcnt. popcnt actually has its own feature-bit, but there aren't any real-world CPUs that have BMI1 but not popcnt/SSE4.2.

Note that this has a 2-cycle loop-carried dependency through LEA and AND, unlike the 1-cycle dependency through SHR and ADC (assuming single-uop ADC) in the other loop. So each iteration has twice as long a data dependency. But on the plus side, we're only looping over the set bits, skipping past zeros. Still, the worst case (EDI=-1) has twice the latency.

and/jnz can actually macro-fuse on Intel SnB-family into a single and-and-branch uop. (Because it's like test). So it's still only 3 front-end uops per iteration, but the branch mispredict is unlikely to be detected soon, so in terms of overall front-end cost this version can be bad.

Since inc eax is just counting loop iterations, no data dependency on the x update logic, unrolling would still require a branch, I think, unless you did some extra logic after the loop to check if a middle temporary had already been zero. Since the x &= x-1; dep chain is the critical path, unrolling is probably not helpful.

(If you want to find the position of every set bit and store into an array, you can unroll with overshoot if you have a separate efficient way to popcount, as in @aqrit's answer on another Q&A)

Broadminded answered 3/5, 2021 at 4:6 Comment(0)
H
1
      mov eax,[c]
      xor ebx,ebx
SSS:  shr eax,1    ; after shift, if eax=0 ZF flag=1
      jz  XXX      ; end (no more bit on eax)
      adc bl
      jmp SSS
XXX:  adc bl
      movb [Nbit],bl
Hollerman answered 6/8, 2017 at 17:4 Comment(3)
You could modify the loop to only have a jnz at the bottom, instead of a jmp and a jz. On entry, jump to the shr in the middle of the loop. SSS: adc/shr/jnz SSS / adc. Since it's ok to do an extra iteration, you could also peel some unrolled iterations at the start so you can fall into the loop. e.g. mov ebx,eax/ and ebx,1 / shr eax, 2 / then fall into the loop for the first adc. Of course if you cared about performance, you wouldn't be using this naive loop (unless your values were almost always 0 to 3 or something, when this might be faster than the bithacks)Broadminded
Hamming weight ( number of 1 in a number) mixing C with assembly shows an example of efficiently looping, using both the ZF and CF results from one SHR, but still only 3 instructions in the loop.Broadminded
adc bl is not a valid instruction. Perhaps you meant adc bl, 0Newberry
S
0

This program gives you the number of 1's in a 32 bit number. Try out :)

extern printf                     
SECTION .data                   
msg:    db "The number of 1 bits are: %d",10,0
inta1:  dd  1234567  
num: dd  2147483647   
SECTION .text                     

global  main                  
main:     
    mov eax, [num]  
    mov ecx,32  
    mov edx,0  
.loop:  dec ecx  
    cmp ecx,0  
    jl .exit  
    shr eax,1  
    jnc .loop  
    inc edx  
jmp .loop 
.exit:
    push edx
    push    dword msg         
    call    printf            
    add     esp, 8  
Sequestration answered 11/5, 2016 at 18:15 Comment(1)
See also @ChrisDodd's very similar answer to a question from this user on how to count bits. (This isn't plagiarism, though, since the logic is different and less efficient, and the main program wrapped around it is original work.) Also note that a ret instruction at the end of this would make it not crash.Broadminded
N
0

Using bsf (Bit Scan Forward) is probably a bit more efficient than plain shifting.

xor         edx,edx  
mov         eax,num  
bsf         ecx,eax
je          end_bit_count
; align?
loop_bit_count:
inc         ecx  
inc         edx  
shr         eax,cl  
bsf         ecx,eax  
jne         loop_bit_count
end_bit_count:
Nostology answered 18/2, 2018 at 14:57 Comment(6)
Probably yes for inputs with few bits set but where those bits are sparse instead of clustered at end that's shifted out first. But note that variable-count shl costs 3 uops on Sandybridge-family, and that bsf has a false dependency on the output, so here's a loop-carried dependency chain on ecx. #21390665. (Although that 2-cycle dep chain maybe not be a bottleneck.)Broadminded
Anyway, using the n & (n-1) bithack to clear the lowest set bit is going to be better than BSF / SHR. Do that with inc ecx / lea edx, [rax-1]` / and eax, edx / jnz loop_bit_count (with a check to skip the loop if initial eax=0, or branchlessly set the initial ecx to -1 if the input is zero). Or use BMI1 blsr to do the n&(n-1) in one instruction which sets ZF.Broadminded
But a non-looping implementation is almost certainly the best bet if you care about optimization, because branch misprediction kill performance with data-dependent branching unless the patterns are very predictable. (The whole idea of your answer is to loop popcnt(n) times, rather than a fixed 32 times.) The bithack involving a multiply to move bits where they belong is very good, and can be implemented efficiently in x86 asm (by a compiler if you want).Broadminded
One could expand the block with a macro, but it would become a rather large chunk. Anyway, the bithack is very interesting, so is the rest of your comment. So thanks.Nostology
re: the loop-carried dep chain. Silly me: the other input for bsf is ready at least a cycle after ecx, so the false dependency is totally irrelevant. The loop has about a 3 cycle loop-carried dep chain, not 2: inc ecx -> shr -> bsf -> repeat.Broadminded
Unrolling the block with a macro instead of a runtime loop wouldn't help: you'd need to conditionally increment, or conditionally jz out of the fully-unrolled block. You could make the conditional increment branchless and unroll by 4 or something, so branch prediction would have an easier pattern if it was common to have values with nearby counts. (i.e. so the loop would run the same number of iterations for some inputs that produce nearby output values.)Broadminded
B
-1
    mov eax,dword [number]; we store the number in eax
    mov ecx,1
    mov edx,0
    loop_1:
    cmp eax,0            ;we compare the number with 0 
    je endl_loop         ;when the number is zero we exit the loop
    test eax,01h         ;is the last bit equal to 1?
    jpe the_bit_is_zero  ;jump if parity is even=the bit is zero
    inc edx              ;we found another 1 digit
    the_bit_is_zero:
    inc ecx              ;we continue the loop
    shr eax,1            ;shift the bits to right =nr/2
    loop loop_1
    endl_loop:
    ;the result is stored in edx
Bautram answered 1/2, 2022 at 19:14 Comment(1)
What's the point of using the loop instruction instead of jmp if you keep adjusting ECX so it's always taken? This seems over-complicated compared to the loops in other answers, with no advantages. This seems more like a beginner attempt that belongs on codereview.stackexchange.com, not as an answer we'd recommend future readers actually use or learn from. Also, test / jz is the idiomatic way to see if any bits were set; jpe may be slower on some CPUs, and it non-obvious for human readers.Broadminded
W
-3

The best way:

tabx:array [0..255] of byte = //number of bit for each byte (COPY THIS TABLE)
    (0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4,
     1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,
     1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,
     2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,
     1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,
     2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,
     2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,
     3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,
     1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,
     2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,
     2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,
     3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,
     2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,
     3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,
     3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,
     4,5,5,6,5,6,6,7,5,6,6,7,6,7,7,8);

In MASM:
asm
mov   eax,number //32 bit 
movzx ecx,tabx[al] //for clear ecx except cl
addb  cl,tabx[ah]  //add ah to cl  
shr   eax,16  //put left part in ah-al
addb  cl,tabx[al]
addb  cl,tabx[ah]
mov   result,ecx

Whitewash answered 6/7, 2019 at 21:56 Comment(1)
tabx[ah] or al isn't a valid addressing mode; any registers have to be of address width. You obviously didn't even try assembling this. (Or compiling it, since it looks like MSVC inline asm.) In general a table lookup is a reasonable strategy for machines without hardware popcnt, but ALU bithacks are probably better if you don't need it very often.Broadminded

© 2022 - 2025 — McMap. All rights reserved.