Hamming weight ( number of 1 in a number) mixing C with assembly

Asked 20/11, 2014 at 22:13 Answered 21/11, 2014 at 13:2

Solved c assembly x86 bit-manipulation hammingweight

I'm trying to count how many number 1, are in the numbers of an array.

First I have a code in C lenguaje(work ok):

int popcount2(int* array, int len){
    int i;
    unsigned x;
    int result=0;
    for (i=0; i<len; i++){
        x = array[i];
        do{
           result+= x & 0x1;
           x>>= 1;
       } while(x);
    }
return result;
}

Now I need to translate the do-while loop into Assembly using 3-6 lines of code. I have write some code but the result is not correct.( I'm new in the assembly world )

int popcount3(int* array, int len){
int  i;
unsigned x;
int result=0;   
for (i=0; i<len; i++){
    x = array[i];
    asm(
    "ini3:               \n"
        "adc $0,%[r]     \n"
        "shr %[x]        \n"
        "jnz ini3        \n"

        : [r]"+r" (result)
        : [x] "r" (x)       );
  }
}

I'm using GCC ( on linux) with an Intel processor.

Wellfavored answered 20/11, 2014 at 22:13 Comment(2)

Your code is wrong for at least 2 reasons: 1) it doesn't initalize the carry flag, so it may be set and 2) the last bit you shift out isn't accounted for. – Hehre 20/11, 2014 at 23:10

Related: How to count the number of set bits in a 32-bit integer? shows a branchless bithack implementation which is quite nice and easy to implement on x86. If hardware popcnt or SSSE3 pshufb aren't available, it or a table lookup are probably your best bet. – Barclay 7/11, 2017 at 16:51

The nicest think you can do is using built in popcount function as suggested by Paul R, but since you need to write it in assembly, this worked for me:

asm (
"start:                  \n"
        "and %0, %1      \n"
        "jz end          \n"
        "shr $0, %1      \n"
        "jnc start       \n"
        "inc %1          \n"
        "jmp start       \n"
"end:                    \n"
        : "+g" (result),
          "+r" (x)
        :
        : "cc"
);

At first two lines you just check the contents of x (and go to end if it's zero Jump Zero). Than you shift x one bit to right and:

At the end of the shift operation, the CF flag contains the last bit shifted out of the destinationoperand. *

If there's no CF set, just go to start (Jump Not Carry) else increment result and then go to start.

And the beautiful think about assembly is that you can do things in so many ways...

asm (
"start:                  \n"
        "shr $1, %1      \n"
        "jnc loop_cond   \n"
        "inc %0          \n"
        "and %1, %1      \n"
"loop_cond:              \n"
        "jnz start       \n"

        : "+g" (result),
          "+r" (x)
        :
        : "cc"
);

Here you again use SHift Right instruction, if no CF is present just go to loop condition.

Otherwise again increment result and call binary AND (INC does modify ZF).

Using `LOOP` and `ECX`

I was curious how to do this in 3 instruction (I figured your teacher wouldn't give you bottom limit of 3 if it wasn't possible) and I realized x86 also has LOOP instruction:

Each time the LOOP instruction is executed, the count register is decremented, then checked for 0. If the count is 0, the loop is terminated and program execution continues with the instruction following the LOOP instruction. If the count is not zero, a near jump is performed to the destination (target) operand, which is presumably the instruction at the beginning of the loop. *

And you can add input argument using GCCs input constrain:

c - The c register.

asm (
"start:              \n"
    "shr $1, %1      \n"
    "adc $0, %0      \n"
    "loop start      \n"

    : "+g" (result)
    : "r" (x),
      "c" (8)             // Assuming 8b type (char)
);

Just to make sure it compiles to proper assembly:

0x000000000040051f <+25>:   mov    $0x8,%ecx
0x0000000000400524 <+30>:   mov    -0x8(%rbp),%eax
0x0000000000400527 <+33>:   shr    %edx
0x0000000000400529 <+35>:   adc    $0x0,%eax
0x000000000040052c <+38>:   loop   0x400527 <main+33>

I think the first one should have a bit better performance, especially if there's just 1 bit set, this approach always does k*8 iterations.

SSE4 and single instruction

I know you have to use a loop, but just for fun... With SSE4 extension you could do this by just one instruction POPCNT:

This instruction calculates of number of bits set to 1 in the second operand (source) and returns the count in the first operand (a destination register). *

I think (I have a quite old CPU on my notebook, so I can't test this for you) you should be able to do this with just one simple instruction:

asm (   
    "POPCNT %1, %0   \n"
    : "=r" (result)
    : "mr" (x)
    : "cc"                                                                                                                                       
);

(If you try this and you do have SSE4 extension, please let me know if it works)

Performance

I've measured times required to do 100,000,000 popcounts comparing my first and second methods with David Wohlferd's. ^{[Raw data]}

+--------------+------------+------------+------------+
|              | 0x00000000 | 0x80000001 | 0xffffffff |
+--------------+------------+------------+------------+
| 1st solution |  0.543     |  5.040     |  3.833     |
| LOOP         | 11.530     | 11.523     | 11.523     |
| Davids       |  0.750     |  4.893     |  4.890     |
+--------------+------------+------------+------------+

If anyone can compare these 3 with SSE4's POPCNT instruction I'd be glad.

Reentry answered 20/11, 2014 at 23:45 Comment(5)

In both of these examples, you are declaring x as an input, but you are changing its value. Very bad idea. Only outputs can (safely) be changed. Maybe something more like: "+g" (result), "=r" (junk) : "1" (x). Also (speaking for myself), I find using symbolic names makes the code easier to read (inc %[result] instead of inc %2. And what's with using %2? Parameters are zero based, so there is no %2? – Longobard 21/11, 2014 at 1:27

@DavidWohlferd thanks for notes, I didn't know that about the output/input.... And the %2 was typo when rewriting answer from "working version" (I was on my way to bed when I've saw the question and got interested). And indeed it's easier to work with symbolic names. – Reentry 21/11, 2014 at 9:44

You don't need the loop instruction. It's slow. You can do it with 3 insns inside the loop this way: With CF cleared on loop entry (e.g. from xor-zeroing result), .l: adc $0, %[res] ; shr $1, %[x] ; jnz .l; Then after the loop you need one last adc $0, %[res]. Oh, that's what David already posted. – Barclay 7/11, 2017 at 16:4

The first code block (with shr with a count of $0) appears to be buggy as well as inefficient. – Barclay 7/11, 2017 at 16:7

The asm constraints on the version using loop are still unsafe: you tell the compiler that "r"(x) and ECX are a read-only inputs, but you destroy them. – Barclay 27/2, 2019 at 23:5

You are starting out with a really inefficient algorithm - if you use a better algorithm then you may not need to waste time with assembler. See Hacker's Delight and/or Bit Twiddling Hacks for much more efficient methods.

Note also that newer x86 CPUs have a POPCNT instruction which does all of the above in one instruction (and you can call it via an intrinsic too, so no need for asm).

And finally gcc has a builtin: __builtin_popcount, which again does all you need - it will use POPCNT on newer CPUs and equivalent asm on older CPUs.

Casandracasanova answered 20/11, 2014 at 22:18 Comment(4)

My guess would be that this is a homework... So although you're providing probably the best solution, you're not answering the actual question "Now I need to translate the do-while loop into Assembly using 3-6 lines of code.". – Reentry 20/11, 2014 at 22:21

@Vyktor: I fear you may be right, but the OP didn't mention anything about this being homework. I will leave this answer here though, as it may be useful to someone searching for a (non-homework) solution in the future. – Casandracasanova 20/11, 2014 at 22:23

Thanks alot for your answer .But now I'm learning assembly an this is an exercise and I must resolve it with this algorithm. However I will have your answer in mind for a future ( when I know something more assembler :) ). Do you know how to resolve using my algorithm ? – Wellfavored 20/11, 2014 at 23:9

@Antonio: OK - for future reference you should probably make constraints such as this clear in your question so that people don't waste time writing answers that you can not use. As for your assignment, use gcc -S to see what code the compiler generates for the C version and use that as a template for your inline asm. – Casandracasanova 20/11, 2014 at 23:13

When I needed to create a popcount, I ended up using the 5's and 3's method from the Bit Twiddling Hacks @PaulR mentioned. But if I wanted to do this with a loop, maybe something like this:

#include <stdio.h>
#include <stdlib.h>

int popcount2(int v) {
   int result = 0;
   int junk;

   asm (
        "shr $1, %[v]      \n\t"   // shift low bit into CF
        "jz done           \n"     // and skip the loop if that was the only set bit
     "start:               \n\t"
        "adc $0, %[result] \n\t"   // add CF (0 or 1) to result
        "shr $1, %[v]      \n\t"
        "jnz start         \n"     // leave the loop after shifting out the last bit
     "done:                \n\t"
        "adc $0, %[result] \n\t"   // and add that last bit

        : [result] "+r" (result), "=r" (junk)
        : [v] "1" (v)
        : "cc"
   );

   return result;
}

int main(int argc, char *argv[])
{
   for (int x=0; x < argc-1; x++)
   {
      int v = atoi(argv[x+1]);

      printf("%d %d\n", v, popcount2(v));
   }
}

adc is almost always more efficient than branching on CF.

"=r" (junk) is a dummy output operand that is in the same register as v (the "1" constraint). We're using this to tell the compiler that the asm statement destroys the v input. We could have used [v] "+r"(v) to get a read-write operand, but we don't want the C variable v to be updated.

Note that the loop trip-count for this implementation is the position of the highest set bit. (bsr, or 32 - clz(v)). @rcgldr's implementation which clears the lowest set bit every iteration will typically be faster when the number of set bits is low but they're not all near the bottom of the integer.

Longobard answered 21/11, 2014 at 2:28 Comment(5)

I'm also still new to this stuff... Could you please explain your algorithm and =r(junk)? – Reentry 21/11, 2014 at 9:50

'junk' isn't some special keyword or anything. It's just a variable (declared above). The thing that makes it interesting is in the inputs where I declare that (v) should use the constraint "1". This means it will share the same place as parameter #1. So, on input the register will contain 'v'. On output, it will contain the variable junk (meaning gcc won't mistakenly believe it can still find 'v' in that register). Since gcc never needs the contents of junk, the register is immediately available for re-use. – Longobard 21/11, 2014 at 10:40

As for the algorithm, well, the most interesting thing is that 'adc' means 'add with carry'. If carry is not set, then 'adc $0, %[v]' does nothing. Otherwise it adds 1. This instruction is normally useful when adding REALLY big numbers (think: adding two 96 bit integers: 'add' the first 2 integers, then 'adc' then next two making sure to account for the overflow from the first add, etc). This allows us to have fewer jump statements, which (typically, but not always) results in faster code. I haven't actually timed it though, so yours could still be faster. – Longobard 21/11, 2014 at 10:42

Thanks for explaining (I think you should move parts of the comments to your answer so it would provide more detailed, "teaching" explanation). I don't work with this, it's just a bit challenging stuff for me to do so that's why I'm interested. Anyway +1. – Reentry 21/11, 2014 at 10:47

I wrote up a NASM version of this for NASM: Count how many bits in a 32 Bit number are set to 1 to compare it to the bithack. I noticed jz done at the top is probably not worth doing: it's fine to always run the loop body exacly once for an input of 0, 1, 2, or 3. For all other inputs, the jz is not-taken, so it's just a wasted uop that might mispredict. Unless inputs of 0 or 1 are very common, it's not worth a jz to speed up that case at the slight expense of others. (Or more than slight if it ever mispredicts.) – Barclay 4/5, 2021 at 9:3

assembly using 3-6 lines of code.

This example uses a 4 instruction loop:

popcntx proc    near
        mov     ecx,[esp+4]             ;ecx = value to popcnt
        xor     eax,eax                 ;will be popcnt
        test    ecx,ecx                 ;br if ecx == 0
        jz      popc1
popc0:  lea     edx,[ecx-1]             ;edx = ecx-1
        inc     eax                     ;eax += 1
        and     ecx,edx                 ;ecx &= (ecx-1)
        jnz     short popc0
popc1:  ret
popcntx endp

This example uses a 3 instruction loop, but it would be slower than the 4 instruction loop version on most processors.

popcntx proc    near
        mov     eax,[esp+4]             ;eax = value to popcnt
        mov     ecx,32                  ;ecx = max # 1 bits
        test    eax,eax                 ;br if eax == 0
        jz      popc1
popc0:  lea     edx,[eax-1]             ;eax &= (eax-1)
        and     eax,edx
        loopnz  popc0
popc1:  neg     ecx
        lea     eax,[ecx+32]
        ret
popcntx endp

This is an alternative non-looping example:

popcntx proc    near
        mov     ecx,[esp+4]             ;ecx = value to popcnt
        mov     edx,ecx                 ;edx = ecx
        shr     edx,1                   ;mov upr 2 bit field bits to lwr
        and     edx,055555555h          ; and mask them
        sub     ecx,edx                 ;ecx = 2 bit field counts
                                        ; 0->0, 1->1, 2->1, 3->1
        mov     eax,ecx
        shr     ecx,02h                 ;mov upr 2 bit field counts to lwr
        and     eax,033333333h          ;eax = lwr 2 bit field counts
        and     ecx,033333333h          ;edx = upr 2 bit field counts
        add     ecx,eax                 ;ecx = 4 bit field counts
        mov     eax,ecx
        shr     eax,04h                 ;mov upr 4 bit field counts to lwr
        add     eax,ecx                 ;eax = 8 bit field counts
        and     eax,00f0f0f0fh          ; after the and
        imul    eax,eax,01010101h       ;eax bit 24->28 = bit count
        shr     eax,018h                ;eax bit 0->4 = bit count
        ret
popcntx endp

Stribling answered 21/11, 2014 at 13:2 Comment(13)

Looks like it would be simpler to test eax,eax / jz .end (after zeroing ecx) to skip the loop if eax=0, instead of always jumping to the loop-condition. Same number of instructions, because you save the mov ebx,eax at the top of the loop. Also, use lea ebx, [rax-1] (or [eax-1] if this is 32-bit code) instead of mov / dec. Or with BMI1, use blsr ebx, eax / jnz. (Matt Godbolt mentioned this trick in his CppCon2017 talk youtube.com/watch?v=bSkpMdDe4g4). – Barclay 7/11, 2017 at 16:13

Also, use edx instead of ebx; edx is call-clobbered in typical calling conventions. – Barclay 7/11, 2017 at 16:13

I found it with a search for [x86] popcnt looking for a dup target for #47162450. lea edx, [eax-1] is 3 bytes: opcode, ModR/M, disp8. LEA only gets bulky if you use an index with no base to copy+shift (like lea edx, [eax*4]) because the only way to encode that is with a disp32=0. – Barclay 7/11, 2017 at 16:34

The optimized version of this I suggested (with lea) is very likely faster than a shift loop for many inputs (especially relatively sparse inputs with the set bits not clustered at the bottom). Especially on Intel where adc is 2 uops before Broadwell, and especially SnB-family where and/jnz macro-fuse into a compare-and-branch. Unrolling might be helpful. (You still need to branch, but the inner branch is and/jz to the end.) – Barclay 7/11, 2017 at 16:38

With many bits set, an 8-bit LUT has better throughput, but of course can cache miss if not used frequently. With SSSE3 but not popcnt, pshufb is reasonable. Except on SlowShuffle CPUs like Merom, and most of the uarches with pshufb but not popcnt have slow pshufb... But it's useful if you don't want to make a separate SSE4 version, just an SSSE3 version. (Not sure about using it for scalar data, though.) – Barclay 7/11, 2017 at 16:42

Oh right, derp yeah there are better bithack versions. The mul / sub one is probably best. #109523 – Barclay 7/11, 2017 at 16:43

But this is potentially good if the expected count is very low. It will suffer a branch miss, though. The branchless bithack may be best depending on the surrounding code. – Barclay 7/11, 2017 at 16:44

For the love of science, why would you use loopnz? Are you tuning for AMD specifically? On Intel CPUs, that's much worse than inc ecx / ... / jnz. Sure it's interesting, but only show that as the novelty 3-instruction trick version, not as the recommended / good version. (See #109523 for the non-looping version. Feed that to a compiler with unsigned). – Barclay 7/11, 2017 at 17:1

Avoiding loopnz is much faster on Intel (where loopnz is very slow), slightly slower on AMD Bulldozer-family (where loopnz is a single uop, but and/jnz don't fuse). K10 has slow loop as well; it's only Bulldozer-family / Ryzen where loopnz is fast. See the link in my previous comment. – Barclay 7/11, 2017 at 17:15

popcnt is probably not a good choice of function name. It's bound to lead to confusion when porting to NASM or something, because of the clash with the instruction name. Further suggestion: link to the bithack question for the non-loop version, and mention that it's probably faster. – Barclay 7/11, 2017 at 17:19

I'd recommend putting the loopnz version last. It's more of a "here's how you can jump through a hoop and win at code-size". It doesn't belong as the first code block. Other than that, nice answer. – Barclay 7/11, 2017 at 17:25

It's not always a good thing to make the most important / first part of your answer cover the weird special case of a specific question. Sure, answer it somewhere, but answering the generic question well is also important. – Barclay 7/11, 2017 at 17:40

@PeterCordes - I reordered the example, but left the non-looping version last so that the 4 and 3 instruction loop examples follow the quoted bit from the OP's question. – Stribling 7/11, 2017 at 17:50

The nicest think you can do is using built in popcount function as suggested by Paul R, but since you need to write it in assembly, this worked for me:

asm (
"start:                  \n"
        "and %0, %1      \n"
        "jz end          \n"
        "shr $0, %1      \n"
        "jnc start       \n"
        "inc %1          \n"
        "jmp start       \n"
"end:                    \n"
        : "+g" (result),
          "+r" (x)
        :
        : "cc"
);

At first two lines you just check the contents of x (and go to end if it's zero Jump Zero). Than you shift x one bit to right and:

At the end of the shift operation, the CF flag contains the last bit shifted out of the destinationoperand. *

If there's no CF set, just go to start (Jump Not Carry) else increment result and then go to start.

And the beautiful think about assembly is that you can do things in so many ways...

asm (
"start:                  \n"
        "shr $1, %1      \n"
        "jnc loop_cond   \n"
        "inc %0          \n"
        "and %1, %1      \n"
"loop_cond:              \n"
        "jnz start       \n"

        : "+g" (result),
          "+r" (x)
        :
        : "cc"
);

Here you again use SHift Right instruction, if no CF is present just go to loop condition.

Otherwise again increment result and call binary AND (INC does modify ZF).

Using `LOOP` and `ECX`

I was curious how to do this in 3 instruction (I figured your teacher wouldn't give you bottom limit of 3 if it wasn't possible) and I realized x86 also has LOOP instruction:

Each time the LOOP instruction is executed, the count register is decremented, then checked for 0. If the count is 0, the loop is terminated and program execution continues with the instruction following the LOOP instruction. If the count is not zero, a near jump is performed to the destination (target) operand, which is presumably the instruction at the beginning of the loop. *

And you can add input argument using GCCs input constrain:

c - The c register.

asm (
"start:              \n"
    "shr $1, %1      \n"
    "adc $0, %0      \n"
    "loop start      \n"

    : "+g" (result)
    : "r" (x),
      "c" (8)             // Assuming 8b type (char)
);

Just to make sure it compiles to proper assembly:

0x000000000040051f <+25>:   mov    $0x8,%ecx
0x0000000000400524 <+30>:   mov    -0x8(%rbp),%eax
0x0000000000400527 <+33>:   shr    %edx
0x0000000000400529 <+35>:   adc    $0x0,%eax
0x000000000040052c <+38>:   loop   0x400527 <main+33>

I think the first one should have a bit better performance, especially if there's just 1 bit set, this approach always does k*8 iterations.

SSE4 and single instruction

I know you have to use a loop, but just for fun... With SSE4 extension you could do this by just one instruction POPCNT:

This instruction calculates of number of bits set to 1 in the second operand (source) and returns the count in the first operand (a destination register). *

I think (I have a quite old CPU on my notebook, so I can't test this for you) you should be able to do this with just one simple instruction:

asm (   
    "POPCNT %1, %0   \n"
    : "=r" (result)
    : "mr" (x)
    : "cc"                                                                                                                                       
);

(If you try this and you do have SSE4 extension, please let me know if it works)

Performance

I've measured times required to do 100,000,000 popcounts comparing my first and second methods with David Wohlferd's. ^{[Raw data]}

+--------------+------------+------------+------------+
|              | 0x00000000 | 0x80000001 | 0xffffffff |
+--------------+------------+------------+------------+
| 1st solution |  0.543     |  5.040     |  3.833     |
| LOOP         | 11.530     | 11.523     | 11.523     |
| Davids       |  0.750     |  4.893     |  4.890     |
+--------------+------------+------------+------------+

If anyone can compare these 3 with SSE4's POPCNT instruction I'd be glad.

Reentry answered 20/11, 2014 at 23:45 Comment(5)

The first code block (with shr with a count of $0) appears to be buggy as well as inefficient. – Barclay 7/11, 2017 at 16:7

The asm constraints on the version using loop are still unsafe: you tell the compiler that "r"(x) and ECX are a read-only inputs, but you destroy them. – Barclay 27/2, 2019 at 23:5

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Using LOOP and ECX

SSE4 and single instruction

Performance

Using LOOP and ECX

SSE4 and single instruction

Performance

Recommended topics

Hot tags

Using `LOOP` and `ECX`

Using `LOOP` and `ECX`