How does this algorithm to count the number of set bits in a 32-bit integer work?
M

4

25
int SWAR(unsigned int i)
{
    i = i - ((i >> 1) & 0x55555555);
    i = (i & 0x33333333) + ((i >> 2) & 0x33333333);
    return (((i + (i >> 4)) & 0x0F0F0F0F) * 0x01010101) >> 24;
}

I have seen this code, which counts the number of bits equal to 1 in a 32-bit integer, and I noticed that it performs better than __builtin_popcount, but I can't understand how it works.

Can someone give a detailed explanation of how this code works?

Mitten answered 27/2, 2014 at 22:16 Comment(3)
When you say it performs better, a couple of questions arise. Are you using a processor that actually implements the popcnt instruction? From what I'm reading, it didn't arrive until the Nehalem architecture (essentially the SSE4.2 timeframe). The compiler will normally emulate it if the architecture does not support it. Also, even if it does, the reason will normally be that the compiler inserts code to check for processor support before using it! That takes time. This version doesn't assume anything, and does the job as well as it can assuming no HW support.Accrual
@Accrual I ran this code on an Intel Pentium III 733 MHz and it performs better than __builtin_popcount. I'm still more interested in how this code works than in why it is faster.Mitten
Well that explains the speed difference, that instruction wasn't available in your processor! I will get back to you on the functioning of this particular routine, but try to run through a couple of examples for yourself first, might be a fun exercise :-)Accrual
H
42

OK, let's go through the code line by line:

Line 1:

i = i - ((i >> 1) & 0x55555555);

First of all, the significance of the constant 0x55555555 is that (written using the Java / GCC style binary literal notation),

0x55555555 = 0b01010101010101010101010101010101

That is, all its odd-numbered bits (counting the lowest bit as bit 1 = odd) are 1, and all the even-numbered bits are 0.

The expression ((i >> 1) & 0x55555555) thus shifts the bits of i right by one, and then sets all the even-numbered bits to zero. (Equivalently, we could've first set all the odd-numbered bits of i to zero with & 0xAAAAAAAA and then shifted the result right by one bit.) For convenience, let's call this intermediate value j.

What happens when we subtract this j from the original i? Well, let's see what would happen if i had only two bits:

    i           j         i - j
----------------------------------
0 = 0b00    0 = 0b00    0 = 0b00
1 = 0b01    0 = 0b00    1 = 0b01
2 = 0b10    1 = 0b01    1 = 0b01
3 = 0b11    1 = 0b01    2 = 0b10

Hey! We've managed to count the bits of our two-bit number!
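
The table can also be checked mechanically. Here is a minimal sketch that restricts the first line to a single two-bit value; the helper names pair_count and check_pair_table are made up for illustration and are not part of the original code:

```c
#include <assert.h>

/* Hypothetical helper: the first line of SWAR, restricted to a 2-bit value.
   With only two bits, (i >> 1) & 0x55555555 reduces to (i >> 1) & 1. */
static unsigned pair_count(unsigned i)
{
    unsigned j = (i >> 1) & 0x1u;   /* j = the upper bit, moved down */
    return i - j;                   /* 0b00->0, 0b01->1, 0b10->1, 0b11->2 */
}

/* Compares i - j with a direct count of both bits, for all four inputs. */
static void check_pair_table(void)
{
    for (unsigned i = 0; i < 4; i++)
        assert(pair_count(i) == (i & 1u) + ((i >> 1) & 1u));
}
```

Calling check_pair_table() from a main function should pass silently if the table is right.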

OK, but what if i has more than two bits? In fact, it's pretty easy to check that the lowest two bits of i - j will still be given by the table above, and so will the third and fourth bits, and the fifth and sixth bits, and so on. In particular:

  • despite the >> 1, the lowest two bits of i - j are not affected by the third or higher bits of i, since they'll be masked out of j by the & 0x55555555; and

  • since the lowest two bits of j can never have a greater numerical value than those of i, the subtraction will never borrow from the third bit of i: thus, the lowest two bits of i also cannot affect the third or higher bits of i - j.

In fact, by repeating the same argument, we can see that the calculation on this line, in effect, applies the table above to each of the 16 two-bit blocks in i in parallel. That is, after executing this line, the lowest two bits of the new value of i will now contain the number of bits set among the corresponding bits in the original value of i, and so will the next two bits, and so on.
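
To see the "16 blocks in parallel" claim in action, here is a rough sketch that applies the first line to a full 32-bit value and then inspects each two-bit field separately. The names pair_counts and check_pair_counts are illustrative assumptions, and uint32_t is used in place of unsigned int to pin the width:

```c
#include <assert.h>
#include <stdint.h>

/* First line of SWAR: each 2-bit field ends up holding its own bit count. */
static uint32_t pair_counts(uint32_t i)
{
    return i - ((i >> 1) & 0x55555555u);
}

/* Hypothetical checker: compares every 2-bit field of pair_counts(i)
   with a direct count of the two corresponding input bits. */
static void check_pair_counts(uint32_t i)
{
    uint32_t r = pair_counts(i);
    for (int f = 0; f < 16; f++) {
        uint32_t in  = (i >> (2 * f)) & 0x3u;   /* field f of the input  */
        uint32_t out = (r >> (2 * f)) & 0x3u;   /* field f of the result */
        assert(out == (in & 1u) + (in >> 1));
    }
}
```

For example, pair_counts(0xFFFFFFFF) gives 0xAAAAAAAA: every two-bit field holds 0b10, i.e. the count 2.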

Line 2:

i = (i & 0x33333333) + ((i >> 2) & 0x33333333);

Compared to the first line, this one's quite simple. First, note that

0x33333333 = 0b00110011001100110011001100110011

Thus, i & 0x33333333 takes the two-bit counts calculated above and throws away every second one of them, while (i >> 2) & 0x33333333 does the same after shifting i right by two bits. Then we add the results together.

Thus, in effect, what this line does is take the bitcounts of the lowest two and the second-lowest two bits of the original input, computed on the previous line, and add them together to give the bitcount of the lowest four bits of the input. And, again, it does this in parallel for all the 8 four-bit blocks (= hex digits) of the input.
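
As a small illustration of the state after the first two lines, the sketch below stops there and returns the intermediate value, so that each hex digit of the result can be read off as the bit count of the matching hex digit of the input. The function name nibble_counts is an assumption made for this example:

```c
#include <stdint.h>

/* Lines 1 and 2 of SWAR: each 4-bit field ends up holding the number
   of set bits in the corresponding hex digit of the input. */
static uint32_t nibble_counts(uint32_t i)
{
    i = i - ((i >> 1) & 0x55555555u);                   /* 2-bit counts */
    i = (i & 0x33333333u) + ((i >> 2) & 0x33333333u);   /* 4-bit counts */
    return i;
}
```

For instance, nibble_counts(0x0137FEC8) should give 0x01234321, since the hex digits 0, 1, 3, 7, F, E, C and 8 contain 0, 1, 2, 3, 4, 3, 2 and 1 set bits respectively.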

Line 3:

return (((i + (i >> 4)) & 0x0F0F0F0F) * 0x01010101) >> 24;

OK, what's going on here?

Well, first of all, (i + (i >> 4)) & 0x0F0F0F0F does exactly the same as the previous line, except it adds the adjacent four-bit bitcounts together to give the bitcounts of each eight-bit block (i.e. byte) of the input. (Here, unlike on the previous line, we can get away with moving the & outside the addition, since we know that the eight-bit bitcount can never exceed 8, and therefore will fit inside four bits without overflowing.)

Now we have a 32-bit number consisting of four 8-bit bytes, each byte holding the number of 1-bits in that byte of the original input. (Let's call these bytes A, B, C and D.) So what happens when we multiply this value (let's call it k) by 0x01010101?

Well, since 0x01010101 = (1 << 24) + (1 << 16) + (1 << 8) + 1, we have:

k * 0x01010101 = (k << 24) + (k << 16) + (k << 8) + k

Thus, the highest byte of the result ends up being the sum of:

  • its original value, due to the k term, plus
  • the value of the next lower byte, due to the k << 8 term, plus
  • the value of the byte below that, due to the k << 16 term, plus
  • the value of the fourth and lowest byte, due to the k << 24 term.

(In general, there could also be carries from lower bytes, but since we know the value of each byte is at most 8, we know the addition will never overflow and create a carry.)

That is, the highest byte of k * 0x01010101 ends up being the sum of the bitcounts of all the bytes of the input, i.e. the total bitcount of the 32-bit input number. The final >> 24 then simply shifts this value down from the highest byte to the lowest.
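
The multiplication trick can be written out both ways for comparison. In this sketch, sum_bytes_mul and sum_bytes_shifts are made-up names, and k is assumed to hold four byte-wide counts of at most 8 each, as produced by the previous step:

```c
#include <stdint.h>

/* The multiply-and-shift from line 3: the top byte of the product
   collects A + B + C + D, and >> 24 moves it down. */
static uint32_t sum_bytes_mul(uint32_t k)
{
    return (k * 0x01010101u) >> 24;
}

/* The same sum written out as the four shifted copies of k. */
static uint32_t sum_bytes_shifts(uint32_t k)
{
    return ((k << 24) + (k << 16) + (k << 8) + k) >> 24;
}
```

For k = 0x01020304 both functions return 10 (1 + 2 + 3 + 4), and for k = 0x08080808, the largest value that can occur here, both return 32.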

Ps. This code could easily be extended to 64-bit integers, simply by changing the 0x01010101 to 0x0101010101010101 and the >> 24 to >> 56. Indeed, the same method would even work for 128-bit integers; 256 bits would require adding one extra shift / add / mask step, however, since the number 256 no longer quite fits into an 8-bit byte.
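
For reference, here is a rough sketch of what a 64-bit variant might look like, assuming the input is a uint64_t. The name popcount64 is hypothetical. Note that in this sketch the three masks are widened along with the multiplier, since 32-bit masks would discard the upper half of the input:

```c
#include <stdint.h>

/* Sketch of a 64-bit variant (popcount64 is a made-up name).
   Every mask is widened to 64 bits, not only the multiplier. */
static int popcount64(uint64_t i)
{
    i = i - ((i >> 1) & 0x5555555555555555ULL);
    i = (i & 0x3333333333333333ULL) + ((i >> 2) & 0x3333333333333333ULL);
    /* Eight byte counts of at most 8 each sum to at most 64,
       which still fits in the top byte of the product. */
    return (int)((((i + (i >> 4)) & 0x0F0F0F0F0F0F0F0FULL)
                  * 0x0101010101010101ULL) >> 56);
}
```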

Henebry answered 28/2, 2014 at 0:54 Comment(3)
"Ps. This code could easily be extended to 64-bit integers, simply by changing the 0x01010101 to 0x0101010101010101 and the >> 24 to >> 56." All other constants need to be extended to 64 bits too: 0x55555555 to 0x5555555555555555, 0x33333333 to 0x3333333333333333, 0x0F0F0F0F to 0x0F0F0F0F0F0F0F0F and, indeed, 0x01010101 to 0x0101010101010101Dubitation
yeer's algorithm given below seems to be simpler than the one given above in terms of understanding. Any pointers on efficiency?Boring
@Hari: The OP's code is basically a tweaked version of what yeer posted, with some unnecessary masking operations removed and the last two shift-mask-and-sum steps replaced by a single multiplication and shift. I haven't benchmarked them, but I'd expect the OP's code to be faster on platforms (like most modern x86 CPUs) that have a fast integer multiplier, and slower on those where multiplication is slow. But on such platforms the compiler should be able to optimize multiplication by the constant 0x01010101 into something like three shifts and three adds, so it might still be about the same.Henebry
S
16

I prefer this one, it's much easier to understand.

x = (x & 0x55555555) + ((x >> 1) & 0x55555555);
x = (x & 0x33333333) + ((x >> 2) & 0x33333333);
x = (x & 0x0f0f0f0f) + ((x >> 4) & 0x0f0f0f0f);
x = (x & 0x00ff00ff) + ((x >> 8) & 0x00ff00ff);
x = (x & 0x0000ffff) + ((x >> 16) & 0x0000ffff);
Sickert answered 14/4, 2015 at 7:53 Comment(2)
Without any explanation (as given in the other answers), this isn't any more understandable than the question itself.Friction
@OlafDietsche all right. When I look back, it took me a while to recall. But still, once you are into bit operations, it should not be too hard to understand. Try converting the constants to binary. For example: 0x55555555 => 0b 0101010101010101 0101010101010101, 0x33333333 => 0b 0011001100110011 0011001100110011 ...Sickert
P
4

This is a comment to Ilamari's answer. I put it as an answer because of format issues:

Line 1:

i = i - ((i >> 1) & 0x55555555);  // (1)

This line is derived from this easier to understand line:

i = (i & 0x55555555) + ((i >> 1) & 0x55555555);  // (2)

If we call

i = input value
j0 = i & 0x55555555
j1 = (i >> 1) & 0x55555555
k = output value

We can rewrite (1) and (2) to make the explanation clearer:

k =  i - j1; // (3)
k = j0 + j1; // (4)

We want to demonstrate that (3) can be derived from (4).

i can be written as the addition of its even and odd bits (counting the lowest bit as bit 1 = odd):

i = i_odd + i_even
  = (i & 0x55555555) + (i & 0xAAAAAAAA)
  = (i & m_odd) + (i & m_even)

Since the m_even mask clears the lowest bit of i, shifting right by one and then left by one loses nothing, so the last equality can be written this way:

i = (i & m_odd) + (((i >> 1) & m_odd) << 1)
  = j0 + 2*j1

That is:

j0 = i - 2*j1    (5)

Finally, substituting (5) into (4) we obtain (3):

k = j0 + j1 = i - 2*j1 + j1 = i - j1
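
The derivation can be spot-checked in code. This is only a sketch: check_identity is a made-up helper, and it tests individual inputs rather than proving the identity:

```c
#include <assert.h>
#include <stdint.h>

/* Checks i == j0 + 2*j1 and i - j1 == j0 + j1 for one input. */
static void check_identity(uint32_t i)
{
    uint32_t j0 = i & 0x55555555u;          /* odd-numbered bits, in place */
    uint32_t j1 = (i >> 1) & 0x55555555u;   /* even-numbered bits, shifted down */
    assert(i == j0 + 2u * j1);              /* the decomposition behind (5) */
    assert(i - j1 == j0 + j1);              /* (3) gives the same value as (4) */
}
```

Calling it with a handful of values such as 0, 0xFFFFFFFF and 0x12345678 should trigger no assertion.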
Possibly answered 1/2, 2017 at 16:28 Comment(0)
P
1

This is an explanation of yeer's answer:

int SWAR(unsigned int i) {
  i = (i & 0x55555555) + ((i >> 1) & 0x55555555);  // A
  i = (i & 0x33333333) + ((i >> 2) & 0x33333333);  // B
  i = (i & 0x0f0f0f0f) + ((i >> 4) & 0x0f0f0f0f);  // C
  i = (i & 0x00ff00ff) + ((i >> 8) & 0x00ff00ff);  // D
  i = (i & 0x0000ffff) + ((i >> 16) & 0x0000ffff); // E
  return i;
}

Let's use Line A as the basis of my explanation.

i = (i & 0x55555555) + ((i >> 1) & 0x55555555)

Let's rename the above expression as follows:

i = (i & mask) + ((i >> 1) & mask)
  = A1         + A2 

First, think of i not as 32 bits, but rather as an array of 16 groups, 2 bits each. A1 is the count array of size 16, each group containing the count of 1s at the right-most bit of the corresponding group in i:

i        = yx yx yx yx yx yx yx yx yx yx yx yx yx yx yx yx
mask     = 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01
i & mask = 0x 0x 0x 0x 0x 0x 0x 0x 0x 0x 0x 0x 0x 0x 0x 0x

Similarly, A2 is "counting" the left-most bit for each group in i. Note that I can rewrite A2 = (i >> 1) & mask as A2 = (i & mask2) >> 1:

i                = yx yx yx yx yx yx yx yx yx yx yx yx yx yx yx yx
mask2            = 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
(i & mask2)      = y0 y0 y0 y0 y0 y0 y0 y0 y0 y0 y0 y0 y0 y0 y0 y0
(i & mask2) >> 1 = 0y 0y 0y 0y 0y 0y 0y 0y 0y 0y 0y 0y 0y 0y 0y 0y

(Note that mask2 = 0xaaaaaaaa)

Thus, A1 + A2 adds the counts of the A1 array and the A2 array, resulting in an array of 16 groups, each of which now contains the number of set bits in the corresponding 2-bit group of the original i.
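
The claim that A2 can be computed either way is easy to try out. In the sketch below, the two function names are assumptions chosen for this example; one shifts and then masks, as in the code, while the other masks with mask2 and then shifts, as in the diagram:

```c
#include <stdint.h>

/* A2 computed as in the code: shift first, then mask with 0x55555555. */
static uint32_t a2_shift_then_mask(uint32_t i)
{
    return (i >> 1) & 0x55555555u;
}

/* A2 computed as in the diagram: mask with mask2 = 0xaaaaaaaa, then shift. */
static uint32_t a2_mask_then_shift(uint32_t i)
{
    return (i & 0xAAAAAAAAu) >> 1;
}
```

Both should return the same value for any input, because shifting the mask 0xaaaaaaaa right by one bit gives exactly 0x55555555.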

Moving onto Line B, we can rename the line as follows:

i = (i & 0x33333333) + ((i >> 2) & 0x33333333)
  = (i & mask)       + ((i >> 2) & mask)
  = B1               + B2

B1 + B2 follows the same "form" as A1 + A2 from before. Think of i no longer as 16 groups of 2 bits, but rather as 8 groups of 4 bits. As before, B1 + B2 adds the counts of B1 and B2 together, where B1 holds the counts of 1s in the right half of each group, and B2 holds the counts of 1s in the left half. B1 + B2 is thus the count of set bits in each 4-bit group.

Lines C through E now become more easily understandable:

int SWAR(unsigned int i) {
  // A: 16 groups of 2 bits, each group contains number of 1s in that group.
  i = (i & 0x55555555) + ((i >> 1) & 0x55555555);
  // B: 8 groups of 4 bits, each group contains number of 1s in that group.
  i = (i & 0x33333333) + ((i >> 2) & 0x33333333);
  // C: 4 groups of 8 bits, each group contains number of 1s in that group.
  i = (i & 0x0f0f0f0f) + ((i >> 4) & 0x0f0f0f0f);
  // D: 2 groups of 16 bits, each group contains number of 1s in that group.
  i = (i & 0x00ff00ff) + ((i >> 8) & 0x00ff00ff);
  // E: 1 group of 32 bits, containing the number of 1s in that group.
  i = (i & 0x0000ffff) + ((i >> 16) & 0x0000ffff);

  return i;
}
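
As a sanity check, both variants can be compared against a deliberately simple bit-by-bit loop. The following is a sketch only; the names popcount_naive, popcount_ladder, popcount_mul and check_all_agree are invented for this comparison, and uint32_t stands in for unsigned int:

```c
#include <assert.h>
#include <stdint.h>

/* Reference: count bits one at a time (slow but obvious). */
static int popcount_naive(uint32_t i)
{
    int n = 0;
    for (; i != 0; i >>= 1)
        n += (int)(i & 1u);
    return n;
}

/* The five-step version explained above. */
static int popcount_ladder(uint32_t i)
{
    i = (i & 0x55555555u) + ((i >> 1) & 0x55555555u);
    i = (i & 0x33333333u) + ((i >> 2) & 0x33333333u);
    i = (i & 0x0f0f0f0fu) + ((i >> 4) & 0x0f0f0f0fu);
    i = (i & 0x00ff00ffu) + ((i >> 8) & 0x00ff00ffu);
    i = (i & 0x0000ffffu) + ((i >> 16) & 0x0000ffffu);
    return (int)i;
}

/* The multiply version from the question. */
static int popcount_mul(uint32_t i)
{
    i = i - ((i >> 1) & 0x55555555u);
    i = (i & 0x33333333u) + ((i >> 2) & 0x33333333u);
    return (int)((((i + (i >> 4)) & 0x0F0F0F0Fu) * 0x01010101u) >> 24);
}

/* All three should agree on any input. */
static void check_all_agree(uint32_t i)
{
    assert(popcount_ladder(i) == popcount_naive(i));
    assert(popcount_mul(i) == popcount_naive(i));
}
```

For example, all three functions should return 24 for the input 0xDEADBEEF.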
Prudery answered 17/1, 2022 at 16:52 Comment(0)
