The most efficient way to test if a positive integer is 2^n (i.e. 1, 2, 4, 8, etc.) in C++20?
Asked Answered
P

3

5

A handy method to verify if a positive integer n is a power of two (like 1, 2, 4, 8, etc.) is to use the following test for having no more than 1 bit set:

bool test = n & (n - 1) == 0;

This operation can be very efficient because it only involves subtraction, a bitwise AND and a conditional branch on the Zero Flag (ZF). If this expression is evaluated to true, then the number n is indeed a power of two.

Another method uses the std::popcount (population count) function, which is part of the C++20 standard library, for the test:

bool test = std::popcount(n) == 1; // (Since C++20)

This function counts the number of set bits (1s) in. If the hardware supports a popcount instruction (POPCNT), this function can be very fast.

In C++, you generally “pay for what you use”. For this test there is no use for counting.

What is the better method, in terms of CPU efficiency?

Perilous answered 12/6 at 14:2 Comment(7)
you can measure it. Before you can look at the generated assembly, I wouldnt be surprised if it is identicalUrana
The only real way to know is to benchmark both and compare unless the assembly is the sameDhammapada
What hardware? Which kind of efficiency, memory efficiency, power efficiency, ... ?Rimma
Note that std::popcount(n) == 1 / std::has_single_bit(n) is equivalent to ((n & (n-1)) == 0) && (n != 0)Braille
Like always it depends on your own specific hardware configuration (not only on C++). So you need to measureInveterate
Why does this need to be efficient? What's the range of efficiency you need?Ramage
On x86-64-v3 (which includes BMI1 + BMI2), you have blsr which does n & (n-1) (clearing the lowest set bit) in a single instruction, and setting FLAGS in a useful way for a branch. The FLAGS conditions can even distinguish no bits set (0) from one bit set: CF=1 if the input was zero, ZF=1 if the output is zero (i.e. ZF holds your bool test). You know your integer is positive so zero isn't a possible input, but many use-cases for std::has_single_bit can't assume that.Catechu
C
7

You know your number is positive (which excludes zero) so you can indeed just use n & (n-1) == 0 without checking n != 0. That's your most efficient option, potentially more efficient than C++20 std::has_single_bit

std::has_single_bit needs to rule out the no-bits-set case. For that, it can be slightly more efficient on modern x86 to do popcount(n) == 1 if the compiler can assume support for a hardware popcnt instruction, which is why std::has_single_bit is often defined that way in C++ standard libraries.

But since you know your number is non-zero, the bithack is most efficient. Especially if compiling for a target where the compiler can't assume a hardware popcount (like x86 without -march=x86-64-v2 or newer), or AArch64 before ARMv9.x where scalar popcount requires copying to vector regs and back. RISC-V only has hardware popcount in an uncommon extension, not baseline.

On x86, it can be as cheap as

  lea   eax, [rdi-1]
  test  eax, edi
   # boolean condition in ZF
  jz  or setz  or cmovz  or whatever

And similar on AArch64 with sub and tst. And pretty much any other modern RISC can subtract 1 while putting the result into a separate register, then AND those together cheaply.


And if you're compiling for -march=x86-64-v3 or later (BMI2 + AVX2 + FMA = Haswell feature set), the compiler can use BMI1 blsr eax, edi to clear ("reset") the lowest set bit and set FLAGS accordingly, with ZF set according to the output being zero. CF is set if the input was zero so some branch conditions can check that n!=0. But unfortunately conditions like jbe are CF=1 or ZF=1, ja is CF=0 and ZF=0. There isn't a single FLAGS condition that checks for ZF=1 and CF=0 which would let std::has_single_bit compile to just blsr plus a single branch, cmov, or setcc. ja is taken if the input had multiple bits set, not-taken for power-of-2 or zero.

Unlike test, blsr can't macro-fuse with a later jcc so it doesn't save any uops vs. lea/test if you're branching on it. It is better on Intel if the compiler is using it branchlessly, like for setnz or cmovnz. blsr is a single uop on Intel, and on AMD Zen 4 and later. 2 uops on Zen 3 and earlier (https://uops.info/)


Non-popcount way to exclude zero

For use-cases where you can't assume a non-zero input and don't have cheap hardware popcount, there's an alternate bithack that's 3 operations instead of two: (n - 1) < (n ^ (n - 1))

The right hand side is what x86 BMI1 blsmsk computes, but if we have blsmsk we have popcnt.
n-1 is a common subexpression so we only need to compute it once. For example RISC-V; AArch64 could be similar with cmp/bltu

 addi  x1, x0, -1             # n-1
 xor   x2, x1, x0             # n ^ (n-1)
 bltu  x1, x2, power_of_two   # branch on a comparison

For zero, n-1 is UINT_MAX, and n ^ anything is a no-op, so both sides are equal.

For a power of two, n-1 sets all the bits below where the set bit was, and XOR sets that bit again. So it's larger than n, and also larger than n-1.

For a non-power-of-two, n ^ (n-1) is still just a mask up to and including the lowest set bit, with the high bits cancelled (the ones that n-1 didn't flip). So it's smaller than n-1.

https://graphics.stanford.edu/~seander/bithacks.html#DetermineIfPowerOf2 also suggests v && !(v & (v - 1)); but I don't think that's better since logical && has to check both sides for non-zero.

Catechu answered 18/6 at 1:21 Comment(2)
By the way there is another "popcnt free" trick that does still work if n == 0, namely (n - 1) < (n ^ (n - 1)). The thing on the right can be blsmsk, then in total it can be something like blsmsk, sub, cmp (that's still an extra operation compared to not caring about n == 0 of course)Aniconic
@user555045: n-1 is a common subexpression so you could also do it with lea / xor / cmp. So still 3 instructions even without BMI1, or on ISAs like RISC-V (addi / xor / bgu). Cool, thanks for pointing that out.Catechu
B
15
bool test = std::has_single_bit(n);

std::has_single_bit should be converted to the most efficient sequence for the current platform

template< class T >
constexpr bool has_single_bit( T x ) noexcept;

(since C++20)

Checks if x is an integral power of two.

This overload participates in overload resolution only if T is an unsigned integer type (that is, unsigned char, unsigned short, unsigned int, unsigned long, unsigned long long, or an extended unsigned integer type).

Bahaism answered 12/6 at 14:7 Comment(4)
worth to mention that cppref shows possible implementations in terms of std::popcount(x) == 1; and a variant similar to OPs. Though, its of course just a "possible implementation"Urana
@Urana In my platform, the c++20 library defines std::has_single_bit( T x ) in terms of std::popcount(x) == 1. And the bit banging algorithm I provided below is the logic behind the Assembly for popcount.Cochabamba
@EdwinBuck: A good implementation of std::popcount will use a hardware popcnt instruction when available. Of course, GCC and Clang recognize that SWAR bithack idiom from your answer (and Count the number of set bits in a 32-bit integer), and optimize it as a popcount operation, using a single HW instruction if available. That bithack is what GCC's libgcc.a helper function uses on targets without HW popcount these days (formerly an 8-bit LUT), but writing your own version that way is unwise in C++20 when you could have used std::popcountCatechu
@EdwinBuck: That definition of has_single_bit as popcount(n) == 1 is reasonable if you don't know it's positive (guaranteed non-zero). If you do know that, on x86 at least it's quite a lot cheaper to do n & (n-1) == 0, especially with BMI1 blsr which does that in one instruction. Also AArch64 where popcount requires copying to vector regs and back unless you have ARMv9.x scalar cntCatechu
C
7

You know your number is positive (which excludes zero) so you can indeed just use n & (n-1) == 0 without checking n != 0. That's your most efficient option, potentially more efficient than C++20 std::has_single_bit

std::has_single_bit needs to rule out the no-bits-set case. For that, it can be slightly more efficient on modern x86 to do popcount(n) == 1 if the compiler can assume support for a hardware popcnt instruction, which is why std::has_single_bit is often defined that way in C++ standard libraries.

But since you know your number is non-zero, the bithack is most efficient. Especially if compiling for a target where the compiler can't assume a hardware popcount (like x86 without -march=x86-64-v2 or newer), or AArch64 before ARMv9.x where scalar popcount requires copying to vector regs and back. RISC-V only has hardware popcount in an uncommon extension, not baseline.

On x86, it can be as cheap as

  lea   eax, [rdi-1]
  test  eax, edi
   # boolean condition in ZF
  jz  or setz  or cmovz  or whatever

And similar on AArch64 with sub and tst. And pretty much any other modern RISC can subtract 1 while putting the result into a separate register, then AND those together cheaply.


And if you're compiling for -march=x86-64-v3 or later (BMI2 + AVX2 + FMA = Haswell feature set), the compiler can use BMI1 blsr eax, edi to clear ("reset") the lowest set bit and set FLAGS accordingly, with ZF set according to the output being zero. CF is set if the input was zero so some branch conditions can check that n!=0. But unfortunately conditions like jbe are CF=1 or ZF=1, ja is CF=0 and ZF=0. There isn't a single FLAGS condition that checks for ZF=1 and CF=0 which would let std::has_single_bit compile to just blsr plus a single branch, cmov, or setcc. ja is taken if the input had multiple bits set, not-taken for power-of-2 or zero.

Unlike test, blsr can't macro-fuse with a later jcc so it doesn't save any uops vs. lea/test if you're branching on it. It is better on Intel if the compiler is using it branchlessly, like for setnz or cmovnz. blsr is a single uop on Intel, and on AMD Zen 4 and later. 2 uops on Zen 3 and earlier (https://uops.info/)


Non-popcount way to exclude zero

For use-cases where you can't assume a non-zero input and don't have cheap hardware popcount, there's an alternate bithack that's 3 operations instead of two: (n - 1) < (n ^ (n - 1))

The right hand side is what x86 BMI1 blsmsk computes, but if we have blsmsk we have popcnt.
n-1 is a common subexpression so we only need to compute it once. For example RISC-V; AArch64 could be similar with cmp/bltu

 addi  x1, x0, -1             # n-1
 xor   x2, x1, x0             # n ^ (n-1)
 bltu  x1, x2, power_of_two   # branch on a comparison

For zero, n-1 is UINT_MAX, and n ^ anything is a no-op, so both sides are equal.

For a power of two, n-1 sets all the bits below where the set bit was, and XOR sets that bit again. So it's larger than n, and also larger than n-1.

For a non-power-of-two, n ^ (n-1) is still just a mask up to and including the lowest set bit, with the high bits cancelled (the ones that n-1 didn't flip). So it's smaller than n-1.

https://graphics.stanford.edu/~seander/bithacks.html#DetermineIfPowerOf2 also suggests v && !(v & (v - 1)); but I don't think that's better since logical && has to check both sides for non-zero.

Catechu answered 18/6 at 1:21 Comment(2)
By the way there is another "popcnt free" trick that does still work if n == 0, namely (n - 1) < (n ^ (n - 1)). The thing on the right can be blsmsk, then in total it can be something like blsmsk, sub, cmp (that's still an extra operation compared to not caring about n == 0 of course)Aniconic
@user555045: n-1 is a common subexpression so you could also do it with lea / xor / cmp. So still 3 instructions even without BMI1, or on ISAs like RISC-V (addi / xor / bgu). Cool, thanks for pointing that out.Catechu
C
-2

How readable do you want your code?

   x = x - ((x >> 1) & 0x55555555);
   x = (x & 0x33333333) + ((x >> 2) & 0x33333333);
   x = (x + (x >> 4)) & 0x0F0F0F0F;
   x = x + (x >> 8);
   x = x + (x >> 16);
   if (x == 1) {
   }

This is the bit counting algorithm from "Hacker's Delight" an old treasure trove of useful bit banging algorithms.

   x = x - ((x >> 1) & 0x55555555);

performs a pair-wise bit addition. Note that

000 >>1 = 000, 000 & 101 = 0
001 >>1 = 000, 000 & 101 = 0
010 >>1 = 001, 001 & 101 = 1
011 >>1 = 001, 001 & 101 = 1
100 >>1 = 010, 010 & 101 = 0
101 >>1 = 010, 010 & 101 = 0
110 >>1 = 011, 011 & 101 = 1
111 >>1 = 011, 011 & 101 = 1

So the ((x >> 1) & 0x555555555) acts like an "is the 2 set" of each two bits, in order within the word.

This means that subtracting the result will subtract 1 if the 2 is set in each pair, or subtract nothing if it is not set. In each pair of bits, you now have the count of that pair, stored where the pair used to be.

The other lines then do pairwise summation, adding up the counts of each pair of "2 bits", adding up the counts of each pair of "4 bits", adding up the count of each pair of "8 bits" and so on. If you need to do two 64 bit words, you can add an extra line.

Finally, to see if you have one bit set, you check to see if the count is equal to 1.

Note that while this is fun to consider, it's better to use std::has_single_bit, which uses this same algorithm (but parameterized to handle different length words, and checks that the bit count (the algorithm I'm talking about) is equal to 1. The above solution is for 32 bit words, but can easily be altered for 64 bit words.

https://github.com/gcc-mirror/gcc/blob/bd3a312728fbf8c35a09239b9180269f938f872e/libstdc%2B%2B-v3/include/std/bit#L325

Hacker's delight https://learning.oreilly.com/library/view/hackers-delight-second/9780133084993/

Cochabamba answered 12/6 at 14:35 Comment(9)
Everyone, I'm not advocating this in production code, but sometimes it's fun to lift the covers and see how these things are done.Cochabamba
It is also not an answer to this question, because it clearly isn't efficient (wrt to the already proposed solutions by OP)Inveterate
@PepijnKramer It is exactly the same number of CPU operations as the above (so the logic is exactly the same efficiency, being x MMU operations for a 2^x sized number, and one comparison operation), but this solution is without the creating of a function call stack and destruction of a call stack, so it is actually more efficient. So, in a sense it is more efficient, but I still would ask that people use the std::has_single_bit because it contains a name that makes the function obvious as to what it does. --- To date, this algorithm hasn't been beaten, yet.Cochabamba
I think my thinking is biased here (I apologize). I mostly use x86 processors, the POPCNT instruction on has a 3-cycle latency / 1-cycle throughput (so that's pretty efficient). And yes it is a pretty cool algorithm :)Inveterate
the most readable version is bool test = n & (n - 1) == 0; as in the OP. Counting bits then check if it's 1 is vastly inefficientBahaism
@Bahaism n & n-1 == 0 doesn't even get close to checking if only one bit is set, for example 0x08 & 0x07 = 0x0F which isn't equal to 1, when 0x08 is obviously just one bit set.Cochabamba
@EdwinBuck 0x08 & 0x07 = 0x0F? Huh? 0b01000 & 0b00111 has no overlapping bits so it's 0. 0x08 | 0x07 would be 0x0FTracy
@AndrewHenle You are right, it would be 0. I guess I was a bit tired, and too quick to answer, I used bitwise and instead of bitwise or.Cochabamba
If you don't have a hardware popcount, you definitely don't want to do a full popcount bithack just to test if it's 1 or not. In C++20 it makes no sense to hand-write a bithack popcount. Except maybe to work around a bad implementation of std::popcount when compiling for CPUs without a native popcnt instruction. But still only in cases where you did need a full popcount, not one iteration of the clear-lowest-set-bit bithack, n & (n-1) (e.g. x86 BMI1 blsr which even sets FLAGS in a way that lets you detect if exactly 1 bit was set, vs. 0 or 1 bits set if n&(n-1) == 0.)Catechu

© 2022 - 2024 — McMap. All rights reserved.