Algorithm for finding the smallest power of two that's greater than or equal to a given value
O

17

56

I need to find the smallest power of two that's greater than or equal to a given value. So far, I have this:

int value = 3221; // 3221 is just an example, could be any number
int result = 1;

while (result < value) result <<= 1;

It works fine, but feels kind of naive. Is there a better algorithm for that problem?


Related: Rounding up to next power of 2 has some C answers; C++20 std::bit_ceil() isn't available in C, so the ideas could be useful for older C++ code, too.

Most of the answers to this question predate C++20, but could still be useful if implementing a C++ standard library or compiler.

Also related: language-agnostic Given an integer, how do I find the next largest power of two using bit-twiddling? has a C++17 constexpr answer using GNU extensions.

Owen answered 13/12, 2008 at 8:8 Comment(9)
The application is in C++, but I can take some C# or Java too, so feel free to use your favorite.Owen
There are good answers too at stackoverflow.com/questions/466204/…Botel
Updated link to lookup table solution mentioned in Sparr's answer: graphics.stanford.edu/~seander/bithacks.html#IntegerLogLookup That gives a value of r = 0 to 31, then do 1 << r to get the power of 2.Armoire
Answer in python: https://mcmap.net/q/19995/-given-an-integer-how-do-i-find-the-next-largest-power-of-two-using-bit-twiddlingSplenetic
This is not a duplicate; this can now be done in standard C++20 with the <bit> header. The same answer cannot be given to a C question.Drusie
@JakeSchmidt: How would it be done?Gathers
@MikeSmith std::bitceilDrusie
@JakeSchmidt: Correction: std::bit_ceil.Buckle
I have reopened this question, because the proposed duplicate is for a different programming language.Mogilev
P
69

Here's my favorite. Other than the initial check for whether it's invalid (<0, which you could skip if you knew you'd only have >=0 numbers passed in), it has no loops or conditionals, and thus will outperform most other methods. This is similar to erickson's answer, but I think that my decrementing x at the beginning and adding 1 at the end is a little less awkward than his answer (and also avoids the conditional at the end).

/// Round up to next higher power of 2 (return x if it's already a power
/// of 2).
inline int
pow2roundup (int x)
{
    if (x < 0)
        return 0;
    --x;            // so that exact powers of 2 are not rounded up further
    x |= x >> 1;    // smear the highest set bit into every lower bit...
    x |= x >> 2;
    x |= x >> 4;
    x |= x >> 8;
    x |= x >> 16;   // ...so x now has the form 2^k - 1
    return x+1;
}

An answer on Given an integer, how do I find the next largest power of two using bit-twiddling? presents some explanation of how this common algorithm works, and examples of the bit patterns for a couple of inputs. (That version uses unsigned, which allows avoiding the x<0 check and is generally better, as discussed in the comments.)
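For reference, here is a minimal sketch of an unsigned variant, widened here to 64 bits (my code and naming, not part of the original answer):

#include <cstdint>

// Unsigned 64-bit sketch of the same idea: one extra shift/OR step covers
// the wider type. 0 maps to 0, and values above 2^63 wrap around to 0.
inline uint64_t pow2roundup_u64 (uint64_t v)
{
    --v;
    v |= v >> 1;
    v |= v >> 2;
    v |= v >> 4;
    v |= v >> 8;
    v |= v >> 16;
    v |= v >> 32;   // the step a 32-bit version doesn't need
    return v + 1;
}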

The same dec / shift-OR / inc strategy also appears in Hacker's Delight and on the Bit Twiddling Hacks page, among other places (see the comments below).

Pentimento answered 13/12, 2008 at 10:8 Comment(23)
Except for minor syntax differences and the addition of that initial check, yours is also nearly identical to the version given in Henry S. Warren, Jr.'s "Hacker's Delight".Graphology
@Boojum: Thanks for mentioning that book. I've checked it out and it's got the solution I need, plus much more!Owen
Maybe this solution will be slower than doing it in Assembler, but it has the advantage of being completely portable.Owen
So was your solution, @Boyan, and I think yours was a little more readable.Awhirl
@Pax: Yes, but this one executes in constant time. And it's not that hard to understand, especially if you run it two-three times through the debugger :)Owen
Wouldn't "if (x <= 0)" be better than "if (x < 0)" ?Hypaethral
I agree, this is a better version of my approach (which I deleted). +1!Zampino
Well, I guess it's time to call it a day. Probably can't get more efficient than that solution. Thanks to all!Owen
@Boyan: This solution is not portable e.g., how does it work on x64? (it doesn't)Linger
@J.F.: It works with a minor change. Could be done for both architectures with conditional compilation.Owen
A lookup table based approach can be up to twice as fast as this one.Esophagus
Actually, this algorithm resembles what a multiply instruction does. I wonder if there's a way to make use of that?Vaughn
@Esophagus But do you really want to store 4GB of data for a uint32_t?Spigot
@Spigot I'm thinking of a table with 31 entries.Esophagus
@Esophagus Ah. Is tbl[n] really faster than 1<<n? I suppose it would depend a lot on cache characteristics...Spigot
We probably need some rigorous benchmarking to find out. I reckon that in a real application, the call pattern of this function will bear out whether @Sparr's LUT will be present in cache long enough to keep it ahead. The bit shifting approach is just a simple series of instructions, but the LUT is a good number less instructions but it has a moderately sized data segment. The LUT's data segment (the LUT itself...) is larger in size than the bit shifting approach's instruction count. So, the cost that comes with that LUT is the segmentation of the information into lookup data and instructions.Tympanum
hackersdelight.org/hdcodetxt/clp2.c.txtNoonberg
Why return 0? 0 is not a power of 2, so I think you should return 1 instead (2^0).Oneeyed
In addition to Hacker's Delight, as mentioned by @Boojum, this solution is also found almost verbatim in Bit Twiddling Hacks (2001; graphics.stanford.edu/~seander/bithacks.html#RoundUpPowerOf2) where it's credited to Sean Anderson, and even earlier in a Usenet thread (groups.google.com/forum/#!topic/comp.lang.python/xNOq4N-RffU) from 1997 by Pete Hart and William Lewis.Schematic
I am disappointed that no one has bothered to explain why this works. In short, any number x > 0 will have at least one non-zero bit. Let j be the index of the highest-order set bit in x. Then the shift operations set all bits at indices [0, j] to 1. Adding 1 returns 2^(j+1). This is exactly what we want, except for when x is already a power of 2 - so we subtract one at the beginning.Rijeka
Some problems here. It has undefined behaviour for zero input due to signed integer overflow. It also crucially depends on the size of an int. Both these problems can be solved simply by changing int to uint32_t. Although you could argue that the function should not actually be defined for x==0 or at least give the more correct answer of 1.Scopp
@nattgris: For an input of 0, the x-- step produces -1. That's unsigned wraparound of the 2's complement bit-pattern, but it's not signed overflow. The right shifts are required to be implementation-defined for negative signed integers; sane implementations pick arithmetic right shift; it would also be legal to pick logical right shift. But on a 2's complement system the bit-pattern necessarily stays all-one because we OR with the original -1. Then x++ goes from -1 to 0, in the middle of the signed range, nowhere near signed overflow.Grimes
@nattgris: The would be UB for INT_MIN, where x-- overflows if it didn't take an early-out on negative x. So mostly the gain from using uint32_t is efficiency and simplicity, and value-range. And lack of dependency on 2's complement signed integers at least for the 0 input case. 100% agreed it's not good to use int here, any unsigned type would be better, but the problems don't include UB.Grimes
L
21
ceil(log2(value))

ilog2() can be calculated in 3 asm instructions; see, e.g., http://www.asterisk.org/doxygen/1.4/log2comp_8h-source.html
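A direct, if not particularly fast, way to spell this out in C++ (my sketch; as the comments below note, you still need the 1 << step, and floating-point rounding near exact powers of 2 is a caveat):

#include <cmath>

// Sketch only: assumes 0 < value <= 2^30 so the shift stays in range,
// and trusts log2() to be exact at exact powers of 2.
int next_pow2_fp (int value)
{
    return 1 << (int)std::ceil(std::log2((double)value));
}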

Linger answered 13/12, 2008 at 8:18 Comment(8)
That's something I considered, but won't it be slower than my bit-pushing solution?Owen
I think this will be slower than the loop mentioned in the questionDiscourse
There is nothing inherently slow here; taking the result of log2(value) and rounding it up to the nearest whole number cannot be beaten for efficiency when the log table lookups are all pre-computed.Gironde
Log-base-2 generally takes a float as an argument. Are you saying you have a lookup table with an entry for every possible float? I hope not... Of course, the fastest way is a lookup table with 2^32 entries but that's a bit memory-expensive.Awhirl
@J.F.: If we throw in some Assembler, your solution really looks better. Thanks for the link!Owen
You mean 1<<ceil(log2(value))?Waltman
The link is broken...Vise
Note it's only really 1 asm instruction, the 3 instructions for x86 are to handle the special case of 0 as an input, taking advantage of the fact that all real CPUs leave BSR's destination unmodified when the input is 0. (AMD documents this, Intel does it too but documents the result as an undefined value.) It generates -1 with a funky code-size optimization for 32-bit mode, using xor-zeroing and dec instead of a more efficient mov $-1, %eax for 2 insns. Modern x86 can use lzcnt, but lzcnt(0) ==32, instead of the -1 that this code wants. 1<<-1 on x86 is 1<<31; it masksGrimes
C
12

On Intel hardware the BSR instruction is close to what you want - it finds the most-significant set bit. If you need to be more precise, you can then check whether the remaining bits are precisely zero or not. I tend to assume that other CPUs will have something like BSR - this is a question you want answered to normalize a number. If your number is more than 32 bits then you would scan from your most-significant DWORD to find the first DWORD with ANY bits set.

Edsger Dijkstra would likely remark that the above "algorithms" assume that your computer uses binary digits, while from his kind of lofty "algorithmic" perspective you should think about Turing machines or something - obviously I am of the more pragmatic style.
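A rough sketch of that idea in C++ (my code; it uses GCC/Clang's __builtin_clz as a portable stand-in for BSR, since the answer itself is about raw assembly):

// Find the most-significant set bit, then check whether the remaining
// lower bits are precisely zero to decide if one more shift is needed.
// Assumes 0 < x <= 2^31.
unsigned next_pow2_bsr_style (unsigned x)
{
    unsigned msb  = 31u - (unsigned)__builtin_clz(x);   // BSR equivalent
    unsigned high = 1u << msb;                           // highest power of 2 <= x
    return (x & (high - 1)) ? high << 1 : high;          // any lower bits set?
}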

Cathleencathlene answered 13/12, 2008 at 8:39 Comment(8)
Yes, maybe I'll do some Assembler. After finding the most significant set bit, I think I can do: if (~most_sign_bit & value) to find if I have to left-shift the value once. Is that correct?Owen
I looked in MSDN and there is a compiler intrinsic called _BitScanReverse() - this is better than assembler because you can't do inline assembly in x64 and you don't want to waste a procedure call to an external x64 routine. Assuming you're using MS compilers, of course.Cathleencathlene
The (~MSB & value ) sounds perfect - of course single-stepping will tell for sure !Cathleencathlene
There is a bit of cleanup to do since both 4 and 5 will return 2 for the MSB, while the right answers for those values are 4 and 8 respectively. Nevertheless, I like the BSR solution -- I tend to forget about that instruction.Tildi
@DocMax: Yes, that's why I'll use (~MSB & value) after BSR to find out if I need one left shift.Owen
Ah, my mistake. I had misread the approach. I do not know what you are thinking for the check, but you can avoid a branch by using BSF like: mov eax,x; bsr ecx,eax; bsf edx,eax; sub edx,ecx; adc ecx,0; mov ebx,1; shl ebx,cl. (And assembly on a single line reads like a chess game. :)Tildi
pow2roundup() above with the 8 votes solves exactly the prb posed. To create a normalized FP number you must still find the index of the MSB, so BSR does have a mission, but it's work you don't need if ALL you need is roundup - must admit I didn't know of pow2roundup() - clever !Cathleencathlene
Google Query (( BSR XEON CLOCK )) has 3rd hit "Nearest power of 2 - GameDev.Net Discussion Forums". These guys benched BSR and a Flt-Pt-Trick, evidently such things are constant-time on XEON. _BitScanReverse() is not mentioned - inlining matters as the actual operation is so fast on such machines.Cathleencathlene
E
12

In the spirit of Quake III's 0x5f3759df and the Bit Twiddling Hacks' IEEE version, this solution reaches into a double to extract the exponent as a means to calculate floor(lg2(n)). It's a bit faster than the accepted solution and much faster than the Bit Twiddling IEEE version since it avoids floating-point math. As coded, it assumes a double is a real*8 IEEE float on a little-endian machine.

int nextPow2(int n) 
{ 
    if ( n <= 1 ) return n;
    double d = n-1; 
    return 1 << ((((int*)&d)[1]>>20)-1022); 
} 

Edit: Added an optimized x86 assembly version with the help of a co-worker. A 4% speed gain, but still about 50% slower than a bsr version (6 sec vs 4 on my laptop for n=1..2^31-2).

int nextPow2(int n) 
{ 
    if ( n <= 1 ) return n;
    double d;
    n--;
    __asm {
      fild    n 
      mov     eax,4
      fstp    d 
      mov     ecx, dword ptr d[eax]
      sar     ecx,14h 
      rol     eax,cl 
  }
} 
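
For what it's worth, the C version of the trick can also be written without the int* cast. This is a strict-aliasing-safe sketch of mine, not the answerer's code; it keeps the IEEE-754 double assumption but drops the little-endian one:

#include <cstdint>
#include <cstring>

// Extract the biased exponent of (double)(n-1) via memcpy instead of an
// int* cast. Assumes IEEE-754 double and 2 <= n <= 2^30.
int nextPow2_safe (int n)
{
    if (n <= 1) return n;
    double d = n - 1;
    uint64_t bits;
    std::memcpy(&bits, &d, sizeof bits);    // well-defined way to read the bits
    int e = (int)(bits >> 52) - 1022;       // floor(log2(n-1)) + 1
    return 1 << e;
}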
Exine answered 20/1, 2009 at 19:35 Comment(4)
Is this really efficient comparing to BSR instruction?Amorous
@Explorer09: No, not remotely. The pure C version (with strict-aliasing violation) is just a way to take advantage of the leading-zero finding of int->FP conversion on a C implementation without intrinsics for bit-scan instructions like GNU C __builtin_clz. bsr reg,reg is 1 uop, 3c latency on Intel. uops.info (Or 6 uops with 4c latency, 4c throughput on AMD Zen 3 for example; lzcnt is fast on AMD and same speed on Intel, so prefer that if you can.) This is >6 single-uop instructions including store/reload latency, including several count outside the inline-asm block.Grimes
@Explorer09: If ISO C would stop sucking and provide portable functions for this stuff, we wouldn't have any reason to hack up horrible code like this. Platforms with an FPU but without integer bit-scan instructions might use something like this as the implementation as a worst case, but platforms like x86 could just use BSR. ISO C++20 finally got around to doing this with std::countl_zero(T).Grimes
@Explorer09: So TL:DR: if you're going to use x86 inline asm at all, it's very silly to keep using the FPU instead of mov ecx, n / bsr ecx, ecx (not bsr ecx, n, that would have a false dependency on the old value of n), and adapt it to use the bit-index instead of the double exponent. Or better, use intrinsics for BSR and ROL to avoid inline asm entirely, especially MSVC style that forces spilling inputs to memory because the syntax sucks too much to have them in registers.Grimes
J
8

Here's a template version of the bit shifting technique.

#include <climits>   // CHAR_BIT
#include <cstddef>   // size_t

template<typename T> T next_power2(T value)
{
    --value;
    for(size_t i = 1; i < sizeof(T) * CHAR_BIT; i*=2)
        value |= value >> i;
    return value+1;
}

Since the loop bounds use only constants, it gets flattened by the compiler (I checked). The function is also future-proof.

Here's one that uses __builtin_clz. (Also future proof)

template<typename T> T next_power2(T value)
{
    return 1 << ((sizeof(T) * CHAR_BIT) - __builtin_clz(value-1));
}
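
A width-aware sketch along the lines of the comments below (my code, not the answerer's): pick the builtin that matches the operand size so the shift also works for 64-bit types.

#include <climits>   // CHAR_BIT
#include <cstdint>

// Assumes 2 <= value and that the result is representable in T.
template<typename T> T next_power2_clz (T value)
{
    const unsigned bits = sizeof(T) * CHAR_BIT;
    const unsigned lead = (bits <= 32)
        ? (unsigned)__builtin_clz((uint32_t)(value - 1)) - (32 - bits)
        : (unsigned)__builtin_clzll((uint64_t)(value - 1)) - (64 - bits);
    return (T)1 << (bits - lead);
}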
Jeroboam answered 18/3, 2013 at 14:26 Comment(2)
If T is unsigned long or unsigned long long, you'll want __builtin_clzll and 1ULL << ...Grimes
In C++20, use std::bit_width (index of MSB = x86 BSR = 32-clz. In fact, C++20 even has std::bit_ceil which is the answer to this whole question. (Its example implementation uses T(1) << ... to make sure we don't overflow shifting an int 1 before conversion to T.)Grimes
A
6

Your implementation is not naive, it's actually the logical one - except that it's wrong: it returns a negative number for values greater than half the maximum integer value.

Assuming you can restrict numbers to the range 0 through 2^30 (for 32-bit ints), it'll work just fine, and be a lot faster than any mathematical functions involving logarithms.

Unsigned ints would work slightly better, but you'd end up with an infinite loop for numbers greater than 2^31, since you can never reach 2^32 with the << operator.

Awhirl answered 13/12, 2008 at 8:24 Comment(4)
Yes, the values would be greater than zero and much less than 2^31.Owen
Then what you have is about as fast as it's going to get. I don't doubt there's a boolean algebra solution that'll do it in 2 or three operations, but you'd sacrifice lots of readability for a very small performance gain.Awhirl
I think you might be able to speed it up using a binary search: Initialize result with the (statically) computed middle of the number range and shift left / right. The more steps you precompute on this search tree, the lower the average shift-amount becomesEpiscopacy
I just don't know if this will really be faster, because for many ranges the solution above could be faster despite the higher number of shifts.Episcopacy
H
5

pow(2, ceil(log2(value)));

log2(value) = log(value) / log(2);

Hearsay answered 1/4, 2013 at 22:37 Comment(0)
E
4

An exploration of the possible solutions to a closely related problem (that is, rounding down instead of up), many of which are significantly faster than the naive approach, is available on the Bit Twiddling Hacks page, an excellent resource for the kinds of optimization you are looking for. The fastest solution there uses a lookup table with 256 entries, which reduces the total operation count to around 7, from an average of 62 (by a similar operation-counting methodology) for the naive approach. Adapting those solutions to your problem is a matter of a single comparison and increment.
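
To make that adaptation concrete, here is a rough sketch (my code and naming) of the linked page's 256-entry log-table approach, with the final comparison and increment to round up instead of down:

#include <cstdint>

static uint8_t log2_tab[256];           // log2_tab[i] == floor(log2(i)) for i >= 1

static void init_log2_tab ()
{
    log2_tab[0] = log2_tab[1] = 0;
    for (int i = 2; i < 256; ++i)
        log2_tab[i] = (uint8_t)(log2_tab[i / 2] + 1);
}

static unsigned ilog2 (uint32_t v)      // floor(log2(v)); precondition: v != 0
{
    uint32_t t, tt;
    if ((tt = v >> 16))
        return (t = tt >> 8) ? 24 + log2_tab[t] : 16 + log2_tab[tt];
    return (t = v >> 8) ? 8 + log2_tab[t] : log2_tab[v];
}

uint32_t next_pow2_lut (uint32_t v)     // smallest power of 2 >= v, v in [1, 2^31]
{
    unsigned k = ilog2(v);
    if ((UINT32_C(1) << k) < v)         // the single comparison...
        ++k;                            // ...and increment
    return UINT32_C(1) << k;
}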

Esophagus answered 17/12, 2008 at 1:42 Comment(1)
Link is out-of-date. Newer link: graphics.stanford.edu/~seander/bithacks.html#IntegerLogLookup That gives a value of r = 0 to 31, then do 1 << r to get the power of 2.Armoire
H
3

You don't really say what you mean by "better algorithm" but as the one you present is perfectly clear (if somewhat flawed), I'll assume you are after a more efficient algorithm.

Larry Gritz has given what is probably the most efficient C/C++ algorithm without the overhead of a lookup table, and it will suffice in most cases (see http://www.hackersdelight.org for similar algorithms).

As mentioned elsewhere, most CPUs these days have machine instructions to count the number of leading zeroes (or, equivalently, return the most significant set bit), but their use is non-portable and - in most cases - not worth the effort.

However, most compilers have "intrinsic" functions that expose those machine instructions in a more portable way.

Microsoft C++ has _BitScanReverse() and gcc provides __builtin_clz() to do the bulk of the work efficiently.
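
A small dispatch sketch along those lines (my wrapper and naming; _BitScanReverse and __builtin_clz are the real intrinsics, the rest is illustration):

#include <cstdint>
#if defined(_MSC_VER)
#include <intrin.h>
#endif

// Index of the most significant set bit of x (precondition: x != 0).
inline unsigned msb_index (uint32_t x)
{
#if defined(_MSC_VER)
    unsigned long idx;
    _BitScanReverse(&idx, x);
    return (unsigned)idx;
#else
    return 31u - (unsigned)__builtin_clz(x);
#endif
}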

Hypaethral answered 13/12, 2008 at 13:10 Comment(0)
H
3

How about a recursive template version to generate a compile-time constant:

template<uint32_t A, uint8_t B = 16>
struct Pow2RoundDown { enum{ value = Pow2RoundDown<(A | (A >> B)), B/2>::value }; };
template<uint32_t A>
struct Pow2RoundDown<A, 1> { enum{ value = (A | (A >> 1)) - ((A | (A >> 1)) >> 1) }; };

template<uint32_t A, uint8_t B = 16>
struct Pow2RoundUp { enum{ value = Pow2RoundUp<((B == 16 ? (A-1) : A) | ((B == 16 ? (A-1) : A) >> B)), B/2>::value }; };
template<uint32_t A >
struct Pow2RoundUp<A, 1> { enum{ value = ((A | (A >> 1)) + 1) }; };

Can be used like so:

Pow2RoundDown<3221>::value, Pow2RoundUp<3221>::value
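
If a compile-time constant is all that's needed, a constexpr function is a simpler alternative in later standards; a sketch of mine (C++14 or newer):

#include <cstdint>

constexpr uint32_t pow2_round_up (uint32_t a)
{
    --a;
    a |= a >> 1;  a |= a >> 2;  a |= a >> 4;
    a |= a >> 8;  a |= a >> 16;
    return a + 1;
}

static_assert(pow2_round_up(3221) == 4096, "compile-time check");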
Hubbell answered 18/2, 2014 at 1:33 Comment(0)
D
3

In standard C++20, a function template in <bit> does this; see cppreference.

#include <bit>
unsigned long upper_power_of_two(unsigned long v)
{
    return std::bit_ceil(v);
}

Only unsigned integer types participate in overload resolution if you don't use the bit_ceil<T> template parameter explicitly.

Beware that bit_ceil has undefined behaviour if the result isn't representable in the input type, not just a garbage result.
This applies even for unsigned integer types, where arithmetic is well-defined to wrap.

For example, std::bit_ceil(-123) will implicitly convert that signed int input to unsigned, so it will operate on -123u, e.g. 0xffffff85u on a system with 32-bit int. The correct result would take 33 bits, more than the width of unsigned, so the behaviour is undefined.

This is true for negative inputs in general on 2's complement systems, except for INT_MIN / LONG_MIN etc. which have the same bit-pattern as 1u<<(n-1), i.e. 2**(n-1)

Drusie answered 26/2, 2023 at 20:31 Comment(0)
T
1

The code below repeatedly strips the lowest set bit off until the number is a power of two, then doubles the result unless the number was a power of two to begin with. It has the advantage of running in time proportional to the number of bits set. Unfortunately, it has the disadvantage of requiring more instructions in almost all cases than either the code in the question or the assembly suggestions. I include it only for completeness.

int nextPow(int x) {
  int y = x;
  while (x &= (x ^ (~x + 1)))   // clear the lowest set bit; loop while any bits remain
    y = x << 1;                 // on the last pass x is the highest set bit, so y becomes that bit << 1
  return y;
}
Tildi answered 13/12, 2008 at 9:34 Comment(0)
V
1

I know this is downvote-bait, but if the number is small enough (like 8 or 16 bits) a direct lookup might be fastest.

// fill in the table
unsigned short tab[65536];
unsigned short bit = tab[i];

It might be possible to extend it to 32 bits by first doing the high word and then the low.

//
unsigned long bitHigh = ((unsigned long)tab[(unsigned short)(i >> 16)]) << 16;
unsigned long bitLow = 0;
if (bitHigh == 0){
    bitLow = tab[(unsigned short)(i & 0xffff)];
}
unsigned long answer = bitHigh | bitLow;

It's probably no better than the shift-or methods, but maybe could be extended to larger word sizes.

(Actually, this gives the highest 1-bit. You'd have to shift it left by 1 to get the next higher power of 2.)
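
For completeness, one way the "fill in the table" step might look (my sketch; tab[i] holds the highest 1-bit of i, matching the note above):

// Fill tab[i] with the highest set bit of i (tab[0] stays 0).
unsigned short tab[65536];

void fill_tab ()
{
    tab[0] = 0;
    for (unsigned i = 1; i < 65536; ++i)
        tab[i] = (unsigned short)(tab[i >> 1] ? (tab[i >> 1] << 1) : 1u);
}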

Vaughn answered 17/12, 2008 at 1:57 Comment(3)
Using a 256-entry lookup table and using the results for all 4 bytes of a 4-byte word is quite viable. The memory/speed tradeoff of a 65536-entry table is not great (14% speedup, 25500% memory increase).Esophagus
@Sparr. You're right. It's hard to beat the shift-or method, but it's interesting to keep an eye out for more ways to skin cats.Vaughn
Once I heard a Dijkstra lecture on fun algorithms. He had a student who did an n-bit rotate by doing a sequence of 3 bit-reversals.Vaughn
L
1

My version of the same:

int pwr2Test(size_t x) {
    return (x & (x - 1))? 0 : 1; 
}

size_t pwr2Floor(size_t x) {
    // A lookup table for rounding 4-bit numbers down to
    // the nearest power of 2 (floor).
    static const unsigned char pwr2lut[] = {
        0x00, 0x01, 0x02, 0x02,     //  0,  1,  2,  3
        0x04, 0x04, 0x04, 0x04,     //  4,  5,  6,  7
        0x08, 0x08, 0x08, 0x08,     //  8,  9, 10, 11
        0x08, 0x08, 0x08, 0x08      // 12, 13, 14, 15
    };

    size_t pwr2 = 0;                // The return value
    unsigned int i = 0;             // The nybble iterator

    for( i = 0; x != 0; ++i ) {     // Iterate through nybbles
        pwr2 = pwr2lut[x & 0x0f];   // rounding down to powers of 2.
        x >>= 4;                    // (i - 1) will contain the
    }                               // highest non-zero nybble index.

    i = i? (i - 1) : i;
    pwr2 <<= (i * 4);
    return pwr2; 
}

size_t pwr2Size(size_t x) {
    if( pwr2Test(x) ) { return x; }
    return pwr2Floor(x) * 2; 
}
Leddy answered 12/5, 2010 at 23:47 Comment(0)
N
0

I love the shift.

I'll settle for

    int bufferPow = 1;
    while ( bufferPow<bufferSize && bufferPow>0) bufferPow <<= 1;

That way the loop always terminates, and the part after && is almost never evaluated. And I do not think two lines are worth a function call. Also, you can make it a long, or a short, depending on your judgment, and it is very readable. (If bufferPow becomes negative, hopefully your main code will exit fast.)

Usually you compute a power of 2 only once, at the start of an algorithm, so optimizing it would be silly anyway. However, I would be interested if anyone bored enough cared for a speed contest... using the above examples and the values 255, 256, 257, ..., 4195, 4196, 4197.

Newmown answered 20/8, 2012 at 19:11 Comment(0)
V
0

An arbitrary log function can be converted to a log base 2 by dividing by the log of 2:

$ /usr/local/pypy-1.9/bin/pypy
Python 2.7.2 (341e1e3821ff, Jun 07 2012, 15:38:48)
[PyPy 1.9.0 with GCC 4.4.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
And now for something completely different: ``<arigato> yes but there is not
much sense if I explain all about today's greatest idea if tomorrow it's
completely outdated''
>>>> import math
>>>> print math.log(65535)/math.log(2)
15.9999779861
>>>> print math.log(65536)/math.log(2)
16.0
>>>>

It of course won't be 100% precise, since there is floating point arithmetic involved.

Variegate answered 21/8, 2012 at 17:30 Comment(1)
You can also do print math.log(65535,2)Latt
N
-2

This works and is really fast (on my 2.66 GHz Intel Core 2 Duo 64-bit processor).

#include <iostream>
#include <cstdio>
int main(void) {
    int testinput, counter = 0;   // counter must start at zero
    std::cin >> testinput;
    while (testinput > 1) {
        testinput = testinput >> 1;
        counter++;
    }
    int finalnum = testinput << (counter + 1);
    printf("Is %i\n", finalnum);
    return 0;
}

I tested it on 3, 6, and 65496, and correct answers (4, 8, and 65536) were given.

Sorry if this seems a bit arcane, I was under the influence of a couple of hours of Doom just before writing. :)

Naphtha answered 20/8, 2011 at 18:16 Comment(1)
This looks worse than what the OP started with, not to mention some other algorithms presented hereBytom
