Fastest way of computing the power that a "power of 2" number used?

What would be the quickest way to find the power of 2 that a certain number (itself a power of two) used?

I'm not very skilled at mathematics, so I'm not sure how best to describe it. But the function would look similar to x = 2^y, where y is the output and x is the input. Here's a table of inputs and outputs, if that helps explain it.

0 = f(1)
1 = f(2)
2 = f(4)
3 = f(8)
...
8 = f(256)
9 = f(512)

I've made a function that does this, but I fear it's not very efficient (or elegant for that matter). Would there be a simpler and more efficient way of doing this? I'm using this to compute what area of a texture is used to buffer how drawing is done, so it's called at least once for every drawn object. Here's the function I've made so far:

uint32 getThePowerOfTwo(uint32 value){
    for(uint32 n = 0; n < 32; ++n){
        if(value <= (1u << n)){ // 1u: shifting a signed 1 into bit 31 is undefined
            return n;
        }
    }
    return 32; // should never be reached
}
Camacho answered 29/1, 2014 at 17:45 Comment(14)
Try a logarithm.Splenectomy
I do agree on logarithm usage. But do try to write the binary representation of the numbers you want your function to work with. hint hintGrowth
"Here's the function I've made so far:" - I think that function is just fine. Do not worry about efficiency. Also, shifting the 1 bit until you find the result greater than your number is perfectly "elegant" in the sense that it's easy to read/understand and implement.Manchukuo
BTW, you should learn a bit of elementary maths. (You won't get too far with programming if you don't recognize a logarithm.)Manchukuo
@Growth - I edited in the representation, thanks. (As for logarithm usage, I suppose it's essentially a log base 2, but using std::log would be more costly, I assume)Camacho
@Clairvoire Again, do not worry about efficiency at all unless you have a proper benchmark that tells you that this is the code that makes your program slow. Just forget about speed, especially while you are learning the language. The biggest problem with std::log is that it uses the programmer's #1 enemy: floating-point numbers. As a result, it often returns incorrect results.Manchukuo
@H2CO3 - Actually, that's a part of my concern with log, I try to avoid floating point when working with integers for that reason. I'm quite familiar with the language though, I just don't know many math terms by heart, is all. My concern was just that there might've been a common efficient solution to this common problem, and I was 'reinventing the wheel' lopsidedly.Camacho
@Clairvoire If you know about the floating-point disaster, then that's fine.Manchukuo
@Clairvoire Just FYI, the code in the answer you accepted does not terminate.Manchukuo
@H2CO3 - It would terminate once value is right shifted all the way out (thus making it zero)? edit-Nevermind, I see now, it was in a conditional statement. Nice catch!Camacho
@Clairvoire Not quite. It has just been fixed.Manchukuo
Edit: it seems it hasn't been fixed at all.Manchukuo
Depending on your use case, allowing the compiler to do the work for you may be the fastest solution (see Sebastian's answer). Of course, this assumes you can use C++11, and only care about the values at compile time rather than run time.Containerize
graphics.stanford.edu/~seander/bithacks.html#IntegerLogObvious Profiteer

Building on woolstar's answer - I wonder if a binary search of a lookup table would be slightly faster? (and much nicer looking)...

#include <algorithm> // std::lower_bound
#include <iterator>  // std::begin, std::end

int getThePowerOfTwo(unsigned value) {
    // the table must be unsigned: 1<<31 overflows a signed int, and the
    // resulting negative entry would break the sort order lower_bound needs
    static constexpr unsigned twos[] = {
        1u<<0,  1u<<1,  1u<<2,  1u<<3,  1u<<4,  1u<<5,  1u<<6,  1u<<7,
        1u<<8,  1u<<9,  1u<<10, 1u<<11, 1u<<12, 1u<<13, 1u<<14, 1u<<15,
        1u<<16, 1u<<17, 1u<<18, 1u<<19, 1u<<20, 1u<<21, 1u<<22, 1u<<23,
        1u<<24, 1u<<25, 1u<<26, 1u<<27, 1u<<28, 1u<<29, 1u<<30, 1u<<31
    };

    return std::lower_bound(std::begin(twos), std::end(twos), value) - std::begin(twos);
}
Salpingitis answered 29/1, 2014 at 18:25 Comment(15)
+1 if you can figure out how to make it work for any arbitrary sized value.Rothschild
If I could select two answers, I would have selected this one as well.Camacho
@Clairvoire Instead, you decided to select a superfluously clever one, that was so clever that even its author couldn't get it right.Manchukuo
@H2CO3 - I picked it based on its logic and content. I haven't had the chance to test or profile it yet. If it ends up not working, I can just reassign the answer. As for "superfluously clever", I suppose that just depends on who's reading it. To a non-programmer, all of C++ could be called that.Camacho
@Clairvoire no, it's quite objective. We are programmers, so that's a non sequitur. You see, it was so "clever" that it took two edits for the author to even make it correct. I am benchmarking it against the naive implementation right now and so far it doesn't seem any faster.Manchukuo
@H2CO3 Is this one faster than the naive implementation? I'd be interested to see the timings you measureSalpingitis
@Dave Yes, this is as fast as the "clever" approach presented above.Manchukuo
@Clairvoire So, I suggest you accept this answer instead. It is just as fast as the approach in the currently accepted answer, yet it's crystal clear and simple.Manchukuo
@H2CO3 I have just profiled all three functions, after feeding them an array of 10000 random powers of 2 (ranging from 0 to 15) with optimizations enabled, a few different times. My function consumed an average of 0.55ms, this one consumed 0.85ms, and woolstar's consumed 0.50ms. It's clearer, granted! But if that becomes a concern, I can fall back on the one I wrote, which is fairly clear and has comparable speed to woolstar's log2()Camacho
@Clairvoire 10000 is not quite enough. Make a loop of 100 million iterations. By the way, if you only need numbers from 0 to 15, then the naive approach is always going to be faster, since it starts at the right end of the bit pattern. In my most recent test on numbers 0...15, a naive version took 306075 microseconds to complete 100 million cycles, whereas the clever version ran in 1553767 microseconds - so the "optimized" code was 5 times slower than the simple one. And if you think about how they work, you'll see why. All in all, we can conclude that this particular function...Manchukuo
@Clairvoire is not a candidate for micro-optimizations.Manchukuo
@H2CO3 - I think I see the reason for the discrepancy between our results. My test sample wasn't completely random, but instead used pow(2, randomElement%16), meaning it weighted numbers closer to zero more heavily. Just inputting pure random numbers from 0 to 2^16, though, made this the better function, because most of the numbers are in the later exponent range. GRANTED overall, it means nothing. But it's good to have a definitive answer, for curiosity's sake. I'll accept this answer since its performance was more consistent, regardless of the range.Camacho
@Clairvoire This supports 32-bit, not just 16-bit. You should see a greater difference over the naive implementation as you use a greater range of numbers (up to 2^32)Salpingitis
@Dave By the way, here's a benchmark where the naive approach beats the ship out of the two others: link to CoLiRuManchukuo
@H2CO3 Ya, in that case the naive implementation will always do 0-4 iterations. The binary search versions will do 3-5, and they have more constant overhead too. Binary search isn't so useful when you are only searching through 4 things lol...Salpingitis

This operation is sufficiently popular that processor vendors have come up with hardware support for it. Check out find first set. Compiler vendors offer specific intrinsics for it; unfortunately there is no standard name for them, so if you need maximum performance you have to write compiler-dependent code:

#ifdef __GNUC__
    return __builtin_ffs( x ) - 1; // GCC: ffs gives the 1-based index of the lowest set bit
#endif
#ifdef _MSC_VER
    // needs <intrin.h> for __lzcnt and <limits.h> for CHAR_BIT; the extra -1
    // fixes an off-by-one (32 - lzcnt would give n + 1 for x == 2^n)
    return CHAR_BIT * sizeof(x) - 1 - __lzcnt( x ); // Visual Studio
#endif
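
Worth noting: since C++20 the standard library finally does name this operation. std::countr_zero in <bit> counts trailing zero bits, which for a power of two is exactly the exponent. A minimal portable sketch (the wrapper name is only illustrative):

#include <bit> // C++20

unsigned getPow2Exponent(unsigned x) {
    return std::countr_zero(x); // trailing zeros == log2(x) when x is a power of two
}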
Atlantes answered 29/1, 2014 at 18:10 Comment(0)

If the input value is always 2^n, where n is an integer, an optimal way to find n is to use a hash table with a perfect hash function. In that case, the hash function for a 32-bit unsigned integer can be defined as value % 37: all 32 powers of two map to distinct residues modulo 37, so the lookup is collision-free.

#include <algorithm> // std::fill
#include <array>
#include <cstddef>
#include <cstdint>
#include <limits>

template < size_t Div > // renamed from _Div: identifiers like _Div are reserved
std::array < uint8_t, Div > build_hash()
{
    std::array < uint8_t, Div > hash_;

    std::fill(hash_.begin(), hash_.end(), std::numeric_limits<uint8_t>::max());

    for (size_t index_ = 0; index_ < 32; ++index_)
        hash_[(1u << index_) % Div] = index_; // 1u avoids signed overflow at index 31

    return hash_;
}

uint8_t hash_log2(uint32_t value_)
{
    static const std::array < uint8_t, 37 > hash_ = build_hash<37> ();

    return hash_[value_ % 37];
}

Check

#include <cassert>

int main()
{
    for (size_t index_ = 0; index_ < 32; ++index_)
        assert(hash_log2(1u << index_) == index_);
}
Wyly answered 30/1, 2014 at 5:12 Comment(0)

Your version is just fine, but as you surmised, it's O(n), which means it takes one step through the loop for every bit. You can do better. To take it to the next step, try doing the equivalent of a divide and conquer:

unsigned int log2(unsigned int value)
{
  unsigned int val = 0 ;
  unsigned int mask= 0xffff0000 ;
  unsigned int step= 16 ;

  while ( value && step )
  {
    if ( value & mask ) { val += step ;  value >>= step ; }
    mask >>= ( step + 1 ) / 2 ;
    step /= 2 ;
  }

  return val ;
}

Since we're just hunting for the highest bit, we start out asking if any bits are on in the upper half of the word. If there are, we can throw away all the lower bits, else we just narrow the search down.

Since the question was marked C++, here's a version using templates that tries to figure out the initial mask & step:

template <typename T>
  T log2(T val) // T should be an unsigned integer type
  {
    T result = 0 ;
    T step= ( 4 * sizeof( T ) ) ;           // half the number of bits
    T mask= ~T(0) << ( 4 * sizeof( T ) ) ;  // high half of the word

    while ( val && step )
    {
      if ( val & mask ) { result += step ;  val >>= step ; }
      mask >>= ( step + 1 ) / 2 ;
      step /= 2 ;
    }

    return result ;
  }

While the performance of either version is going to be a blip on a modern x86 architecture, this has come up for me in embedded work. In the last case, where I was solving a bit search very similar to this, even the O(log N) version was too slow for the interrupt, and we had to use a combination of divide and conquer plus table lookup to squeeze the last few cycles out.
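
For illustration only, here is a sketch of that kind of hybrid (not the actual interrupt code; the names nibbleLog2 and log2Hybrid are made up): three mask tests narrow a 32-bit value down to a nibble, and a 16-entry table finishes the job.

// hypothetical divide-and-conquer + table-lookup hybrid for 32-bit values
static const unsigned char nibbleLog2[16] = {
    0, 0, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3  // floor(log2(i)); entry 0 unused
};

unsigned int log2Hybrid(unsigned int v)
{
    unsigned int r = 0;
    if (v & 0xffff0000u) { r += 16; v >>= 16; } // anything in the high half-word?
    if (v & 0xff00u)     { r += 8;  v >>= 8;  } // high byte of what's left?
    if (v & 0xf0u)       { r += 4;  v >>= 4;  } // high nibble?
    return r + nibbleLog2[v & 0xf];             // table handles the last 4 bits
}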

Rothschild answered 29/1, 2014 at 17:55 Comment(5)
Yeah, this is nice. In theory. However, I'd be interested in an actual benchmark; I am not sure that this is indeed faster (I know, I know, O(log n) vs O(n), but there are those "constant factors", y'know -- this does more computation than OP's naive version). Also, readability >>> performance unless OP explicitly requires a very, very fast, fine-tuned version for a specific purpose (after having benchmarked his program and concluded that this function is a major bottleneck).Manchukuo
you see, that's why readability >>> performance. Now this makes even more instructions per loop, and you couldn't even get it right the first time.Manchukuo
@H2CO3 I will agree with you that this is an excellent lesson that optimizing something means more code, and quite frequently more bugs; and it should generally be done only when required.Rothschild
I finally got around to testing this function to see if it's accurate and profile it. It's very slightly faster than my function, by about 10%. It's the fastest answer, but yes, it's at a cost to clarity. I'll leave this as the answer however, since it's the only portable answer that trumps the one I wrote in terms of efficiencyCamacho
I just fiddled with the test using a uniform distribution in the input data, and found Dave's answer was faster. When the distribution is similar to pow(2, x), this one wins because smaller numbers are weighted more, but not when it's spread out evenly. So I'll have to accept his answer on this.Camacho

If you KNOW that it is indeed a power of two (easy enough to verify: v && (v & (v-1)) == 0), try the variant below. The multiplication slides a de Bruijn-style bit sequence up to the single set bit, leaving a unique 5-bit index in the top bits that selects the table entry. Full description here: http://sree.kotay.com/2007/04/shift-registers-and-de-bruijn-sequences_10.html

#include <cassert>
#include <cstdint>

//table
static const int8_t xs_KotayBits[32] =    {
       0,  1,  2, 16,  3,  6, 17, 21,
       14,  4,  7,  9, 18, 11, 22, 26,
       31, 15,  5, 20, 13,  8, 10, 25,
       30, 19, 12, 24, 29, 23, 28, 27
       };


//only works for powers of 2 inputs
static inline int32_t xs_ILogPow2 (int32_t v){
   assert (v && (v & (v-1)) == 0); // note the parentheses: & binds looser than ==
   //constant is binary 10 01010 11010 00110 01110 11111
   return xs_KotayBits[(uint32_t(v)*uint32_t( 0x04ad19df ))>>27];
}
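
A quick sanity check, in the same spirit as the main() in the hash-table answer above (this check is an addition, assuming the table and function exactly as given):

int main()
{
    for (int n = 0; n < 32; ++n)
        assert(xs_ILogPow2(int32_t(1u << n)) == n); // n == 31 relies on two's complement
}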
Slaver answered 31/7, 2019 at 15:43 Comment(0)

So the fastest option will be the builtin functions provided by the compiler. But there are some alternatives in plain code.

Benchmarks (100 million iterations):

|         | unoptimized | optimized |
| ------- | ----------- | --------- |
| loop    | 2.057s      | 0.780s    |
| func1   | 0.749s      | 0.151s    |
| func2   | 0.721s      | 0.000s    |
| builtin | 0.189s      | 0.002s    |

Note: func2 optimized is actually faster than the builtin; I can confirm it executes correctly (the 0.000s reading likely means the compiler optimized most of the benchmark loop away). That being said, there are many things that can break the optimizations of func2; the builtin's optimizations seem to be more consistent.

Function 1:

unsigned int f1(unsigned int value) {
    unsigned int n1 = 0;
    unsigned int n2 = 1;
    unsigned int n3 = 2;
    unsigned int n4 = 3;
    unsigned int n5 = 4;
    unsigned int n6 = 5;
    unsigned int n7 = 6;
    unsigned int n8 = 7;
    while((1u << n8) < value) { // 1u: n8 can reach 31, where a signed shift is undefined
        n1 += 8;
        n2 += 8;
        n3 += 8;
        n4 += 8;
        n5 += 8;
        n6 += 8;
        n7 += 8;
        n8 += 8;
    }
    if(value <= (1 << n1)) return n1;
    if(value <= (1 << n2)) return n2;
    if(value <= (1 << n3)) return n3;
    if(value <= (1 << n4)) return n4;
    if(value <= (1 << n5)) return n5;
    if(value <= (1 << n6)) return n6;
    if(value <= (1 << n7)) return n7;
    return n8;
}

Function 2:

unsigned int f2(unsigned int v) {
    if(v <= (1 << 0)) return 0;
    if(v <= (1 << 1)) return 1;
    if(v <= (1 << 2)) return 2;
    if(v <= (1 << 3)) return 3;
    if(v <= (1 << 4)) return 4;
    if(v <= (1 << 5)) return 5;
    if(v <= (1 << 6)) return 6;
    if(v <= (1 << 7)) return 7;
    if(v <= (1 << 8)) return 8;
    if(v <= (1 << 9)) return 9;
    if(v <= (1 << 10)) return 10;
    if(v <= (1 << 11)) return 11;
    if(v <= (1 << 12)) return 12;
    if(v <= (1 << 13)) return 13;
    if(v <= (1 << 14)) return 14;
    if(v <= (1 << 15)) return 15;
    if(v <= (1 << 16)) return 16;
    if(v <= (1 << 17)) return 17;
    if(v <= (1 << 18)) return 18;
    if(v <= (1 << 19)) return 19;
    if(v <= (1 << 20)) return 20;
    if(v <= (1 << 21)) return 21;
    if(v <= (1 << 22)) return 22;
    if(v <= (1 << 23)) return 23;
    if(v <= (1 << 24)) return 24;
    if(v <= (1 << 25)) return 25;
    if(v <= (1 << 26)) return 26;
    if(v <= (1 << 27)) return 27;
    if(v <= (1 << 28)) return 28;
    if(v <= (1 << 29)) return 29;
    if(v <= (1 << 30)) return 30;
    return 31;
}

Yes, these functions look... well, strange. Both functions are written this way so that the CPU can execute multiple lines at the same time (the eight counters in func1 and the independent comparisons have no data dependencies on each other). Function 2 seems to work better with compiler optimizations.
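
For reference, the builtin row above presumably measures a trailing-zero-count intrinsic; the exact benchmark harness isn't shown, so the wrapper below is an assumption:

// assumption: "builtin" means something like GCC/Clang's __builtin_ctz
unsigned int fbuiltin(unsigned int value) {
    return __builtin_ctz(value); // trailing zeros == the exponent for a power of two
}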

Dotted answered 1/7 at 16:33 Comment(0)
