Compute fast log base 2 ceiling
Asked Answered
A

15

44

What is a fast way to compute the (long int) ceiling(log_2(i)), where the input and output are 64-bit integers? Solutions for signed or unsigned integers are acceptable. I suspect the best way will be a bit-twiddling method similar to those found here, but rather than attempt my own I would like to use something that is already well tested. A general solution will work for all positive values.

For instance, the values for 2,3,4,5,6,7,8 are 1,2,2,3,3,3,3

Edit: So far the best route seems to be to compute the integer/floor log base 2 (the position of the MSB) using any number of fast existing bithacks or register methods, and then to add one if the input is not a power of two. The fast bitwise check for powers of two is ((n & (n - 1)) == 0), true only for powers of two (and zero).
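For concreteness, a sketch of that combination; msb_index() here is just a stand-in for whichever fast MSB-position method gets picked:

#include <stdint.h>

/* stand-in for any fast MSB-position (floor log2) routine */
static int msb_index(uint64_t n)
{
    int i = -1;
    while (n) { n >>= 1; i++; }
    return i;
}

/* ceil(log2(n)) for n >= 1 */
uint64_t ceil_log2_sketch(uint64_t n)
{
    return msb_index(n) + ((n & (n - 1)) != 0); /* +1 unless n is a power of 2 */
}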

Edit 2: A good source on integer logarithms and leading zeroes methods is Sections 5-3 and 11-4 in Hacker's Delight by Henry S. Warren. This is the most complete treatment I've found.

Edit 3: This technique looks promising: https://mcmap.net/q/14723/-compute-fast-log-base-2-ceiling

Edit 4: C23 is apparently adding stdc_first_(leading/trailing)_(one/zero)
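With a C23 toolchain that ships <stdbit.h>, the type-generic stdc_bit_width macro (which yields 1 + floor(log_2(x)) for nonzero x) reduces this to a one-liner; a sketch, assuming library support:

#include <stdbit.h> /* C23 */
#include <stdint.h>

/* ceil(log_2(x)) for x >= 1: stdc_bit_width(0) == 0 and
   stdc_bit_width(2^k) == k + 1, so bit_width(x - 1) rounds up */
static inline unsigned ceil_log2_c23(uint64_t x)
{
    return stdc_bit_width(x - 1);
}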

Anthocyanin answered 17/7, 2010 at 16:59 Comment(6)
Must be exactly correct for at least all values strictly greater than one and less than a large number, say 2^63 or 2^62.Anthocyanin
Please see my answer below. I put an explanation + the code that will do this for you.Emrick
In the past, I've generally used a combination of lookup tables and bit-twiddling, similar to the ones in the link to the bit-twiddling page.Monagan
If you're dealing with positive values only, a simple way to handle the rounding is to find the most significant bit set for ((x << 1) - 1). You'd need to special-case x == 0, and you'll overflow if the top bit is set, but this method might be faster than some of the other rounding techniques presented.Kymberlykymograph
in C++20 just use std::bit_ceil, unfortunately this question is about CAnatomical
sooo, you're gonna want to get a really big hard drive and make a big table.Pusillanimity
S
23

This algorithm has already been posted, but the following implementation is very compact and should optimize into branch-free code.

int ceil_log2(unsigned long long x)
{
  static const unsigned long long t[6] = {
    0xFFFFFFFF00000000ull,
    0x00000000FFFF0000ull,
    0x000000000000FF00ull,
    0x00000000000000F0ull,
    0x000000000000000Cull,
    0x0000000000000002ull
  };

  int y = (((x & (x - 1)) == 0) ? 0 : 1);  /* start at 1 unless x is a power of 2 */
  int j = 32;
  int i;

  for (i = 0; i < 6; i++) {
    int k = (((x & t[i]) == 0) ? 0 : j);   /* is the MSB above the t[i] boundary? */
    y += k;                                /* accumulate the MSB position */
    x >>= k;                               /* shift the upper half down if so */
    j >>= 1;                               /* halve the step: 32, 16, 8, 4, 2, 1 */
  }

  return y;
}


#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
  if (argc < 2)
    return 1;

  /* strtoull instead of atol, so the full 64-bit range is parsed correctly */
  printf("%d\n", ceil_log2(strtoull(argv[1], NULL, 10)));

  return 0;
}
Semitics answered 10/3, 2013 at 21:2 Comment(4)
I've remarked this as the best answer. For anyone following the discussion, the reason this is currently the best answer is the assembly language solutions are platform specific.Anthocyanin
Any chance of rewriting this to use informative variable names, so it's possible to understand how it works?Delao
why do you use int instead of unsigned?Heffernan
@Heffernan the answer will always be between 0 and 64, so any integral type can be used. And since any integer type can be used, I chose the default type "int".Semitics
B
23

If you can limit yourself to gcc, there are a set of builtin functions which return the number of leading zero bits and can be used to do what you want with a little work:

int __builtin_clz (unsigned int x)
int __builtin_clzl (unsigned long)
int __builtin_clzll (unsigned long long)
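For instance, a sketch of the "little work" on top of __builtin_clzll (note that __builtin_clz*(0) is undefined, so the caller must guarantee x != 0):

/* ceil(log_2(x)) for x >= 1 */
static inline int ceil_log2_clz(unsigned long long x)
{
  int floor_log2 = 63 - __builtin_clzll(x);  /* MSB position */
  return floor_log2 + ((x & (x - 1)) != 0);  /* +1 unless x is a power of 2 */
}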
Buitenzorg answered 17/7, 2010 at 17:23 Comment(0)
B
13

If you're compiling for 64-bit processors on Windows, I think this should work. _BitScanReverse64 is an intrinsic function.

#include <intrin.h>
__int64 log2ceil( __int64 x )
{
  unsigned long index;
  if ( !_BitScanReverse64( &index, x ) )
     return -1LL; //dummy return value for x==0

  // add 1 if x is NOT a power of 2 (to do the ceil)
  return index + (x&(x-1)?1:0);
}

For 32-bit builds, you can emulate _BitScanReverse64 with one or two calls to _BitScanReverse: check the upper 32 bits of x, ((long*)&x)[1], then the lower 32 bits if needed, ((long*)&x)[0].
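A sketch of that emulation (little-endian assumed, and the function name is just illustrative):

#include <intrin.h>

/* returns nonzero and sets *index to the MSB position, or 0 if x == 0 */
static unsigned char BitScanReverse64_emulated(unsigned long *index,
                                               unsigned __int64 x)
{
  if (_BitScanReverse(index, (unsigned long)(x >> 32))) {
    *index += 32;  /* the MSB was in the upper 32 bits */
    return 1;
  }
  return _BitScanReverse(index, (unsigned long)x);
}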

Borchert answered 17/7, 2010 at 17:48 Comment(3)
I was thinking something more general than Windows-specific, but so far yours is the closest to the answer. Translated, the idea is that there is a fast bitwise method for checking whether something is a power of two (shown above). We can use this method in conjunction with a register method for determining the position of the MSB to retrieve the answer.Anthocyanin
+1 and @highperformance: if you only intend to run your code on x86 processors, you can do the bit-scanning yourself with a little bit of assembly. (the BSR instruction, to be specific)Untenable
@casablanca: It's much better to use an intrinsic that the compiler understands than to use actual inline asm. Compilers can do constant-propagation through _BitScanReverse64 / __builtin_clzll, but not through an inline-asm statement. Also, MSVC inline asm is total garbage for wrapping a single instruction. (GNU C inline asm does ok, but still no constant-propagation). Besides, __builtin_clzll is portable GNU C, and will compile to whatever is good on any target machine.Electret
W
9

The fastest approach I'm aware of uses a fast log2 that rounds down (lg_down() below), combined with an unconditional adjustment of the input value before and after to handle the rounding-up case, as in lg_up() shown below.

/* base-2 logarithm, rounding down */
static inline uint64_t lg_down(uint64_t x) {
  return 63U - __builtin_clzl(x);
}

/* base-2 logarithm, rounding up */
static inline uint64_t lg_up(uint64_t x) {
  return lg_down(x - 1) + 1;
}

Basically adding 1 to the rounded-down result is already correct for all values except exact powers of two (since in that case the floor and ceil approaches should return the same answer), so it is sufficient to subtract 1 from the input value to handle that case (it doesn't change the answer for the other cases) and add one to the result.

This is usually slightly faster than the approaches that adjust the value by explicitly checking for exact powers of two (e.g., adding a !!(x & (x - 1)) term). It avoids any comparisons and conditional operations or branches, is more likely to simplify when inlined, is more amenable to vectorization, etc.

This relies on the "count leading bits" functionality offered by most CPUs using the clang/icc/gcc builtin __builtin_clzl, but other platforms offer something similar (e.g., the BitScanReverse intrinsic in Visual Studio).

Unfortunately, this may return the wrong answer for log(1), since that leads to __builtin_clzl(0), which is undefined behavior based on the gcc documentation. Of course, the general "count leading zeros" function has perfectly defined behavior at zero, but the gcc builtin is defined this way because, prior to the BMI ISA extension on x86, it would have been using the bsr instruction, which itself leaves its result undefined for a zero input.

You could work around this if you know you have the lzcnt instruction by using the lzcnt intrinsic directly. Most platforms other than x86 never went through the bsr mistake in the first place, and probably also offer methods to access their "count leading zeros" instruction if they have one.
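For example, a sketch of that workaround with the x86 LZCNT intrinsic (assumes BMI1/LZCNT hardware and compiling with -mlzcnt or -march=haswell; unlike the builtin, _lzcnt_u64(0) is defined and returns 64, so lg_up(1) == 0 falls out naturally):

#include <immintrin.h>
#include <stdint.h>

/* base-2 logarithm, rounding up; valid for x >= 1 */
static inline uint64_t lg_up_lzcnt(uint64_t x) {
  return 64 - _lzcnt_u64(x - 1);  /* x - 1 == 0 gives lzcnt 64, hence 0 */
}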

Wraith answered 15/7, 2018 at 20:11 Comment(8)
This looks promising, as I've edited the main question to indicate. Have you tested it on the relevant cases and corners, e.g., 0, 1, and 2, the full 32-bit (unsigned) integer space, the three integers around each step change in the logarithm output, and the values near uint64_t max?Anthocyanin
@Anthocyanin - I use a 32-bit version and I tested all 32-bit values there. However, there is a caveat regarding __builtin_clzl - passing zero is undefined, which means log(1) is undefined. In practice it works when the clz compiles to an lzcnt instruction on x86, or the ARM equivalent, but not if it ends up as bsr. Perhaps there is some way to make it defined but still fast given this behavior of the GCC intrinsic. If you know you are compiling on a platform with lzcnt you can use the intrinsic for that. I will update the answer with these caveats.Wraith
@Wraith very good try, but alas!, my answer below is 33% better in x86-64 (three assembler instructions of comparable complexity versus your four), see the difference here at the compiler explorer. I am not sure why both GCC and Clang (which generates much superior results for this, by the way) penalize your solutionPrediction
@Prediction - in the specific case of clang with no specified architecture, yes. I'm usually targeting machines with lzcnt however, so with -march=haswell they are more comparable - at least on clang, gcc and icc, the solution in this answer is the same number of instructions (4) as your, and arguably slightly better in practice since it avoids lea which runs on fewer ports, and especially "3 component lea" which gcc and icc (but not clang) use, as that runs on 1 port and has 3 cycles of latency.Wraith
All in all, this is probably a bit academic: a user who really cares could test both, and since this function is highly likely to be inlined (and should be if performance matters), how the code works out after inlining ends up being more important.Wraith
@Wraith I have been researching why Clang et al. use a more expensive encoding for this function with -march=haswell. According to several colleagues at the CPPLang slack x86 channel this is a bug, and I am considering to report it. There is a lot to say, since your implementation does not work reliably for 1/64 of the codomain, just like mine does not work for 1/64 of the codomainPrediction
@Prediction - yes, you can find a lot of similar performance bugs the instrinsic definition expands to some fixed recipe, but too late in the list of optimization passes to remove unnecessary operations so you are left with apparently redundant stuff. On GCC it sometimes depends on what -march flag you use, even if the same instructions are selected. About the failure of my solution with log(1), yes it's a shame. It works in principle at the assembly level with lzcnt, but it's unfortunate to have to use intel-specific instructions to get that reliably.Wraith
GCC still does not fix the bug, so it may generate suboptimal code when returning unsigned value on x86_64. This can be worked around by (yet) undocumented target-specific builtins or inline assembly. Clang is not suffering from this bug, though.Kentish
D
6
#include "stdafx.h"
#include "assert.h"

int getpos(unsigned __int64 value)
{
    if (!value)
    {
      return -1; // no bits set
    }
    int pos = 0;
    if (value & (value - 1ULL))
    {
      pos = 1;
    }
    if (value & 0xFFFFFFFF00000000ULL)
    {
      pos += 32;
      value = value >> 32;
    }
    if (value & 0x00000000FFFF0000ULL)
    {
      pos += 16;
      value = value >> 16;
    }
    if (value & 0x000000000000FF00ULL)
    {
      pos += 8;
      value = value >> 8;
    }
    if (value & 0x00000000000000F0ULL)
    {
      pos += 4;
      value = value >> 4;
    }
    if (value & 0x000000000000000CULL)
    {
      pos += 2;
      value = value >> 2;
    }
    if (value & 0x0000000000000002ULL)
    {
      pos += 1;
      value = value >> 1;
    }
    return pos;
}

int _tmain(int argc, _TCHAR* argv[])
{    
    assert(getpos(0ULL) == -1); // No bits set, return -1.
    assert(getpos(1ULL) == 0);
    assert(getpos(2ULL) == 1);
    assert(getpos(3ULL) == 2);
    assert(getpos(4ULL) == 2);
    for (int k = 0; k < 64; ++k)
    {
        int pos = getpos(1ULL << k);
        assert(pos == k);
    }
    for (int k = 0; k < 64; ++k)
    {
        int pos = getpos( (1ULL << k) - 1);
        assert(pos == (k < 2 ? k - 1 : k) );
    }
    for (int k = 0; k < 64; ++k)
    {
        int pos = getpos( (1ULL << k) | 1);
        assert(pos == (k < 1 ? k : k + 1) );
    }
    for (int k = 0; k < 64; ++k)
    {
        int pos = getpos( (1ULL << k) + 1);
        assert(pos == k + 1);
    }
    return 0;
}
Dealfish answered 2/8, 2010 at 20:0 Comment(6)
Lookup table would eliminate the last several if clauses, four or more of them depending on the available memory.Anthocyanin
To use a 16-bit lookup table: declare global short getpos_lookup[1 << 16];. Pre-populate: int i; for(i=1;i<1<<16;i++) getpos_lookup[i]=log2(i); Then under the 16 case comment out the 8/4/2/1 cases and put pos += getpos_lookup[v];.Anthocyanin
Actually i'm not even sure if my version is fast at all. Using a for-loop might be faster than any of the other approaches.Dealfish
I ran some simple tests. Your method is appreciably faster than a loop (e.g. while(value>>=1)pos++;). Modifying your method to use a lookup table is slightly faster, but I wouldn't call it appreciably faster. For my purposes your method is already fast enough. However, if someone were looking to continue improving on it, I would look into: 1. replacing the MSB detection with a register call (potentially using #ifdef statements for portability). 2. employing some heuristics to exploit known distributions of the input (e.g. 90% of incoming numbers are under 1000) 3. use of lookup tablesAnthocyanin
By switching to constants like 0xFFFF0000FFFF0000, 0xFF00FF00FF00FF00, ... you can remove the shifts.Stillborn
@Stillborn if you did that, you would need to mask the input value rather than shift it. Otherwise, you cannot be sure which set of 1s is letting the value through. For instance, the value 0x0000000100100000 passes the first mask and your suggested second mask of 0xFFFF0000FFFF0000 even though bit 32 is the highest one set.Upper
H
6

Using the gcc builtins mentioned by @egosys you can build some useful macros. For a quick and rough floor(log2(x)) calculation you can use:

#define FAST_LOG2(x) (sizeof(unsigned long)*8 - 1 - __builtin_clzl((unsigned long)(x)))

For a similar ceil(log2(x)), use:

#define FAST_LOG2_UP(x) (((x) - (1UL << FAST_LOG2(x))) ? FAST_LOG2(x) + 1 : FAST_LOG2(x))

The latter can be further optimized using more gcc peculiarities to avoid the double call to the builtin, but I'm not sure you need it here.

Horseman answered 10/5, 2012 at 17:4 Comment(4)
Thanks. My concern with this is the call to the built-in. It does compile on gcc and clang, which would cover most instances. If I knew it compiled on icc I might go with it. Cross-platform compatibility is a concern. I also wouldn't mind seeing the double call cleaned up as you suggest. (Can you not use (x&(x-1))?)Anthocyanin
You can use #define FAST_LOG2_UP(x) ({ unsigned long log = FAST_LOG2(x); ((x) - (1 << log)) ? log + 1 : log; }) to avoid multiple calls to the builtin. Please note this is again gcc-specific.Horseman
another improvement can be eliminating the branching: log + !!(x ^ (1 << log))Bracer
In my tests, two variations of the AND-MINUS method ((x&(x-1))?1:0) and (!!(x&(x-1))) ran within 1% of each other: probably the same. The XOR-SHIFT method MSB+!!(x^(1<<MSB)) was significantly slower, by about 50%.Anthocyanin
A
5

The following code snippet is a safe and portable way to extend plain-C methods, such as @dgobbi's, to use compiler intrinsics when compiled using supporting compilers (Clang). Placing this at the top of the method will cause the method to use the builtin when it is available. When the builtin is unavailable the method will fall back to the standard C code.

#ifndef __has_builtin
#define __has_builtin(x) 0
#endif

#if __has_builtin(__builtin_clzll) //use compiler if possible
  return ((sizeof(unsigned long long) * 8 - 1) - __builtin_clzll(x)) + (!!(x & (x - 1)));
#endif
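For illustration, a minimal sketch of the snippet in context; the fallback branch here uses a plain shift loop as a stand-in for whichever portable method you prefer:

#ifndef __has_builtin
#define __has_builtin(x) 0
#endif

/* ceil(log_2(x)) for x >= 1 */
int ceil_log2(unsigned long long x)
{
#if __has_builtin(__builtin_clzll) //use compiler if possible
  return ((sizeof(unsigned long long) * 8 - 1) - __builtin_clzll(x)) + (!!(x & (x - 1)));
#else
  int r = !!(x & (x - 1)); /* +1 unless x is a power of two */
  while (x >>= 1)
    r++;
  return r;
#endif
}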
Anthocyanin answered 12/1, 2014 at 7:1 Comment(0)
E
4

The true fastest solution:

A binary search tree of 63 entries. These are the powers of 2 from 0 to 63. One-time generation function to create the tree. The leafs represent the log base 2 of the powers (basically, the numbers 1-63).

To find the answer, you feed a number into the tree, and navigate to the leaf node greater than the item. If the leaf node is exactly equal, result is the leaf value. Otherwise, result is the leaf value + 1.

Complexity is fixed at O(6).
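A minimal sketch of that bisection (as the comments below note, no literal tree is needed; the candidate interval for the MSB position is simply halved six times):

/* ceil(log_2(x)) for x >= 1, in exactly 6 halving steps */
int ceil_log2_bisect(unsigned long long x)
{
  int result = (x & (x - 1)) != 0;  /* +1 up front unless x is a power of 2 */
  int shift;
  for (shift = 32; shift != 0; shift >>= 1) {
    if (x >> shift) {               /* the MSB lies in the upper half */
      x >>= shift;
      result += shift;
    }
  }
  return result;
}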

Emrick answered 17/7, 2010 at 20:27 Comment(4)
of course you don't really need a tree, just adapt the bisection method for root finding to halve the interval each iteration.Tao
Dave: Thanks. Greg: Of course. Tree is easier to visualize for the uninitiated, though.Emrick
In practice, I think this kind of branching will be slowish on the CPU (a benchmark would be interesting!). Faster is possible because we can use instructions which operate on 64-bits in parallel. Anyways, your approach is like an unrolled version of "Find the log base 2 of an N-bit integer in O(lg(N)) operations" in the link. And... O(6)=O(1)=O(99999)Borchert
The binary search does not require any branching. It can be done entirely with arithmetic on sign bits/carry flags.Autoeroticism
T
3

Finding the log base 2 of an integer (64-bit or any other bit) with integer output is equivalent to finding the most significant bit that is set. Why? Because log base 2 is how many times you can divide the number by 2 to reach 1.

One way to find the MSB that's set is to simply bitshift to the right by 1 each time until you have 0. Another more efficient way is to do some kind of binary search via bitmasks.

The ceil part is easily worked out by checking if any other bits are set other than the MSB.

Thyrsus answered 17/7, 2010 at 17:11 Comment(6)
Don't you mean bitshift to the right by 1 each time until you have 1? And since he wants the ceiling, it's that number + 1.Emrick
@Computer Guru: No, it is not "that number + 1". It is "that number + 1 if the result isn't exact".Canella
Right, and the odds are that it won't be exact. So "that number + 1" is more accurate more often than "that number" :)Emrick
Ok, what about this: Calculate the number of "1" bits, and sum 1 if the number of "1" bits is greater than 1.Sham
I know generally how to go about it. The difficulty is figuring out which particular set of bit-wise operations is best. I was hoping there was an existing solution similar to those on the Stanford bithacks page.Anthocyanin
There are faster methods than bitshifting to the right by 1. https://mcmap.net/q/19995/-given-an-integer-how-do-i-find-the-next-largest-power-of-two-using-bit-twiddlingMise
W
3

The code below is simpler and works as long as the input x >= 1; clog2(0) gives an undefined answer (which makes sense, because log(0) is negative infinity...). You can add error checking for (x == 0) if you want:

unsigned int clog2 (unsigned int x)
{
    unsigned int result = 0;
    --x;
    while (x > 0) {
        ++result;
        x >>= 1;
    }

    return result;
}

By the way, the code for the floor of log2 is similar: (Again, assuming x >= 1)

unsigned int flog2 (unsigned int x)
{
    unsigned int result = 0;
    while (x > 1) {
        ++result;
        x >>= 1;
    }

    return result;
}
Waltraudwaltz answered 28/3, 2014 at 0:15 Comment(0)
P
3

I will give you the fastest way for x86-64 at the time of writing, plus a general technique that works for arguments < 2^63 if you have a fast floor; if you care about the full range, see below.

I am surprised by the low quality of the other answers, because they tell you how to get the floor but then transform the floor into the ceiling in a very expensive way (using conditionals and everything!).

If you can get the floor of the logarithm quickly, for example using __builtin_clzll, then the ceiling is very easily obtained like this:

unsigned long long log2Floor(unsigned long long x) {
    return 63 - __builtin_clzll(x);
}

unsigned long long log2Ceiling(unsigned long long x) {
    return log2Floor(2*x - 1);
}

It works because it adds 1 to the result unless x is exactly a power of 2: for x = 2^k, 2x - 1 = 2^(k+1) - 1 still has its MSB at position k, while for any other x the MSB of 2x - 1 sits one position higher than the MSB of x.

See the x86-64 assembler difference at the compiler explorer for another implementation of ceiling like this:

auto log2CeilingDumb(unsigned long x) {
    return log2Floor(x) + (!!(x & (x - 1)));
}

Gives:

log2Floor(unsigned long): # @log2Floor(unsigned long)
  bsr rax, rdi
  ret
log2CeilingDumb(unsigned long): # @log2CeilingDumb(unsigned long)
  bsr rax, rdi
  lea rcx, [rdi - 1]
  and rcx, rdi
  cmp rcx, 1
  sbb eax, -1
  ret
log2Ceiling(unsigned long): # @log2Ceiling(unsigned long)
  lea rax, [rdi + rdi]
  add rax, -1
  bsr rax, rax
  ret

For the full range, see the previous answer's return log2Floor(x - 1) + 1; it is significantly slower, since it uses four x86-64 instructions compared to the three above.

Prediction answered 16/10, 2018 at 21:35 Comment(2)
I think your characterization of the "low quality" of the other answers "because they tell you how to get the floor but transform the floor in a very expensive" is off base, since the other answers included my answer which is substantially identical to yours (you simply replace one addition by 1 with a * 2). The extra instruction is more or less just a weird quirk of compilation on the compiler you chose. If you used gcc instead, it takes 6 instructions.Wraith
In general I would expect the code generated to very comparable, across various compilers and hardware, except of course that this version returns the wrong answer for half of the input range, and mine exhibits undefined behavior for the input "1" (but returns the right answer in practice when lzcnt or similar is used).Wraith
C
3

I have benchmarked several implementations of a 64 bit "highest bit". The most "branch free" code is not in fact the fastest.

This is my highest-bit.c source file:

int highest_bit_unrolled(unsigned long long n)
{
  if (n & 0xFFFFFFFF00000000) {
    if (n & 0xFFFF000000000000) {
      if (n & 0xFF00000000000000) {
        if (n & 0xF000000000000000) {
          if (n & 0xC000000000000000)
            return (n & 0x8000000000000000) ? 64 : 63;
          else
            return (n & 0x2000000000000000) ? 62 : 61;
        } else {
          if (n & 0x0C00000000000000)
            return (n & 0x0800000000000000) ? 60 : 59;
          else
            return (n & 0x0200000000000000) ? 58 : 57;
        }
      } else {
        if (n & 0x00F0000000000000) {
          if (n & 0x00C0000000000000)
            return (n & 0x0080000000000000) ? 56 : 55;
          else
            return (n & 0x0020000000000000) ? 54 : 53;
        } else {
          if (n & 0x000C000000000000)
            return (n & 0x0008000000000000) ? 52 : 51;
          else
            return (n & 0x0002000000000000) ? 50 : 49;
        }
      }
    } else {
      if (n & 0x0000FF0000000000) {
        if (n & 0x0000F00000000000) {
          if (n & 0x0000C00000000000)
            return (n & 0x0000800000000000) ? 48 : 47;
          else
            return (n & 0x0000200000000000) ? 46 : 45;
        } else {
          if (n & 0x00000C0000000000)
            return (n & 0x0000080000000000) ? 44 : 43;
          else
            return (n & 0x0000020000000000) ? 42 : 41;
        }
      } else {
        if (n & 0x000000F000000000) {
          if (n & 0x000000C000000000)
            return (n & 0x0000008000000000) ? 40 : 39;
          else
            return (n & 0x0000002000000000) ? 38 : 37;
        } else {
          if (n & 0x0000000C00000000)
            return (n & 0x0000000800000000) ? 36 : 35;
          else
            return (n & 0x0000000200000000) ? 34 : 33;
        }
      }
    }
  } else {
    if (n & 0x00000000FFFF0000) {
      if (n & 0x00000000FF000000) {
        if (n & 0x00000000F0000000) {
          if (n & 0x00000000C0000000)
            return (n & 0x0000000080000000) ? 32 : 31;
          else
            return (n & 0x0000000020000000) ? 30 : 29;
        } else {
          if (n & 0x000000000C000000)
            return (n & 0x0000000008000000) ? 28 : 27;
          else
            return (n & 0x0000000002000000) ? 26 : 25;
        }
      } else {
        if (n & 0x0000000000F00000) {
          if (n & 0x0000000000C00000)
            return (n & 0x0000000000800000) ? 24 : 23;
          else
            return (n & 0x0000000000200000) ? 22 : 21;
        } else {
          if (n & 0x00000000000C0000)
            return (n & 0x0000000000080000) ? 20 : 19;
          else
            return (n & 0x0000000000020000) ? 18 : 17;
        }
      }
    } else {
      if (n & 0x000000000000FF00) {
        if (n & 0x000000000000F000) {
          if (n & 0x000000000000C000)
            return (n & 0x0000000000008000) ? 16 : 15;
          else
            return (n & 0x0000000000002000) ? 14 : 13;
        } else {
          if (n & 0x0000000000000C00)
            return (n & 0x0000000000000800) ? 12 : 11;
          else
            return (n & 0x0000000000000200) ? 10 : 9;
        }
      } else {
        if (n & 0x00000000000000F0) {
          if (n & 0x00000000000000C0)
            return (n & 0x0000000000000080) ? 8 : 7;
          else
            return (n & 0x0000000000000020) ? 6 : 5;
        } else {
          if (n & 0x000000000000000C)
            return (n & 0x0000000000000008) ? 4 : 3;
          else
            return (n & 0x0000000000000002) ? 2 : (n ? 1 : 0);
        }
      }
    }
  }
}

int highest_bit_bs(unsigned long long n)
{
  const unsigned long long mask[] = {
    0x000000007FFFFFFF,
    0x000000000000FFFF,
    0x00000000000000FF,
    0x000000000000000F,
    0x0000000000000003,
    0x0000000000000001
  };
  int hi = 64;
  int lo = 0;
  int i = 0;

  if (n == 0)
    return 0;

  for (i = 0; i < sizeof mask / sizeof mask[0]; i++) {
    int mi = lo + (hi - lo) / 2;

    if ((n >> mi) != 0)
      lo = mi;
    else if ((n & (mask[i] << lo)) != 0)
      hi = mi;
  }

  return lo + 1;
}

int highest_bit_shift(unsigned long long n)
{
  int i = 0;
  for (; n; n >>= 1, i++)
    ; /* empty */
  return i;
}

static int count_ones(unsigned long long d)
{
  d = ((d & 0xAAAAAAAAAAAAAAAA) >>  1) + (d & 0x5555555555555555);
  d = ((d & 0xCCCCCCCCCCCCCCCC) >>  2) + (d & 0x3333333333333333);
  d = ((d & 0xF0F0F0F0F0F0F0F0) >>  4) + (d & 0x0F0F0F0F0F0F0F0F);
  d = ((d & 0xFF00FF00FF00FF00) >>  8) + (d & 0x00FF00FF00FF00FF);
  d = ((d & 0xFFFF0000FFFF0000) >> 16) + (d & 0x0000FFFF0000FFFF);
  d = ((d & 0xFFFFFFFF00000000) >> 32) + (d & 0x00000000FFFFFFFF);
  return d;
}

int highest_bit_parallel(unsigned long long n)
{
  n |= n >> 1;
  n |= n >> 2;
  n |= n >> 4;
  n |= n >> 8;
  n |= n >> 16;
  n |= n >> 32;
  return count_ones(n);
}

int highest_bit_so(unsigned long long x)
{
  static const unsigned long long t[6] = {
    0xFFFFFFFF00000000ull,
    0x00000000FFFF0000ull,
    0x000000000000FF00ull,
    0x00000000000000F0ull,
    0x000000000000000Cull,
    0x0000000000000002ull
  };

  int y = (((x & (x - 1)) == 0) ? 0 : 1);
  int j = 32;
  int i;

  for (i = 0; i < 6; i++) {
    int k = (((x & t[i]) == 0) ? 0 : j);
    y += k;
    x >>= k;
    j >>= 1;
  }

  return y;
}

int highest_bit_so2(unsigned long long value)
{
  int pos = 0;
  if (value & (value - 1ULL))
  {
    pos = 1;
  }
  if (value & 0xFFFFFFFF00000000ULL)
  {
    pos += 32;
    value = value >> 32;
  }
  if (value & 0x00000000FFFF0000ULL)
  {
    pos += 16;
    value = value >> 16;
  }
  if (value & 0x000000000000FF00ULL)
  {
    pos += 8;
    value = value >> 8;
  }
  if (value & 0x00000000000000F0ULL)
  {
    pos += 4;
    value = value >> 4;
  }
  if (value & 0x000000000000000CULL)
  {
    pos += 2;
    value = value >> 2;
  }
  if (value & 0x0000000000000002ULL)
  {
    pos += 1;
    value = value >> 1;
  }
  return pos;
}

This is highest-bit.h:

int highest_bit_unrolled(unsigned long long n);
int highest_bit_bs(unsigned long long n);
int highest_bit_shift(unsigned long long n);
int highest_bit_parallel(unsigned long long n);
int highest_bit_so(unsigned long long n);
int highest_bit_so2(unsigned long long n);

And the main program (sorry about all the copy and paste):

#include <stdlib.h>
#include <stdio.h>
#include <time.h>
#include "highest-bit.h"

double timedelta(clock_t start, clock_t end)
{
  return (end - start)*1.0/CLOCKS_PER_SEC;
}

int main(int argc, char **argv)
{
  int i;
  volatile unsigned long long v;
  clock_t start, end;

  start = clock();

  for (i = 0; i < 10000000; i++) {
    for (v = 0x8000000000000000; v; v >>= 1)
      highest_bit_unrolled(v);
  }

  end = clock();

  printf("highest_bit_unrolled = %6.3fs\n", timedelta(start, end));

  start = clock();

  for (i = 0; i < 10000000; i++) {
    for (v = 0x8000000000000000; v; v >>= 1)
      highest_bit_parallel(v);
  }

  end = clock();

  printf("highest_bit_parallel = %6.3fs\n", timedelta(start, end));

  start = clock();

  for (i = 0; i < 10000000; i++) {
    for (v = 0x8000000000000000; v; v >>= 1)
      highest_bit_bs(v);
  }

  end = clock();

  printf("highest_bit_bs = %6.3fs\n", timedelta(start, end));

  start = clock();

  for (i = 0; i < 10000000; i++) {
    for (v = 0x8000000000000000; v; v >>= 1)
      highest_bit_shift(v);
  }

  end = clock();

  printf("highest_bit_shift = %6.3fs\n", timedelta(start, end));

  start = clock();

  for (i = 0; i < 10000000; i++) {
    for (v = 0x8000000000000000; v; v >>= 1)
      highest_bit_so(v);
  }

  end = clock();

  printf("highest_bit_so = %6.3fs\n", timedelta(start, end));

  start = clock();

  for (i = 0; i < 10000000; i++) {
    for (v = 0x8000000000000000; v; v >>= 1)
      highest_bit_so2(v);
  }

  end = clock();

  printf("highest_bit_so2 = %6.3fs\n", timedelta(start, end));

  return 0;
}

I've tried this on various Intel x86 boxes, old and new.

highest_bit_unrolled (unrolled binary search) is consistently and significantly faster than highest_bit_parallel (branch-free bit ops). That is faster than highest_bit_bs (binary search loop), which in turn is faster than highest_bit_shift (naive shift-and-count loop).

highest_bit_unrolled is also faster than the one in the accepted SO answer (highest_bit_so) and also one given in another answer (highest_bit_so2).

The benchmark cycles through a one-bit mask that covers successive bits. This is to try to defeat branch prediction in the unrolled binary search, which is realistic: in a real world program, the input cases are unlikely to exhibit locality of bit position.

Here are results on an old Intel(R) Core(TM)2 Duo CPU E4500 @ 2.20GHz:

$ ./highest-bit
highest_bit_unrolled =  6.090s
highest_bit_parallel =  9.260s
highest_bit_bs = 19.910s
highest_bit_shift = 21.130s
highest_bit_so =  8.230s
highest_bit_so2 =  6.960s

On a newer model Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz:

highest_bit_unrolled =  1.555s
highest_bit_parallel =  3.420s
highest_bit_bs =  6.486s
highest_bit_shift =  9.505s
highest_bit_so =  4.127s
highest_bit_so2 =  1.645s

On the newer hardware, highest_bit_so2 comes closer to highest_bit_unrolled. The order is not quite the same, though; highest_bit_so now really falls behind, and is slower than highest_bit_parallel.

The fastest one, highest_bit_unrolled, contains the most code with the most branches; every single return value is reached by a different set of conditions, with its own dedicated piece of code.

The intuition of "avoid all branches" (due to worries about branch mis-predictions) is not always right. Modern (and even not-so-modern any more) processors contain considerable cunning in order not to be hampered by branching.


P.S. highest_bit_unrolled was introduced in the TXR language in December 2011 (with mistakes, since debugged).

Recently, I started wondering whether some nicer, more compact code without branches might not be faster.

I'm somewhat surprised by the results.

Arguably, the code should really be #ifdef-ing for GNU C and using some compiler primitives, but as far as portability goes, that version stays.

Capsaicin answered 19/2, 2019 at 1:47 Comment(3)
This is good work. I think you are into new territory on the question. Conceding that, it's probably worth reevaluating where the goalposts should be. I would guess, without evidence, that the speedup here hinges on the predictable nature of the incremental for loop in the benchmark, and while that has its applications, a more realistic model might be random sampling, which adds its own obstacles to producing legitimate timings. You've probably broken the question, and so rehabilitating it means reframing it in a meaningful and useful way, which may or may not be possible.Anthocyanin
Nice. I suspect the reason highest_bit_unrolled is fastest is that it's careful to issue not a single CPU store cycle. With fetch cycles only, the CPU can really plow ahead. I'm no expert, but this seems to suggest that in general, bus coherency delays are significantly more expensive than branch mis-prediction penalties (which matches my crude intuition).Buckthorn
@GlennSlayden We'd have to examine the machine code to be sure, but stores to local variables should translate to just register manipulation. However, there are data flow dependencies there which don't parallelize. In the highest_bit_so2 function, for instance, each addition to the output accumulator depends on examining the input value, which is being conditionally shifted. So this register cannot easily be mapped to a set of registers in order to parallelize the code; and there will be pipeline stalls.Capsaicin
A
2

If you have 80-bit or 128-bit floats available, cast to that type and then read off the exponent bits. This link has details (for integers up to 52 bits) and several other methods:

http://graphics.stanford.edu/~seander/bithacks.html#IntegerLogIEEE64Float

Also, check the ffmpeg source. I know they have a very fast algorithm. Even if it's not directly extensible to larger sizes, you can easily do something like if (x > UINT32_MAX) return fastlog2(x >> 32) + 32; else return fastlog2(x);
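A sketch of the exponent-read idea with an ordinary double (assumes IEEE-754 binary64; exact while x fits in the 52-bit mantissa):

#include <stdint.h>
#include <string.h>

/* floor(log_2(x)) for 1 <= x < 2^52, by reading the biased exponent field */
static int floor_log2_via_double(uint64_t x)
{
  double d = (double)x;            /* exact conversion for x < 2^53 */
  uint64_t bits;
  memcpy(&bits, &d, sizeof bits);  /* type-pun without aliasing UB */
  return (int)((bits >> 52) & 0x7FF) - 1023;
}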

Autoeroticism answered 17/7, 2010 at 17:29 Comment(0)
U
2

The naive linear search may be an option for evenly distributed integers, since it needs slightly fewer than 2 comparisons on average, for any integer size: half of all values are caught by the first comparison, a quarter by the second, and so on, so the expected count is the sum of k/2^k, which is just under 2.

/* between 1 and 64 comparisons, ~2 on average */
#define u64_size(c) (              \
    0x8000000000000000 < (c) ? 64  \
  : 0x4000000000000000 < (c) ? 63  \
  : 0x2000000000000000 < (c) ? 62  \
  : 0x1000000000000000 < (c) ? 61  \
...
  : 0x0000000000000002 < (c) ?  2  \
  : 0x0000000000000001 < (c) ?  1  \
  :                             0  \
)
Uhhuh answered 3/4, 2015 at 0:55 Comment(4)
Integers are almost never uniformly distributed over 0..UINT64_MAX. Especially not integers that you'll want to use this function on. The Law of Smaller Numbers is that most numbers in computer programs are small most of the time, and it's a win to optimize for that case. Not downvoting because it's an interesting observation, and useful for the very rare cases where it applies (and the target machine doesn't have the instruction in hardware).Electret
@PeterCordes Except... that well-formulated hash codes have the explicit goal of being formulated as uniformly as possible, so insofar as this code is appropriate for that condition (as claimed), its utility may not be so obscure.Buckthorn
@GlennSlayden: Is it common to want ilog2 on hash codes? For use in a hash table, you'd take modulo the size of the hash table, just an AND assuming you keep the size as a power of 2. I'm sure there are some use-cases involving uniformly distributed full-range inputs, but hashes seem an unlikely example unless I'm missing a use-case where that makes sense.Electret
@PeterCordes Oh, I see your point. I was momentarily confused about needing to do a log operation in my actual use case, as opposed to more generally giving a counter-example to the Law of Smaller Numbers. For the record, the videoId (64-bit) or channelId (128-bit) values issued by YouTube are basically (opaque, permanent) integers of the indicated size which are suprisingly uniform in distribution over the full respective ranges. But yes, they're essentially proxy tokens for content, so while my app certainly needs to hash them (a lot), there's little need for doing log on them.Buckthorn
M
1

A sort of binary search for the highest order bit.

A good reference on bitwise operation hacks posted on stanford.edu: Bit Twiddling Hacks. There are a lot of examples in that reference for many things, including several algorithms for computing log2 values. This includes similar - and maybe better - approaches to this one.

I had the idea that a sort of binary search for the most significant bit would be faster than the max 60+ loops of the simple solution. The algorithm shifts n by half of its last shift value each iteration to zero in on the MSB.

For 64-bit values, it requires 6 iterations to determine the position of the most significant bit - which is the floor log2 of n.

The performance of this approach, as I measured it, wasn't much better than the simple loop in the 2nd example below.

// floor log2 in 6 max iterations for unsigned 64-bit values.
//
int floor_log2(unsigned long long n)
{
    int shval = 32;
    int msb   =  0;
    while (shval) {
        if (n >> shval) {
            msb += shval;
            n  >>= shval;
        }
        shval >>= 1;
    }
    return msb;
}

int ceil_log2(unsigned long long n) 
{
    return floor_log2(2 * n - 1);
}

Simple algorithm:

int floor_log2_simple(unsigned long long n)
{
    int c =  0;
    while (n) {
        n >>= 1;
        c++;
    }
    return c - 1;
}

These same algorithms implemented in Rust performed nearly identically. The top approach performed a tiny amount better, but only by a couple of nanoseconds on average. No amount of tweaking the algorithm gave any better performance.

meh...

Mojgan answered 22/6, 2021 at 21:0 Comment(2)
This does 6 iterations in the worst case, not 4.Doublecheck
You're right - but it invariably does 6 iterations, every time.Mojgan
