What is the fastest/most efficient way to find the highest set bit (msb) in an integer in C?
Asked Answered
O

34

164

If I have some integer n, and I want to know the position of the most significant bit (that is, if the least significant bit is on the right, I want to know the position of the farthest left bit that is a 1), what is the quickest/most efficient method of finding out?

I know that POSIX supports a ffs() method in <strings.h> to find the first set bit, but there doesn't seem to be a corresponding fls() method.

Is there some really obvious way of doing this that I'm missing?

What about in cases where you can't use POSIX functions for portability?

EDIT: What about a solution that works on both 32- and 64-bit architectures (many of the code listings seem like they'd only work on 32-bit integers).

Oleg answered 22/3, 2009 at 23:37 Comment(6)
there's a few implementations here: graphics.stanford.edu/~seander/bithacks.html#ZerosOnRightLinear (Edit: After rereading your question, I realise that the link above is for finding the rightmost set bit, not leftmost as you require, although without a sense of word size, it's a tricky one to answer)Serpigo
See "Number of leading zeros algorithms" in Hacker's Delight.Turnspit
That counts zeros on the right; the question was about zeros on the left. At least, in a quick skim I don't see it there.Turnspit
do you specifically want the bit number 'n', or would 2 ^ n suffice?Wizened
Look at the "Log Base 2" algorithms - as Anderson says in the article: "The log base 2 of an integer is the same as the position of the highest bit set (or most significant bit set, MSB)"Elate
"Number of leading zeros algorithms" in Hacker's Delight is invalid,there is wayback machine versionIsidoro
K
83

GCC has:

 -- Built-in Function: int __builtin_clz (unsigned int x)
     Returns the number of leading 0-bits in X, starting at the most
     significant bit position.  If X is 0, the result is undefined.

 -- Built-in Function: int __builtin_clzl (unsigned long)
     Similar to `__builtin_clz', except the argument type is `unsigned
     long'.

 -- Built-in Function: int __builtin_clzll (unsigned long long)
     Similar to `__builtin_clz', except the argument type is `unsigned
     long long'.

I'd expect them to be translated into something reasonably efficient for your current platform, whether it be one of those fancy bit-twiddling algorithms, or a single instruction.


A useful trick if your input can be zero is __builtin_clz(x | 1): unconditionally setting the low bit without modifying any others makes the output 31 for x=0, without changing the output for any other input.

To avoid needing to do that, your other option is platform-specific intrinsics like ARM GCC's __clz (no header needed), or x86's _lzcnt_u32 on CPUs that support the lzcnt instruction. (Beware that lzcnt decodes as bsr on older CPUs instead of faulting, which gives 31-lzcnt for non-zero inputs.)

There's unfortunately no way to portably take advantage of the various CLZ instructions on non-x86 platforms that do define the result for input=0 as 32 or 64 (according to the operand width). x86's lzcnt does that, too, while bsr produces a bit-index that the compiler has to flip unless you use 31-__builtin_clz(x).

(The "undefined result" is not C Undefined Behavior, just a value that isn't defined. It's actually whatever was in the destination register when the instruction ran. AMD documents this, Intel doesn't, but Intel's CPUs do implement that behaviour. But it's not whatever was previously in the C variable you're assigning to, that's not usually how things work when gcc turns C into asm. See also Why does breaking the "output dependency" of LZCNT matter?)

Kerrikerrie answered 23/3, 2009 at 15:16 Comment(3)
MSVC will have _BitScanReverseMonad
The undefined-on-zero behaviour lets them compile to a single BSR instruction on x86, even when LZCNT isn't available. This is a big advantage for __builtin_ctz over ffs, which compiles to a BSF and a CMOV to handle the input-was-zero case. On architectures without a short-enough implementation (e.g. old ARM without the clz instruction), gcc emits a call to a libgcc helper function.Toe
I just tried, and using ARM's official build of gcc for arm-none-eabi and __clz in fact is not available without a header. Furthermore, in my copy of CMSIS (which should be fairly current), __CLZ is defined for GCC as __builting_gcc, with all the drawbacks of that. For a full discussion see this issue - apparently it was made to allow better optimizations, even if it has it's drawbacks.Tiddly
S
50

Since 2^N is an integer with only the Nth bit set (1 << N), finding the position (N) of the highest set bit is the integer log base 2 of that integer.

http://graphics.stanford.edu/~seander/bithacks.html#IntegerLogObvious

unsigned int v;
unsigned r = 0;

while (v >>= 1) {
    r++;
}

This "obvious" algorithm may not be transparent to everyone, but when you realize that the code shifts right by one bit repeatedly until the leftmost bit has been shifted off (note that C treats any non-zero value as true) and returns the number of shifts, it makes perfect sense. It also means that it works even when more than one bit is set — the result is always for the most significant bit.

If you scroll down on that page, there are faster, more complex variations. However, if you know you're dealing with numbers with a lot of leading zeroes, the naive approach may provide acceptable speed, since bit shifting is rather fast in C, and the simple algorithm doesn't require indexing an array.

NOTE: When using 64-bit values, be extremely cautious about using extra-clever algorithms; many of them only work correctly for 32-bit values.

Sizable answered 11/2, 2011 at 15:31 Comment(8)
It seems to work, but I do not fully understand why we don't shift out bits forever and get stuck in the while loop.Overbuild
@Overbuild Stepping through with a debugger can help explain why the loop exits. Basically, its' because the expression in the condition evaluates to 0 (which is treated as false) once the last 1 bit has been shifted off the right.Sizable
Nice idea to use the end result like that :)Overbuild
note: must be unsigned, for signed integers the right shift fails for negative numbers.Sausa
Xantix: The shift in C/C++ is a logical shift, so it works fine. For Java, JavaScript, or D, you need to use the logical shift operator >>>. Plus probably the comparator != 0, and some unspecified number of parenthesis.Ellis
might anyone be interested, this is 2 times faster than return (unsigned int)log2(val); (no optimizations)Insignificant
@Chase: No it's not. It's a logical shift for unsigned. For signed, it may or may not be a logical shift (and it's usually arithmetic, in fact).Raja
"this is 2 times faster than return (unsigned int)log2(val)" -- the faintest praise.Loisloise
P
44

Assuming you're on x86 and game for a bit of inline assembler, Intel provides a BSR instruction ("bit scan reverse"). It's fast on some x86s (microcoded on others). From the manual:

Searches the source operand for the most significant set bit (1 bit). If a most significant 1 bit is found, its bit index is stored in the destination operand. The source operand can be a register or a memory location; the destination operand is a register. The bit index is an unsigned offset from bit 0 of the source operand. If the content source operand is 0, the content of the destination operand is undefined.

(If you're on PowerPC there's a similar cntlz ("count leading zeros") instruction.)

Example code for gcc:

#include <iostream>

int main (int,char**)
{
  int n=1;
  for (;;++n) {
    int msb;
    asm("bsrl %1,%0" : "=r"(msb) : "r"(n));
    std::cout << n << " : " << msb << std::endl;
  }
  return 0;
}

See also this inline assembler tutorial, which shows (section 9.4) it being considerably faster than looping code.

Pereyra answered 23/3, 2009 at 0:0 Comment(10)
Actually this instruction is usually microcoded into a loop and is rather slow.Chuff
Which one ? BSR or CNTLZ ? As I read the x86-timing.pdf referenced above, BSR is only slow on the Netburst Pentiums. I know nothing about PowerPC though.Pereyra
...OK, on closer inspection make that "BSR is only fast on P3/Pentium-M/Core2 x86s". Slow on Netburst and AMD.Pereyra
cntlzw is a pipelined (not microcoded) op on most PPCs. Looks like it's about as fast as an add on Xenon.Saurian
@timday: also slow on (some?) Atom µarches.Troytroyer
@rlbond: Yes, but is it slower than some clever instruction sequence? How do I find out?Denbrook
There's absolutely no substitute for benchmarking your own (realistic) use cases. Whether the clever algorithms are faster or not might depend on what else the CPU's execution units is tied up with from "nearby" code (if anything), what the distribution of the values you're testing is... all sorts of things.Pereyra
Just a heads up: Your last two links are dead.Neckwear
If you're using GNU C anyway, you should use use __builtin_clz (or __builtin_clzll), which has the same undefined-on-zero behaviour that lets it compile to a single BSR on x86. Or LZCNT if available, because that's faster on more CPUs (e.g. on AMD it's fast even though BSR is slow, maybe because BSR has the weird behaviour of setting ZF according to the input, not the result). Or whatever is optimal on the target arch, since it's not limited to x86. Anyway, gcc.gnu.org/wiki/DontUseInlineAsm when you can avoid it, since it defeats constant-propagation and some other optimizations.Toe
@rlbond: huh, BSR on P4 Prescott is 2 uops with 16 cycle latency(!), with one per 4c throughput. But on earlier Netburst, it's only 4 cycle latency (still 2 uops), and one per 2c throughput. (source: agner.org/optimize). On most CPUs, it also has a dependency on its output which gcc doesn't account for (when the input is zero, the actual behaviour is to leave the destination unchanged). This can lead to problems like stackoverflow.com/questions/25078285/…. IDK why gcc missed BSR when fixing that.Toe
L
24

This is sort of like finding a kind of integer log. There are bit-twiddling tricks, but I've made my own tool for this. The goal of course is for speed.

My realization is that the CPU has an automatic bit-detector already, used for integer to float conversion! So use that.

double ff=(double)(v|1);
return ((*(1+(uint32_t *)&ff))>>20)-1023;  // assumes x86 endianness

This version casts the value to a double, then reads off the exponent, which tells you where the bit was. The fancy shift and subtract is to extract the proper parts from the IEEE value.

It's slightly faster to use floats, but a float can only give you the first 24 bit positions because of its smaller precision.


To do this safely, without undefined behaviour in C++ or C, use memcpy instead of pointer casting for type-punning. Compilers know how to inline it efficiently.

// static_assert(sizeof(double) == 2 * sizeof(uint32_t), "double isn't 8-byte IEEE binary64");
// and also static_assert something about FLT_ENDIAN?

double ff=(double)(v|1);

uint32_t tmp;
memcpy(&tmp, ((const char*)&ff)+sizeof(uint32_t), sizeof(uint32_t));
return (tmp>>20)-1023;

Or in C99 and later, use a union {double d; uint32_t u[2];};. But note that in C++, union type punning is only supported on some compilers as an extension, not in ISO C++.


This will usually be slower than a platform-specific intrinsic for a leading-zeros counting instruction, but portable ISO C has no such function. Some CPUs also lack a leading-zero counting instruction, but some of those can efficiently convert integers to double. Type-punning an FP bit pattern back to integer can be slow, though (e.g. on PowerPC it requires a store/reload and usually causes a load-hit-store stall).

This algorithm could potentially be useful for SIMD implementations, because fewer CPUs have SIMD lzcnt. x86 only got such an instruction with AVX512CD

Luna answered 22/3, 2009 at 23:49 Comment(19)
Won't that cause a load hit store on moving between register sets?Saurian
Isn't there a cost to loading/reading the FPU (as well as synchronizing with it)?Elate
Yes. And gcc will do nasty things with code like this with -O2 due to type-aliasing optimizations.Recluse
casting between integer and floating point can be surprisingly expensive on x86 CPU'sPettifogger
Yep, the FPU costs are high. But actual time measurements showed this was faster than all-bit ops or especially any loops. Try it and take the fastest is always the best advice. I have not had a problem with GCC and -O2 with this though.Luna
You can do this much more portably (and safely) with frexpCrofoot
@Chris: And much more slowly. :-)Breland
Isn't this undefined behaviour (reading a value through a pointer of an incompatible type)?Mountainous
Hacker's Delight explains how to correct for the error in 32-bit floats in 5-3 Counting Leading 0's. Here's their code, which uses an anonymous union to overlap asFloat and asInt: k = k & ~(k >> 1); asFloat = (float)k + 0.5f; n = 158 - (asInt >> 23); (and yes, this relies on implementation-defined behavior)Scholar
On the PowerPC, moving a value from a floating point register to an integer register (or vice versa) involves writing to memory and reading it back again.Loftin
This is a really neat idea, and potentially useful for a SIMD implementations without AVX512CD VPLZCNTD where type-punning an integer to float is free (because they use the same vector registers). For scalar of course, bsr is faster on all x86 CPUs newer than Pentium4 at least, so use an intrinsic for that. Anyway, this would be a much better answer if you avoided the strict-aliasing UB and the assumption that unsigned long was a 32-bit type.Toe
Great idea - I had the same idea but then later realized I ran across your post before and that probably seeded my brain with the idea. I am pretty sure this is the fastest method in c# since lzcnt is not supported without high-overhead hacks. Since c#'s compiler is more locked down it would be a tad safer to use... except for maybe non-x86.Terrieterrier
warning: the code as shown returns the incorrect answer for values in the range [7FFFFFFFFFFFFE00 - 7FFFFFFFFFFFFFFF]; all of these return 63 instead of the correct MSB 62. Taking 7FFFFFFFFFFFFFFF with the code above, OR-ing with 1 has no effect, and the IEEE double representation is 43E0000000000000. Applying unsigned >> 20 to the high dword (or >> 52 on the quad) gives 0x43e which is 1086. Finally, 1086 - 1023 = 63. This range of 512 error values are the only ones I'm currently aware of, but I haven't done an exhaustive check so there may be others.Predator
Another range of errors is [FFFFFFFFFFFFFC00 - FFFFFFFFFFFFFFFF]; these 1024 values all return 64 instead of the correct MSB index 63.Predator
...and here's a fix that works correctly over the entire range of 64-bit integers: double d = v & ~(v >> 1); ... int msb = (int)(*(ulong*)&d >> 52) - 1023; ... You can stop here with just that if you don't mind that the (already dubious) input value 0 gives result msb = -1023. If you prefer a prettier -1 for this special case, just tack on the following line ... msb |= (msb >> 31); at the end.Predator
If the effects of GCC optimization on this are so important, volatile can help here to ensure desired behavior regardless of optimization level.Javanese
@Javanese Are you responding to my comment? The errors I describe are not compiler errors, they are a property of the CPU, basically a hardware limitation of the IEEE technique shown in this answer. And it's not really the CPU's fault either; I believe what's going on is that information is lost because the IEEE spec is deliberately designed to provide more resolution near 1.0 at the expense of the +/- extremes, which is great for floating point values, but catastrophic when the FP unit is rudely co-opted for processing discrete integer values (and even worse, also under bitwise interpretation).Predator
@GlennSlayden No, actually MSN's comment. Thank you for your contribution.Javanese
hats off for an original solution and implementationArcade
A
19

This should be lightning fast:

int msb(unsigned int v) {
  static const int pos[32] = {0, 1, 28, 2, 29, 14, 24, 3,
    30, 22, 20, 15, 25, 17, 4, 8, 31, 27, 13, 23, 21, 19,
    16, 7, 26, 12, 18, 6, 11, 5, 10, 9};
  v |= v >> 1;
  v |= v >> 2;
  v |= v >> 4;
  v |= v >> 8;
  v |= v >> 16;
  v = (v >> 1) + 1;
  return pos[(v * 0x077CB531UL) >> 27];
}
Amortizement answered 23/3, 2009 at 0:32 Comment(12)
7 bit shifts, 5 or instructions, a multiplty and a potential cache miss. :) Did you benchmark it, or look at the assembler generated? It could end up quite slow, depending on how much of it the compiler can eliminate.Pettifogger
Hi, just wondering jalf, how do you now the thing with the possible cache miss?Benefactress
Arno's answer doesn't work beyond 24 bits, rlbond's answer doesn't even work (it returns an answer where the most significant one bit is turned on and all the other bits are turned off--not the answer desired), and Vasil's code fragment isn't complete.Amortizement
The "possible cache miss" is probably due to this code requiring access to its lookup table. If that table is not cached when this is called, there will be a stall while it's fetched. This might make the worst-case performance far worse than the solutions not using a LUT.Bumboat
@jalf: With you on the multiply, and the dependency chain of shifts is not ideal, but you're worried about cache misses? LOL... The only time a cache miss would hurt performance is if this routine was being called often in a tight loop, in which case the entire table would most likely be cached by call #2! Cache performance on a read-only table of 32 ints is not ever something to worry about.Toodleoo
not really the point. It uses a lot more data cache than necessary (more than one cache line, even), and more instruction cache than necessary. You'll likely get cache misses that could have been avoided the first time you call the function, and it will pollute the cache more than necessary, so after the call, other code might encounter more misses than necessary. LUT's often aren't worth the trouble because cache misses are expensive. But I only said it was something I'd want to benchmark before I claimed it was "lightning fast". Not that it is definitely a problem.Pettifogger
it does seem that this is 1.85 times faster than Quin Taylors solutionInsignificant
The problem is at the end where you extract the most significant bit of the generated mask. You do (v >> 1) + 1, which does erraneously set the least significant bit to one (the +1..). If you change the order, (v + 1) >> 1, now you miss the case where msb is 0x80000000. 0x80000000 expanded into bit mask is 0xffffffff, + 1, is zero, not 0x80000000. The correct way to handle this is to do the expansion into mask like you do.. but instead of isolating the highest set bit use the DeBruijn variant where you get unique hash with mask instead of single isolated bit. Both are perfect unique.Ireland
static const uint8 table[] = { 0, 9, 1, 10, 13, 21, 2, 29, 11, 14, 16, 18, 22, 25, 3, 30, 8, 12, 20, 28, 15, 17, 24, 7, 19, 27, 23, 6, 26, 5, 4, 31 }; return table[(v * 0x07c4acdd) >> 27];Ireland
Never mind, it was my bad; I compiled this on a 64 bit compiler so the 0x077CB531UL had to be changed into 0x077CB531 to get a correct result. The mask-debruijn variant saves two instructions because the highest set bit doesn't need to be isolated. I recommend 32 bit multiplication even when using 64 bit compiler. :)Ireland
The table has 32 entries, and every value is < 255 (127), so define the table as type unsigned char, and it will fit in a single 32 byte L1 cache line. And the whole thing fits in two cache lines.Hals
Re: have provided the only answer with source code that actually works, this answer fails when unsigned is not 32-bit. Good, but not universal.Headlight
T
14

Kaz Kylheku here

I benchmarked two approaches for this over 63 bit numbers (the long long type on gcc x86_64), staying away from the sign bit.

(I happen to need this "find highest bit" for something, you see.)

I implemented the data-driven binary search (closely based on one of the above answers). I also implemented a completely unrolled decision tree by hand, which is just code with immediate operands. No loops, no tables.

The decision tree (highest_bit_unrolled) benchmarked to be 69% faster, except for the n = 0 case for which the binary search has an explicit test.

The binary-search's special test for 0 case is only 48% faster than the decision tree, which does not have a special test.

Compiler, machine: (GCC 4.5.2, -O3, x86-64, 2867 Mhz Intel Core i5).

int highest_bit_unrolled(long long n)
{
  if (n & 0x7FFFFFFF00000000) {
    if (n & 0x7FFF000000000000) {
      if (n & 0x7F00000000000000) {
        if (n & 0x7000000000000000) {
          if (n & 0x4000000000000000)
            return 63;
          else
            return (n & 0x2000000000000000) ? 62 : 61;
        } else {
          if (n & 0x0C00000000000000)
            return (n & 0x0800000000000000) ? 60 : 59;
          else
            return (n & 0x0200000000000000) ? 58 : 57;
        }
      } else {
        if (n & 0x00F0000000000000) {
          if (n & 0x00C0000000000000)
            return (n & 0x0080000000000000) ? 56 : 55;
          else
            return (n & 0x0020000000000000) ? 54 : 53;
        } else {
          if (n & 0x000C000000000000)
            return (n & 0x0008000000000000) ? 52 : 51;
          else
            return (n & 0x0002000000000000) ? 50 : 49;
        }
      }
    } else {
      if (n & 0x0000FF0000000000) {
        if (n & 0x0000F00000000000) {
          if (n & 0x0000C00000000000)
            return (n & 0x0000800000000000) ? 48 : 47;
          else
            return (n & 0x0000200000000000) ? 46 : 45;
        } else {
          if (n & 0x00000C0000000000)
            return (n & 0x0000080000000000) ? 44 : 43;
          else
            return (n & 0x0000020000000000) ? 42 : 41;
        }
      } else {
        if (n & 0x000000F000000000) {
          if (n & 0x000000C000000000)
            return (n & 0x0000008000000000) ? 40 : 39;
          else
            return (n & 0x0000002000000000) ? 38 : 37;
        } else {
          if (n & 0x0000000C00000000)
            return (n & 0x0000000800000000) ? 36 : 35;
          else
            return (n & 0x0000000200000000) ? 34 : 33;
        }
      }
    }
  } else {
    if (n & 0x00000000FFFF0000) {
      if (n & 0x00000000FF000000) {
        if (n & 0x00000000F0000000) {
          if (n & 0x00000000C0000000)
            return (n & 0x0000000080000000) ? 32 : 31;
          else
            return (n & 0x0000000020000000) ? 30 : 29;
        } else {
          if (n & 0x000000000C000000)
            return (n & 0x0000000008000000) ? 28 : 27;
          else
            return (n & 0x0000000002000000) ? 26 : 25;
        }
      } else {
        if (n & 0x0000000000F00000) {
          if (n & 0x0000000000C00000)
            return (n & 0x0000000000800000) ? 24 : 23;
          else
            return (n & 0x0000000000200000) ? 22 : 21;
        } else {
          if (n & 0x00000000000C0000)
            return (n & 0x0000000000080000) ? 20 : 19;
          else
            return (n & 0x0000000000020000) ? 18 : 17;
        }
      }
    } else {
      if (n & 0x000000000000FF00) {
        if (n & 0x000000000000F000) {
          if (n & 0x000000000000C000)
            return (n & 0x0000000000008000) ? 16 : 15;
          else
            return (n & 0x0000000000002000) ? 14 : 13;
        } else {
          if (n & 0x0000000000000C00)
            return (n & 0x0000000000000800) ? 12 : 11;
          else
            return (n & 0x0000000000000200) ? 10 : 9;
        }
      } else {
        if (n & 0x00000000000000F0) {
          if (n & 0x00000000000000C0)
            return (n & 0x0000000000000080) ? 8 : 7;
          else
            return (n & 0x0000000000000020) ? 6 : 5;
        } else {
          if (n & 0x000000000000000C)
            return (n & 0x0000000000000008) ? 4 : 3;
          else
            return (n & 0x0000000000000002) ? 2 : (n ? 1 : 0);
        }
      }
    }
  }
}

int highest_bit(long long n)
{
  const long long mask[] = {
    0x000000007FFFFFFF,
    0x000000000000FFFF,
    0x00000000000000FF,
    0x000000000000000F,
    0x0000000000000003,
    0x0000000000000001
  };
  int hi = 64;
  int lo = 0;
  int i = 0;

  if (n == 0)
    return 0;

  for (i = 0; i < sizeof mask / sizeof mask[0]; i++) {
    int mi = lo + (hi - lo) / 2;

    if ((n >> mi) != 0)
      lo = mi;
    else if ((n & (mask[i] << lo)) != 0)
      hi = mi;
  }

  return lo + 1;
}

Quick and dirty test program:

#include <stdio.h>
#include <time.h>
#include <stdlib.h>

int highest_bit_unrolled(long long n);
int highest_bit(long long n);

main(int argc, char **argv)
{
  long long n = strtoull(argv[1], NULL, 0);
  int b1, b2;
  long i;
  clock_t start = clock(), mid, end;

  for (i = 0; i < 1000000000; i++)
    b1 = highest_bit_unrolled(n);

  mid = clock();

  for (i = 0; i < 1000000000; i++)
    b2 = highest_bit(n);

  end = clock();

  printf("highest bit of 0x%llx/%lld = %d, %d\n", n, n, b1, b2);

  printf("time1 = %d\n", (int) (mid - start));
  printf("time2 = %d\n", (int) (end - mid));
  return 0;
}

Using only -O2, the difference becomes greater. The decision tree is almost four times faster.

I also benchmarked against the naive bit shifting code:

int highest_bit_shift(long long n)
{
  int i = 0;
  for (; n; n >>= 1, i++)
    ; /* empty */
  return i;
}

This is only fast for small numbers, as one would expect. In determining that the highest bit is 1 for n == 1, it benchmarked more than 80% faster. However, half of randomly chosen numbers in the 63 bit space have the 63rd bit set!

On the input 0x3FFFFFFFFFFFFFFF, the decision tree version is quite a bit faster than it is on 1, and shows to be 1120% faster (12.2 times) than the bit shifter.

I will also benchmark the decision tree against the GCC builtins, and also try a mixture of inputs rather than repeating against the same number. There may be some sticking branch prediction going on and perhaps some unrealistic caching scenarios which makes it artificially faster on repetitions.

Tad answered 11/12, 2011 at 7:43 Comment(1)
I'm not saying this isn't good, but Your test program here only tests on the same number, which after 2-3 iterations will have set the branch predictors to their final position and after that they will make perfect branch predictions. The good thing is that with a totally random distribution half the numbers will have close to perfect prediction, namely bit63.Advisable
D
11

Although I would probably only use this method if I absolutely required the best possible performance (e.g. for writing some sort of board game AI involving bitboards), the most efficient solution is to use inline ASM. See the Optimisations section of this blog post for code with an explanation.

[...], the bsrl assembly instruction computes the position of the most significant bit. Thus, we could use this asm statement:

asm ("bsrl %1, %0" 
     : "=r" (position) 
     : "r" (number));
Dreher answered 22/3, 2009 at 23:46 Comment(6)
To expand: the standard loop solution (shifting left and checking MSB) is probably the most readable. As in all cases involving bit twiddling, the speed of ASM can't be beaten, though there's no point cluttering your code unless necessary. Hacks are an in-between solution - go one way or the other.Dreher
I'd say taking the logarithm would be a perfectly readable solution (check the generated asm to see if the compiler can optimize it to use this asm instruction)Pettifogger
Sometimes the inline ASM solution is slower, depending on the implementation in CPU microcode.Chuff
@rlbound: I can hardly believe that, although I may be mistaken. On any modern CPU one would think that it would get translated to a single instruction....Dreher
@Dreher it's a bit late but.. It's by definition a single instruction, but if it's microcoded as rlbond suggests then that single instruction could decode to a whole bunch of µops internally. That tends to be the case on AMD's microarchitectures, and Intel Atom, but on normal Intel microarchitectures it's a single operation all the way down.Galliwasp
@Galliwasp Thanks for the elaboration. Good to know.Dreher
C
10
unsigned int
msb32(register unsigned int x)
{
        x |= (x >> 1);
        x |= (x >> 2);
        x |= (x >> 4);
        x |= (x >> 8);
        x |= (x >> 16);
        return(x & ~(x >> 1));
}

1 register, 13 instructions. Believe it or not, this is usually faster than the BSR instruction mentioned above, which operates in linear time. This is logarithmic time.

From http://aggregate.org/MAGIC/#Most%20Significant%201%20Bit

Chuff answered 23/3, 2009 at 3:21 Comment(9)
The above code doesn't answer the question. It returns an unsigned integer where the most significant on bit in x remains on and all the other bits are turned off. The question was to return the position of the most significant on bit.Amortizement
You can then use a De Bruijn sequence approach to find the index of the bit that's set. :-)Breland
@Protagonist, he said in a comment that either suffices.Chuff
This one (from that same page) would do what you need, but it requires an additional function. aggregate.org/MAGIC/#Log2%20of%20an%20IntegerSizable
BSR is fast on Intel CPUs since Core2 at least. LZCNT is fast on AMD CPUs, and gcc uses it for __builtin_clz if it's enabled with -march=native or something (since it's fast on every CPU that supports it). Even on CPUs like AMD Bulldozer-family where BSR is "slow", it's not that slow: 7 m-ops with 4 cycle latency and one per 4c throughput. On Atom, BSR is really slow: 16 cycles. On Silvermont, it's 10 uops with 10 cycle latency. This might be slightly lower latency than BSR on Silvermont, but IDK.Toe
"this is usually faster than the BSR instruction" -- not relevant, since they don't do the same thing.Loisloise
The &~ can be replaced with bitwise xor, ^. At least clang 11 benefits as it doesn't notice the obvious (for humans) at O2 or O3. The compiler can't "see" easily that the bits below the MSB are all set.Ireland
Could anyone tell me how it works? I don't know why it finds the highest bit .Interstate
@Interstate The algorithm basically fills the all bits to the right side of the msb with 1s. So if x = 0101 1100, after the last shift, x = 0111 1111. Then you either xor it again with another shift, or &~, you will get 0100 0000Fabrianne
R
8

What about

int highest_bit(unsigned int a) {
    int count;
    std::frexp(a, &count);
    return count - 1;
}

?

Realize answered 1/12, 2013 at 1:17 Comment(1)
This is a slow (but more portable) version of this answer, which explains why it works.Toe
I
6

Here are some (simple) benchmarks, of algorithms currently given on this page...

The algorithms have not been tested over all inputs of unsigned int; so check that first, before blindly using something ;)

On my machine clz (__builtin_clz) and asm work best. asm seems even faster then clz... but it might be due to the simple benchmark...

//////// go.c ///////////////////////////////
// compile with:  gcc go.c -o go -lm
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/***************** math ********************/

#define POS_OF_HIGHESTBITmath(a) /* 0th position is the Least-Signif-Bit */    \
  ((unsigned) log2(a))         /* thus: do not use if a <= 0 */  

#define NUM_OF_HIGHESTBITmath(a) ((a)               \
                  ? (1U << POS_OF_HIGHESTBITmath(a))    \
                  : 0)



/***************** clz ********************/

unsigned NUM_BITS_U = ((sizeof(unsigned) << 3) - 1);
#define POS_OF_HIGHESTBITclz(a) (NUM_BITS_U - __builtin_clz(a)) /* only works for a != 0 */

#define NUM_OF_HIGHESTBITclz(a) ((a)                    \
                 ? (1U << POS_OF_HIGHESTBITclz(a))  \
                 : 0)


/***************** i2f ********************/

double FF;
#define POS_OF_HIGHESTBITi2f(a) (FF = (double)(ui|1), ((*(1+(unsigned*)&FF))>>20)-1023)


#define NUM_OF_HIGHESTBITi2f(a) ((a)                    \
                 ? (1U << POS_OF_HIGHESTBITi2f(a))  \
                 : 0)




/***************** asm ********************/

unsigned OUT;
#define POS_OF_HIGHESTBITasm(a) (({asm("bsrl %1,%0" : "=r"(OUT) : "r"(a));}), OUT)

#define NUM_OF_HIGHESTBITasm(a) ((a)                    \
                 ? (1U << POS_OF_HIGHESTBITasm(a))  \
                 : 0)




/***************** bitshift1 ********************/

#define NUM_OF_HIGHESTBITbitshift1(a) (({   \
  OUT = a;                  \
  OUT |= (OUT >> 1);                \
  OUT |= (OUT >> 2);                \
  OUT |= (OUT >> 4);                \
  OUT |= (OUT >> 8);                \
  OUT |= (OUT >> 16);               \
      }), (OUT & ~(OUT >> 1)))          \



/***************** bitshift2 ********************/
int POS[32] = {0, 1, 28, 2, 29, 14, 24, 3,
             30, 22, 20, 15, 25, 17, 4, 8, 31, 27, 13, 23, 21, 19,
             16, 7, 26, 12, 18, 6, 11, 5, 10, 9};

#define POS_OF_HIGHESTBITbitshift2(a) (({   \
  OUT = a;                  \
  OUT |= OUT >> 1;              \
  OUT |= OUT >> 2;              \
  OUT |= OUT >> 4;              \
  OUT |= OUT >> 8;              \
  OUT |= OUT >> 16;             \
  OUT = (OUT >> 1) + 1;             \
      }), POS[(OUT * 0x077CB531UL) >> 27])

#define NUM_OF_HIGHESTBITbitshift2(a) ((a)              \
                       ? (1U << POS_OF_HIGHESTBITbitshift2(a)) \
                       : 0)



#define LOOPS 100000000U

int main()
{
  time_t start, end;
  unsigned ui;
  unsigned n;

  /********* Checking the first few unsigned values (you'll need to check all if you want to use an algorithm here) **************/
  printf("math\n");
  for (ui = 0U; ui < 18; ++ui)
    printf("%i\t%i\n", ui, NUM_OF_HIGHESTBITmath(ui));

  printf("\n\n");

  printf("clz\n");
  for (ui = 0U; ui < 18U; ++ui)
    printf("%i\t%i\n", ui, NUM_OF_HIGHESTBITclz(ui));

  printf("\n\n");

  printf("i2f\n");
  for (ui = 0U; ui < 18U; ++ui)
    printf("%i\t%i\n", ui, NUM_OF_HIGHESTBITi2f(ui));

  printf("\n\n");

  printf("asm\n");
  for (ui = 0U; ui < 18U; ++ui) {
    printf("%i\t%i\n", ui, NUM_OF_HIGHESTBITasm(ui));
  }

  printf("\n\n");

  printf("bitshift1\n");
  for (ui = 0U; ui < 18U; ++ui) {
    printf("%i\t%i\n", ui, NUM_OF_HIGHESTBITbitshift1(ui));
  }

  printf("\n\n");

  printf("bitshift2\n");
  for (ui = 0U; ui < 18U; ++ui) {
    printf("%i\t%i\n", ui, NUM_OF_HIGHESTBITbitshift2(ui));
  }

  printf("\n\nPlease wait...\n\n");


  /************************* Simple clock() benchmark ******************/
  start = clock();
  for (ui = 0; ui < LOOPS; ++ui)
    n = NUM_OF_HIGHESTBITmath(ui);
  end = clock();
  printf("math:\t%e\n", (double)(end-start)/CLOCKS_PER_SEC);

  start = clock();
  for (ui = 0; ui < LOOPS; ++ui)
    n = NUM_OF_HIGHESTBITclz(ui);
  end = clock();
  printf("clz:\t%e\n", (double)(end-start)/CLOCKS_PER_SEC);

  start = clock();
  for (ui = 0; ui < LOOPS; ++ui)
    n = NUM_OF_HIGHESTBITi2f(ui);
  end = clock();
  printf("i2f:\t%e\n", (double)(end-start)/CLOCKS_PER_SEC);

  start = clock();
  for (ui = 0; ui < LOOPS; ++ui)
    n = NUM_OF_HIGHESTBITasm(ui);
  end = clock();
  printf("asm:\t%e\n", (double)(end-start)/CLOCKS_PER_SEC);

  start = clock();
  for (ui = 0; ui < LOOPS; ++ui)
    n = NUM_OF_HIGHESTBITbitshift1(ui);
  end = clock();
  printf("bitshift1:\t%e\n", (double)(end-start)/CLOCKS_PER_SEC);

  start = clock();
  for (ui = 0; ui < LOOPS; ++ui)
    n = NUM_OF_HIGHESTBITbitshift2(ui);
  end = clock();
  printf("bitshift2\t%e\n", (double)(end-start)/CLOCKS_PER_SEC);

  printf("\nThe lower, the better. Take note that a negative exponent is good! ;)\n");

  return EXIT_SUCCESS;
}
Intercede answered 8/7, 2011 at 14:20 Comment(1)
Be aware that testing numbers in increasing order can result in algorithms that use conditional branching internally getting an unrealistic benefit from the branch predictor in a modern CPU, since a sequence of nearby numbers will yield similar outcomes for conditional tests.Toodleoo
P
5

Some overly complex answers here. The Debruin technique should only be used when the input is already a power of two, otherwise there's a better way. For a power of 2 input, Debruin is the absolute fastest, even faster than _BitScanReverse on any processor I've tested. However, in the general case, _BitScanReverse (or whatever the intrinsic is called in your compiler) is the fastest (on certain CPU's it can be microcoded though).

If the intrinsic function is not an option, here is an optimal software solution for processing general inputs.

u8  inline log2 (u32 val)  {
    u8  k = 0;
    if (val > 0x0000FFFFu) { val >>= 16; k  = 16; }
    if (val > 0x000000FFu) { val >>= 8;  k |= 8;  }
    if (val > 0x0000000Fu) { val >>= 4;  k |= 4;  }
    if (val > 0x00000003u) { val >>= 2;  k |= 2;  }
    k |= (val & 2) >> 1;
    return k;
}

Note that this version does not require a Debruin lookup at the end, unlike most of the other answers. It computes the position in place.

Tables can be preferable though, if you call it repeatedly enough times, the risk of a cache miss becomes eclipsed by the speedup of a table.

u8 kTableLog2[256] = {
0,0,1,1,2,2,2,2,3,3,3,3,3,3,3,3,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,
5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,
6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,
6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,
7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,
7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,
7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,
7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7
};

u8 log2_table(u32 val)  {
    u8  k = 0;
    if (val > 0x0000FFFFuL) { val >>= 16; k  = 16; }
    if (val > 0x000000FFuL) { val >>=  8; k |=  8; }
    k |= kTableLog2[val]; // precompute the Log2 of the low byte

    return k;
}

This should produce the highest throughput of any of the software answers given here, but if you only call it occasionally, prefer a table-free solution like my first snippet.

Procyon answered 16/8, 2016 at 6:33 Comment(3)
Some of the answers are branchless, but this probably will compile with conditional branches. Did you only benchmark with the same value repeatedly, or a simple pattern or something? Branch misprediction is a killer for performance. stackoverflow.com/questions/11227809/…Toe
I test this regularly in my chess engine; this function is very performance critical for bitboard processing. Yes, there are patterns that occur in the effective data set that the CPU ends up taking advantage of. But on the other hand, I can't see testing with ultra-random inputs as being that realistic real world case to optimize for either.Procyon
Depends on your use-case for the function. If you're searching for the first free spot in an allocation bitmap (after finding the first chunk that's has any free spots with a != 0 or != ~0 loop), that's probably pretty random. Many ISAs have a single hardware instruction for this, which runs in constant time (typically 1 or 3 cycle latency, single uop), which is a pretty high bar to compare against. (i.e. without the compiler recognizing a pattern, there's a large gap between __builtin_clz vs. pure C, because C unfortunately never bothered to define a standard function for this CPU op.)Toe
C
4

I had a need for a routine to do this and before searching the web (and finding this page) I came up with my own solution basedon a binary search. Although I'm sure someone has done this before! It runs in constant time and can be faster than the "obvious" solution posted, although I'm not making any great claims, just posting it for interest.

int highest_bit(unsigned int a) {
  static const unsigned int maskv[] = { 0xffff, 0xff, 0xf, 0x3, 0x1 };
  const unsigned int *mask = maskv;
  int l, h;

  if (a == 0) return -1;

  l = 0;
  h = 32;

  do {
    int m = l + (h - l) / 2;

    if ((a >> m) != 0) l = m;
    else if ((a & (*mask << l)) != 0) h = m;

    mask++;
  } while (l < h - 1);

  return l;
}
Coprophilia answered 14/10, 2011 at 12:29 Comment(2)
Since you exit early when a == 0, the test in the else if branch always evaluates to true, so you can simplify it to just else h = m; and get rid of mask :)Toodleoo
(Reasoning: You maintain the invariant that at least one bit in the range [l, h) is 1, and l <= m <= h, so if there is no 1-bit in the range [m, h) then there must be a 1-bit in the remainder, namely [l, m).)Toodleoo
P
4

A version in C using successive approximation:

unsigned int getMsb(unsigned int n)
{
  unsigned int msb  = sizeof(n) * 4;
  unsigned int step = msb;
  while (step > 1)
 {
    step /=2;
    if (n>>msb)
     msb += step;
   else
     msb -= step;
 }
  if (n>>msb)
    msb++;
  return (msb - 1);
}

Advantage: the running time is constant regardless of the provided number, as the number of loops are always the same. ( 4 loops when using "unsigned int")

Panhellenism answered 24/11, 2014 at 9:44 Comment(1)
If you write it with a ternary operator (msb += (n>>msb) ? step : -step;), more compilers are likely to make branchless asm, avoiding branch mispredicts on every step (stackoverflow.com/questions/11227809/…).Toe
I
4

thats some kind of binary search, it works with all kinds of (unsigned!) integer types

#include <climits>
#define UINT (unsigned int)
#define UINT_BIT (CHAR_BIT*sizeof(UINT))

int msb(UINT x)
{
    if(0 == x)
        return -1;

    int c = 0;

    for(UINT i=UINT_BIT>>1; 0<i; i>>=1)
    if(static_cast<UINT>(x >> i))
    {
        x >>= i;
        c |= i;
    }

    return c;
}

to make complete:

#include <climits>
#define UINT unsigned int
#define UINT_BIT (CHAR_BIT*sizeof(UINT))

int lsb(UINT x)
{
    if(0 == x)
        return -1;

    int c = UINT_BIT-1;

    for(UINT i=UINT_BIT>>1; 0<i; i>>=1)
    if(static_cast<UINT>(x << i))
    {
        x <<= i;
        c ^= i;
    }

    return c;
}
Imtiaz answered 21/5, 2015 at 12:32 Comment(1)
Please consider not using ALL_CAPS for typedefs or indeed anything except preprocessor macros. This is a widely accepted convention.Addington
E
3

Expanding on Josh's benchmark... one can improve the clz as follows

/***************** clz2 ********************/

#define NUM_OF_HIGHESTBITclz2(a) ((a)                              \
                  ? (((1U) << (sizeof(unsigned)*8-1)) >> __builtin_clz(a)) \
                  : 0)

Regarding the asm: note that there are bsr and bsrl (this is the "long" version). the normal one might be a bit faster.

Emden answered 9/7, 2011 at 8:14 Comment(0)
O
3

As the answers above point out, there are a number of ways to determine the most significant bit. However, as was also pointed out, the methods are likely to be unique to either 32bit or 64bit registers. The stanford.edu bithacks page provides solutions that work for both 32bit and 64bit computing. With a little work, they can be combined to provide a solid cross-architecture approach to obtaining the MSB. The solution I arrived at that compiled/worked across 64 & 32 bit computers was:

#if defined(__LP64__) || defined(_LP64)
# define BUILD_64   1
#endif

#include <stdio.h>
#include <stdint.h>  /* for uint32_t */

/* CHAR_BIT  (or include limits.h) */
#ifndef CHAR_BIT
#define CHAR_BIT  8
#endif  /* CHAR_BIT */

/* 
 * Find the log base 2 of an integer with the MSB N set in O(N)
 * operations. (on 64bit & 32bit architectures)
 */
int
getmsb (uint32_t word)
{
    int r = 0;
    if (word < 1)
        return 0;
#ifdef BUILD_64
    union { uint32_t u[2]; double d; } t;  // temp
    t.u[__FLOAT_WORD_ORDER==LITTLE_ENDIAN] = 0x43300000;
    t.u[__FLOAT_WORD_ORDER!=LITTLE_ENDIAN] = word;
    t.d -= 4503599627370496.0;
    r = (t.u[__FLOAT_WORD_ORDER==LITTLE_ENDIAN] >> 20) - 0x3FF;
#else
    while (word >>= 1)
    {
        r++;
    }
#endif  /* BUILD_64 */
    return r;
}
Outweigh answered 26/5, 2014 at 8:48 Comment(1)
Wasn't int r; originally defined above the #ifdef BUILD_64 flag? In which case it would not need redefinition within the conditional.Outweigh
S
3

has given us log2. This removes the need for all the special sauce log2 implementations you see on this page. You can use the standard's log2 implementation like this:

const auto n = 13UL;
const auto Index = (unsigned long)log2(n);

printf("MSB is: %u\n", Index); // Prints 3 (zero offset)

An n of 0UL needs to be guarded against as well, because:

-∞ is returned and FE_DIVBYZERO is raised

I have written an example with that check that arbitrarily sets Index to ULONG_MAX here: https://ideone.com/u26vsi


The corollary to ephemient's gcc only answer is:

const auto n = 13UL;
unsigned long Index;

_BitScanReverse(&Index, n);
printf("MSB is: %u\n", Index); // Prints 3 (zero offset)

The documentation for _BitScanReverse states that Index is:

Loaded with the bit position of the first set bit (1) found

In practice I've found that if n is 0UL that Index is set to 0UL, just as it would be for an n of 1UL. But the only thing guaranteed in the documentation in the case of an n of 0UL is that the return is:

0 if no set bits were found

Thus, similarly to the preferable log2 implementation above the return should be checked setting Index to a flagged value in this case. I've again written an example of using ULONG_MAX for this flag value here: http://rextester.com/GCU61409

Soupspoon answered 12/1, 2015 at 19:5 Comment(10)
No, _BitScanReverse returns 0 only if the input was 0. This is like x86's BSR instruction, which sets ZF based only on the input, not the output. Interesting that MS words the docs as leaving index unset when no 1 bit is found; that matches the x86 asm behaviour of bsr, too. (AMD documents it as leaving the destination register unmodified on src=0, but Intel just says undefined output even though their CPUs do implement the leave-unmodified behaviour.) This is unlike x86's lzcnt, which gives 32 for not-found.Toe
@PeterCordes _BitScanReverse uses zero-based indexing, thus if n is 1 then the index of the set bit is in fact 0. Unfortunately, as you say if n is 0 then the output is also 0 :( This means there's no way to use the return to distinguish between an n of 1 or 0. That's what I was trying to communicate. Do you think there's a better way to say this?Soupspoon
I think you're talking about how it sets Index. That's not the return value. It returns a boolean that's false if the input was zero (and this is why Index is passed by reference instead of being returned normally). godbolt.org/g/gQKJdE. And I checked: despite the wording of MS's docs, _BitScanReverse doesn't leave Index unset on n==0: you just get whatever value was in the register it happened to use. (Which in your case was probably the same register it used for Index afterward, leading to you seeing a 0).Toe
This question is not tagged c++.Kordofanian
@Kordofanian Thanks, I forgot myself. Given that the question is C we've actually had log2 since C99.Soupspoon
_BitScanReverse returns "Nonzero if Index was set, or 0 if no set bits were found". When n=0 is passed, index is not modified ans thus the answer's code gives undefined result when n=0 instead of 13.Boatman
@Boatman No, Index is zero'd if all bits are unset as well: rextester.com/BHFXS35929 Meaning that testing the return of _BitScanReverse is the only way to know what an out parameter of 0 actually means.Soupspoon
@Boatman Ugh I just reread my answer, it said return when it should have said out param. I've updated.Soupspoon
@JonathanMee: from both a recent initially bad experience, MS's spec and Intel's spec: when the input n is zero, _BitScanReverse finds no bit set, leaves Index unspecified (or is it unchanged, but that's not what I experienced), and returns 0 to warn about that condition. In my experience, actual results will depend on compiler options and surrounding code.Boatman
@Boatman I believe the right choice here is log2 but I agree I'm depending on undocumented behavior on my _BitScanReverse example, which is questionable at best. I've updated my answer with a check that appropriately sets a flag value on Index.Soupspoon
A
3

I know this question is very old, but just having implemented an msb() function myself, I found that most solutions presented here and on other websites are not necessarily the most efficient - at least for my personal definition of efficiency (see also Update below). Here's why:

Most solutions (especially those which employ some sort of binary search scheme or the naïve approach which does a linear scan from right to left) seem to neglect the fact that for arbitrary binary numbers, there are not many which start with a very long sequence of zeros. In fact, for any bit-width, half of all integers start with a 1 and a quarter of them start with 01. See where i'm getting at? My argument is that a linear scan starting from the most significant bit position to the least significant (left to right) is not so "linear" as it might look like at first glance.

It can be shown1, that for any bit-width, the average number of bits that need to be tested is at most 2. This translates to an amortized time complexity of O(1) with respect to the number of bits (!).

Of course, the worst case is still O(n), worse than the O(log(n)) you get with binary-search-like approaches, but since there are so few worst cases, they are negligible for most applications (Update: not quite: There may be few, but they might occur with high probability - see Update below).

Here is the "naïve" approach i've come up with, which at least on my machine beats most other approaches (binary search schemes for 32-bit ints always require log2(32) = 5 steps, whereas this silly algorithm requires less than 2 on average) - sorry for this being C++ and not pure C:

template <typename T>
auto msb(T n) -> int
{
    static_assert(std::is_integral<T>::value && !std::is_signed<T>::value,
        "msb<T>(): T must be an unsigned integral type.");

    for (T i = std::numeric_limits<T>::digits - 1, mask = 1 << i; i >= 0; --i, mask >>= 1)
    {
        if ((n & mask) != 0)
            return i;
    }

    return 0;
}

Update: While what i wrote here is perfectly true for arbitrary integers, where every combination of bits is equally probable (my speed test simply measured how long it took to determine the MSB for all 32-bit integers), real-life integers, for which such a function will be called, usually follow a different pattern: In my code, for example, this function is used to determine whether an object size is a power of 2, or to find the next power of 2 greater or equal than an object size. My guess is that most applications using the MSB involve numbers which are much smaller than the maximum number an integer can represent (object sizes rarely utilize all the bits in a size_t). In this case, my solution will actually perform worse than a binary search approach - so the latter should probably be preferred, even though my solution will be faster looping through all integers.
TL;DR: Real-life integers will probably have a bias towards the worst case of this simple algorithm, which will make it perform worse in the end - despite the fact that it's amortized O(1) for truly arbitrary integers.

1The argument goes like this (rough draft): Let n be the number of bits (bit-width). There are a total of 2n integers wich can be represented with n bits. There are 2n - 1 integers starting with a 1 (first 1 is fixed, remaining n - 1 bits can be anything). Those integers require only one interation of the loop to determine the MSB. Further, There are 2n - 2 integers starting with 01, requiring 2 iterations, 2n - 3 integers starting with 001, requiring 3 iterations, and so on.

If we sum up all the required iterations for all possible integers and divide them by 2n, the total number of integers, we get the average number of iterations needed for determining the MSB for n-bit integers:

(1 * 2n - 1 + 2 * 2n - 2 + 3 * 2n - 3 + ... + n) / 2n

This series of average iterations is actually convergent and has a limit of 2 for n towards infinity

Thus, the naïve left-to-right algorithm has actually an amortized constant time complexity of O(1) for any number of bits.

Ablebodied answered 30/12, 2016 at 1:17 Comment(1)
I don't think it's necessarily a fair assumption that the inputs to msb functions tend to be evenly distributed. In practice, these inputs tend to be interrupt registers or bitboards or some other data structure with unevenly distributed values. For a fair benchmark I think it is safer to assume that the outputs (not the inputs) will be evenly distributed.Cellulosic
C
3

Woaw, that was many answers. I am not sorry for answering on an old question.

int result = 0;//could be a char or int8_t instead
if(value){//this assumes the value is 64bit
    if(0xFFFFFFFF00000000&value){  value>>=(1<<5); result|=(1<<5);  }//if it is 32bit then remove this line
    if(0x00000000FFFF0000&value){  value>>=(1<<4); result|=(1<<4);  }//and remove the 32msb
    if(0x000000000000FF00&value){  value>>=(1<<3); result|=(1<<3);  }
    if(0x00000000000000F0&value){  value>>=(1<<2); result|=(1<<2);  }
    if(0x000000000000000C&value){  value>>=(1<<1); result|=(1<<1);  }
    if(0x0000000000000002&value){  result|=(1<<0);  }
}else{
  result=-1;
}

This answer is pretty similar to another answer... oh well.

Carnarvon answered 27/5, 2017 at 20:35 Comment(4)
Writing the shift amounts as 1<<k is a nice touch. What about the masks? (1 << (1<<k-1)-1<< (1<<k-1)? (most optimal? You compare a superlative?)Oballa
@Oballa If you look at the edits of this question you will see when I added the "optimal" part. I forgot to remove it as I changed my answer. Also I'm not sure why you are are talking about the masks? (What masks? I'm not following you)Carnarvon
((bit)mask are values used to select/clear bits selectively/used in & and &~.) You could replace the hex constants by the likes of ((type)1<<(1<<k))-1<<(1<<k).Oballa
Oh right, I'm using masks, I totally forgot about that. I did answer this a couple of month's ago... - Hmmm, well since it's evaluated during compile time I say it's equivalent to the hex values. However, one is cryptic and one is hexadecimal.Carnarvon
S
2

Think bitwise operators.

I missunderstood the question the first time. You should produce an int with the leftmost bit set (the others zero). Assuming cmp is set to that value:

position = sizeof(int)*8
while(!(n & cmp)){ 
   n <<=1;
   position--;
}
Schermerhorn answered 22/3, 2009 at 23:51 Comment(3)
What do you mean converting to a string? The definition of ffs takes an int and returns an int. Where would the conversion be? And what purpose would the conversion serve if we are looking for bits in a word?Mountainous
I didn't know of that function.Schermerhorn
The 8 should be CHAR_BIT. This is very unlikely to be the fastest way, because branch misprediction will happen on exiting the loop unless this is used with the same input repeatedly. Also, for small inputs (lots of zeros), it has to loop a lot. This is like the fallback way that you'd use as the easy-to-verify version in a unit test to compare against optimized versions.Toe
P
2

Another poster provided a lookup-table using a byte-wide lookup. In case you want to eke out a bit more performance (at the cost of 32K of memory instead of just 256 lookup entries) here is a solution using a 15-bit lookup table, in C# 7 for .NET.

The interesting part is initializing the table. Since it's a relatively small block that we want for the lifetime of the process, I allocate unmanaged memory for this by using Marshal.AllocHGlobal. As you can see, for maximum performance, the whole example is written as native:

readonly static byte[] msb_tab_15;

// Initialize a table of 32768 bytes with the bit position (counting from LSB=0)
// of the highest 'set' (non-zero) bit of its corresponding 16-bit index value.
// The table is compressed by half, so use (value >> 1) for indexing.
static MyStaticInit()
{
    var p = new byte[0x8000];

    for (byte n = 0; n < 16; n++)
        for (int c = (1 << n) >> 1, i = 0; i < c; i++)
            p[c + i] = n;

    msb_tab_15 = p;
}

The table requires one-time initialization via the code above. It is read-only so a single global copy can be shared for concurrent access. With this table you can quickly look up the integer log2, which is what we're looking for here, for all the various integer widths (8, 16, 32, and 64 bits).

Notice that the table entry for 0, the sole integer for which the notion of 'highest set bit' is undefined, is given the value -1. This distinction is necessary for proper handling of 0-valued upper words in the code below. Without further ado, here is the code for each of the various integer primitives:

ulong (64-bit) Version

/// <summary> Index of the highest set bit in 'v', or -1 for value '0' </summary>
public static int HighestOne(this ulong v)
{
    if ((long)v <= 0)
        return (int)((v >> 57) & 0x40) - 1;      // handles cases v==0 and MSB==63

    int j = /**/ (int)((0xFFFFFFFFU - v /****/) >> 58) & 0x20;
    j |= /*****/ (int)((0x0000FFFFU - (v >> j)) >> 59) & 0x10;
    return j + msb_tab_15[v >> (j + 1)];
}

uint (32-bit) Version

/// <summary> Index of the highest set bit in 'v', or -1 for value '0' </summary>
public static int HighestOne(uint v)
{
    if ((int)v <= 0)
        return (int)((v >> 26) & 0x20) - 1;     // handles cases v==0 and MSB==31

    int j = (int)((0x0000FFFFU - v) >> 27) & 0x10;
    return j + msb_tab_15[v >> (j + 1)];
}

Various overloads for the above

public static int HighestOne(long v) => HighestOne((ulong)v);
public static int HighestOne(int v) => HighestOne((uint)v);
public static int HighestOne(ushort v) => msb_tab_15[v >> 1];
public static int HighestOne(short v) => msb_tab_15[(ushort)v >> 1];
public static int HighestOne(char ch) => msb_tab_15[ch >> 1];
public static int HighestOne(sbyte v) => msb_tab_15[(byte)v >> 1];
public static int HighestOne(byte v) => msb_tab_15[v >> 1];

This is a complete, working solution which represents the best performance on .NET 4.7.2 for numerous alternatives that I compared with a specialized performance test harness. Some of these are mentioned below. The test parameters were a uniform density of all 65 bit positions, i.e., 0 ... 31/63 plus value 0 (which produces result -1). The bits below the target index position were filled randomly. The tests were x64 only, release mode, with JIT-optimizations enabled.




That's the end of my formal answer here; what follows are some casual notes and links to source code for alternative test candidates associated with the testing I ran to validate the performance and correctness of the above code.


The version provided above above, coded as Tab16A was a consistent winner over many runs. These various candidates, in active working/scratch form, can be found here, here, and here.

 1  candidates.HighestOne_Tab16A               622,496
 2  candidates.HighestOne_Tab16C               628,234
 3  candidates.HighestOne_Tab8A                649,146
 4  candidates.HighestOne_Tab8B                656,847
 5  candidates.HighestOne_Tab16B               657,147
 6  candidates.HighestOne_Tab16D               659,650
 7  _highest_one_bit_UNMANAGED.HighestOne_U    702,900
 8  de_Bruijn.IndexOfMSB                       709,672
 9  _old_2.HighestOne_Old2                     715,810
10  _test_A.HighestOne8                        757,188
11  _old_1.HighestOne_Old1                     757,925
12  _test_A.HighestOne5  (unsafe)              760,387
13  _test_B.HighestOne8  (unsafe)              763,904
14  _test_A.HighestOne3  (unsafe)              766,433
15  _test_A.HighestOne1  (unsafe)              767,321
16  _test_A.HighestOne4  (unsafe)              771,702
17  _test_B.HighestOne2  (unsafe)              772,136
18  _test_B.HighestOne1  (unsafe)              772,527
19  _test_B.HighestOne3  (unsafe)              774,140
20  _test_A.HighestOne7  (unsafe)              774,581
21  _test_B.HighestOne7  (unsafe)              775,463
22  _test_A.HighestOne2  (unsafe)              776,865
23  candidates.HighestOne_NoTab                777,698
24  _test_B.HighestOne6  (unsafe)              779,481
25  _test_A.HighestOne6  (unsafe)              781,553
26  _test_B.HighestOne4  (unsafe)              785,504
27  _test_B.HighestOne5  (unsafe)              789,797
28  _test_A.HighestOne0  (unsafe)              809,566
29  _test_B.HighestOne0  (unsafe)              814,990
30  _highest_one_bit.HighestOne                824,345
30  _bitarray_ext.RtlFindMostSignificantBit    894,069
31  candidates.HighestOne_Naive                898,865

Notable is that the terrible performance of ntdll.dll!RtlFindMostSignificantBit via P/Invoke:

[DllImport("ntdll.dll"), SuppressUnmanagedCodeSecurity, SecuritySafeCritical]
public static extern int RtlFindMostSignificantBit(ulong ul);

It's really too bad, because here's the entire actual function:

    RtlFindMostSignificantBit:
        bsr rdx, rcx  
        mov eax,0FFFFFFFFh  
        movzx ecx, dl  
        cmovne      eax,ecx  
        ret

I can't imagine the poor performance originating with these five lines, so the managed/native transition penalties must be to blame. I was also surprised that the testing really favored the 32KB (and 64KB) short (16-bit) direct-lookup tables over the 128-byte (and 256-byte) byte (8-bit) lookup tables. I thought the following would be more competitive with the 16-bit lookups, but the latter consistently outperformed this:

public static int HighestOne_Tab8A(ulong v)
{
    if ((long)v <= 0)
        return (int)((v >> 57) & 64) - 1;

    int j;
    j =  /**/ (int)((0xFFFFFFFFU - v) >> 58) & 32;
    j += /**/ (int)((0x0000FFFFU - (v >> j)) >> 59) & 16;
    j += /**/ (int)((0x000000FFU - (v >> j)) >> 60) & 8;
    return j + msb_tab_8[v >> j];
}

The last thing I'll point out is that I was quite shocked that my deBruijn method didn't fare better. This is the method that I had previously been using pervasively:

const ulong N_bsf64 = 0x07EDD5E59A4E28C2,
            N_bsr64 = 0x03F79D71B4CB0A89;

readonly public static sbyte[]
bsf64 =
{
    63,  0, 58,  1, 59, 47, 53,  2, 60, 39, 48, 27, 54, 33, 42,  3,
    61, 51, 37, 40, 49, 18, 28, 20, 55, 30, 34, 11, 43, 14, 22,  4,
    62, 57, 46, 52, 38, 26, 32, 41, 50, 36, 17, 19, 29, 10, 13, 21,
    56, 45, 25, 31, 35, 16,  9, 12, 44, 24, 15,  8, 23,  7,  6,  5,
},
bsr64 =
{
     0, 47,  1, 56, 48, 27,  2, 60, 57, 49, 41, 37, 28, 16,  3, 61,
    54, 58, 35, 52, 50, 42, 21, 44, 38, 32, 29, 23, 17, 11,  4, 62,
    46, 55, 26, 59, 40, 36, 15, 53, 34, 51, 20, 43, 31, 22, 10, 45,
    25, 39, 14, 33, 19, 30,  9, 24, 13, 18,  8, 12,  7,  6,  5, 63,
};

public static int IndexOfLSB(ulong v) =>
    v != 0 ? bsf64[((v & (ulong)-(long)v) * N_bsf64) >> 58] : -1;

public static int IndexOfMSB(ulong v)
{
    if ((long)v <= 0)
        return (int)((v >> 57) & 64) - 1;

    v |= v >> 1; v |= v >> 2;  v |= v >> 4;   // does anybody know a better
    v |= v >> 8; v |= v >> 16; v |= v >> 32;  // way than these 12 ops?
    return bsr64[(v * N_bsr64) >> 58];
}

There's much discussion of how superior and great deBruijn methods at this SO question, and I had tended to agree. My speculation is that, while both the deBruijn and direct lookup table methods (that I found to be fastest) both have to do a table lookup, and both have very minimal branching, only the deBruijn has a 64-bit multiply operation. I only tested the IndexOfMSB functions here--not the deBruijn IndexOfLSB--but I expect the latter to fare much better chance since it has so many fewer operations (see above), and I'll likely continue to use it for LSB.

Predator answered 26/10, 2017 at 13:41 Comment(3)
L1D cache on modern x86 CPUs is only 32kiB. A large LUT is likely to be worse than a small LUT unless you're using the same values repeatedly. If you aren't, you'll get frequent cache misses.Toe
When benchmarking the large LUT, you should 1. read inputs from an array, and 2. randomly permute the array first. That emulates real application behavior: pretty much nobody will be calling this function with the argument coming from a loop induction variable. It will be coming from memory somewhere, and there will be some cache pressure because of it. When you do that, the large LUT is a solid loser. It's dangerous to even suggest it, because people who don't know better may get wrong ideas.Rejuvenate
The deBruijn method is slow as shown because it's one big serial data dependency and nothing there can be parallelized. Such algorithms only save time on in-order scalar CPUs. Try breaking dependencies: ulong v1 = v>>1, v2 = v>>2, v3 = v>>3, vA = (v>>4)|v1|v2|v3, vA4 = vA>>4, vA8 = vA>>8, vA16 = vA>>16, vB = (vA>>24)|vA|vA4|vA8|vA16, v = vB|(vB>>32);. Feel free to check if this is any faster. It shouldn't be slower at least on modern Intel Core, and I expect it would take about 5/6ths of the time.Rejuvenate
A
2

This looks big but works really fast compared to loop thank from bluegsmith

int Bit_Find_MSB_Fast(int x2)
{
    long x = x2 & 0x0FFFFFFFFl;
    long num_even = x & 0xAAAAAAAA;
    long num_odds = x & 0x55555555;

    if (x == 0) return(0);

    if (num_even > num_odds)
    {
        if ((num_even & 0xFFFF0000) != 0) // top 4
        {
            if ((num_even & 0xFF000000) != 0)
            {
                if ((num_even & 0xF0000000) != 0)
                {
                    if ((num_even & 0x80000000) != 0) return(32);
                    else
                        return(30);
                }
                else
                {
                    if ((num_even & 0x08000000) != 0) return(28);
                    else
                        return(26);
                }
            }
            else
            {
                if ((num_even & 0x00F00000) != 0)
                {
                    if ((num_even & 0x00800000) != 0) return(24);
                    else
                        return(22);
                }
                else
                {
                    if ((num_even & 0x00080000) != 0) return(20);
                    else
                        return(18);
                }
            }
        }
        else
        {
            if ((num_even & 0x0000FF00) != 0)
            {
                if ((num_even & 0x0000F000) != 0)
                {
                    if ((num_even & 0x00008000) != 0) return(16);
                    else
                        return(14);
                }
                else
                {
                    if ((num_even & 0x00000800) != 0) return(12);
                    else
                        return(10);
                }
            }
            else
            {
                if ((num_even & 0x000000F0) != 0)
                {
                    if ((num_even & 0x00000080) != 0)return(8);
                    else
                        return(6);
                }
                else
                {
                    if ((num_even & 0x00000008) != 0) return(4);
                    else
                        return(2);
                }
            }
        }
    }
    else
    {
        if ((num_odds & 0xFFFF0000) != 0) // top 4
        {
            if ((num_odds & 0xFF000000) != 0)
            {
                if ((num_odds & 0xF0000000) != 0)
                {
                    if ((num_odds & 0x40000000) != 0) return(31);
                    else
                        return(29);
                }
                else
                {
                    if ((num_odds & 0x04000000) != 0) return(27);
                    else
                        return(25);
                }
            }
            else
            {
                if ((num_odds & 0x00F00000) != 0)
                {
                    if ((num_odds & 0x00400000) != 0) return(23);
                    else
                        return(21);
                }
                else
                {
                    if ((num_odds & 0x00040000) != 0) return(19);
                    else
                        return(17);
                }
            }
        }
        else
        {
            if ((num_odds & 0x0000FF00) != 0)
            {
                if ((num_odds & 0x0000F000) != 0)
                {
                    if ((num_odds & 0x00004000) != 0) return(15);
                    else
                        return(13);
                }
                else
                {
                    if ((num_odds & 0x00000400) != 0) return(11);
                    else
                        return(9);
                }
            }
            else
            {
                if ((num_odds & 0x000000F0) != 0)
                {
                    if ((num_odds & 0x00000040) != 0)return(7);
                    else
                        return(5);
                }
                else
                {
                    if ((num_odds & 0x00000004) != 0) return(3);
                    else
                        return(1);
                }
            }
        }
    }
}
Approachable answered 9/3, 2021 at 20:41 Comment(0)
C
2

There's a proposal to add bit manipulation functions in C, specifically leading zeros is helpful to find highest bit set. See http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2827.htm#design-bit-leading.trailing.zeroes.ones

They are expected to be implemented as built-ins where possible, so sure it is an efficient way.

This is similar to what was recently added to C++ (std::countl_zero, etc).

Cairistiona answered 14/12, 2021 at 8:1 Comment(0)
A
2

Since I seemingly have nothing else to do, I dedicated an inordinate amount of time to this problem during the weekend.

Without direct hardware support, it SEEMED like it should be possible to do better than O(log(w)) for w=64bit. And indeed, it is possible to do it in O(log log w), except the performance crossover doesn't happen until w>=256bit.

Either way, I gave it a go and the best I could come up with was the following mix of techniques:

uint64_t msb64 (uint64_t n) {

    const uint64_t  M1 = 0x1111111111111111;

    // we need to clear blocks of b=4 bits: log(w/b) >= b
    n |= (n>>1); n |= (n>>2);

    // reverse prefix scan, compiles to 1 mulx
    uint64_t s = ((M1<<4)*(__uint128_t)(n&M1))>>64;

    // parallel-reduce each block
    s |= (s>>1);    s |= (s>>2);

    // parallel reduce, 1 imul
    uint64_t c = (s&M1)*(M1<<4);

    // collect last nibble, generate compute count - count%4
    c = c >> (64-4-2); // move last nibble to lowest bits leaving two extra bits
    c &= (0x0F<<2);    // zero the lowest 2 bits

    // add the missing bits; this could be better solved with a bit of foresight
    // by having the sum already stored
    uint8_t b = (n >> c); // & 0x0F; // no need to zero the bits over the msb

    const uint64_t  S = 0x3333333322221100; // last should give -1ul
    return c | ((S>>(4*b)) & 0x03);
}

This solution is branchless and doesn't require an external table that can generate cache misses. The two 64-bit multiplications aren't much of a performance issue in modern x86-64 architectures.

I benchmarked the 64-bit versions of some of the most common solutions presented here and elsewhere. Finding a consistent timing and ranking proved to be way harder than I expected. This has to do not only with the distribution of the inputs, but also with out-of-order execution, and other CPU shennanigans, which can sometimes overlap the computation of two or more cycles in a loop.

I ran the tests on an AMD Zen using RDTSC and taking a number of precautions such as running a warm-up, introducing artificial chain dependencies, and so on.

For a 64-bit pseudorandom even distribution the results are:

name cycles comment
clz 5.16 builtin intrinsic, fastest
cast 5.18 cast to double, extract exp
ulog2 7.50 reduction + deBrujin
msb64* 11.26 this version
unrolled 19.12 varying performance
obvious 110.49 "obviously" slowest for int64

Casting to double is always surprisingly close to the builtin intrinsic. The "obvious" way of adding the bits one at a time has the largest spread in performance of all, being comparable to the fastest methods for small numbers and 20x slower for the largest ones.

My method is around 50% slower than deBrujin, but has the advantage of using no extra memory and having a predictable performance. I might try to further optimize it if I ever have time.

Arcade answered 13/11, 2022 at 22:12 Comment(0)
T
1

Putting this in since it's 'yet another' approach, seems to be different from others already given.

returns -1 if x==0, otherwise floor( log2(x)) (max result 31)

Reduce from 32 to 4 bit problem, then use a table. Perhaps inelegant, but pragmatic.

This is what I use when I don't want to use __builtin_clz because of portability issues.

To make it more compact, one could instead use a loop to reduce, adding 4 to r each time, max 7 iterations. Or some hybrid, such as (for 64 bits): loop to reduce to 8, test to reduce to 4.

int log2floor( unsigned x ){
   static const signed char wtab[16] = {-1,0,1,1, 2,2,2,2, 3,3,3,3,3,3,3,3};
   int r = 0;
   unsigned xk = x >> 16;
   if( xk != 0 ){
       r = 16;
       x = xk;
   }
   // x is 0 .. 0xFFFF
   xk = x >> 8;
   if( xk != 0){
       r += 8;
       x = xk;
   }
   // x is 0 .. 0xFF
   xk = x >> 4;
   if( xk != 0){
       r += 4;
       x = xk;
   }
   // now x is 0..15; x=0 only if originally zero.
   return r + wtab[x];
}
Teen answered 14/10, 2012 at 20:36 Comment(0)
S
1

The code:

    // x>=1;
    unsigned func(unsigned x) {
    double d = x ;
    int p= (*reinterpret_cast<long long*>(&d) >> 52) - 1023;
    printf( "The left-most non zero bit of %d is bit %d\n", x, p);
    }

Or get the integer part of FPU instruction FYL2X (Y*Log2 X) by setting Y=1

Synergistic answered 29/6, 2015 at 10:21 Comment(5)
uhhhhh. what? how does this function? is it in any way portable?Addington
Codes in the window is portable. The function FYL2X() is a fpu instruction, but may be ported and may found in some FPU/math library.Synergistic
@Addington It works because floating point numbers are normalized ... converting to double shifts the mantissa bits to eliminate leading zeros, and this code extracts the exponent and adjusts it to determine the number of bits shifted. It certainly isn't architecture-independent, but it will probably work on any machine you come across.Loisloise
This is an alternate version of this answer, see there for comments on performance and portability. (Specifically the non-portability of pointer casting for type-punning.) It uses address math to only reload the high 32 bits of the double, which is probably good if it does actually store/reload instead of type-pun some other way, e.g. with a movq instruction like you might get here on x86.Toe
Also note my [comment to that answer], where I offer the dire warning that this method gives the wrong answer for values in (at least) the range [7FFFFFFFFFFFFE00 - 7FFFFFFFFFFFFFFF].Predator
H
1

Note that what you are trying to do is calculate the integer log2 of an integer,

#include <stdio.h>
#include <stdlib.h>

unsigned int
Log2(unsigned long x)
{
    unsigned long n = x;
    int bits = sizeof(x)*8;
    int step = 1; int k=0;
    for( step = 1; step < bits; ) {
        n |= (n >> step);
        step *= 2; ++k;
    }
    //printf("%ld %ld\n",x, (x - (n >> 1)) );
    return(x - (n >> 1));
}

Observe that you can attempt to search more than 1 bit at a time.

unsigned int
Log2_a(unsigned long x)
{
    unsigned long n = x;
    int bits = sizeof(x)*8;
    int step = 1;
    int step2 = 0;
    //observe that you can move 8 bits at a time, and there is a pattern...
    //if( x>1<<step2+8 ) { step2+=8;
        //if( x>1<<step2+8 ) { step2+=8;
            //if( x>1<<step2+8 ) { step2+=8;
            //}
        //}
    //}
    for( step2=0; x>1L<<step2+8; ) {
        step2+=8;
    }
    //printf("step2 %d\n",step2);
    for( step = 0; x>1L<<(step+step2); ) {
        step+=1;
        //printf("step %d\n",step+step2);
    }
    printf("log2(%ld) %d\n",x,step+step2);
    return(step+step2);
}

This approach uses a binary search

unsigned int
Log2_b(unsigned long x)
{
    unsigned long n = x;
    unsigned int bits = sizeof(x)*8;
    unsigned int hbit = bits-1;
    unsigned int lbit = 0;
    unsigned long guess = bits/2;
    int found = 0;

    while ( hbit-lbit>1 ) {
        //printf("log2(%ld) %d<%d<%d\n",x,lbit,guess,hbit);
        //when value between guess..lbit
        if( (x<=(1L<<guess)) ) {
           //printf("%ld < 1<<%d %ld\n",x,guess,1L<<guess);
            hbit=guess;
            guess=(hbit+lbit)/2;
            //printf("log2(%ld) %d<%d<%d\n",x,lbit,guess,hbit);
        }
        //when value between hbit..guess
        //else
        if( (x>(1L<<guess)) ) {
            //printf("%ld > 1<<%d %ld\n",x,guess,1L<<guess);
            lbit=guess;
            guess=(hbit+lbit)/2;
            //printf("log2(%ld) %d<%d<%d\n",x,lbit,guess,hbit);
        }
    }
    if( (x>(1L<<guess)) ) ++guess;
    printf("log2(x%ld)=r%d\n",x,guess);
    return(guess);
}

Another binary search method, perhaps more readable,

unsigned int
Log2_c(unsigned long x)
{
    unsigned long v = x;
    unsigned int bits = sizeof(x)*8;
    unsigned int step = bits;
    unsigned int res = 0;
    for( step = bits/2; step>0; )
    {
        //printf("log2(%ld) v %d >> step %d = %ld\n",x,v,step,v>>step);
        while ( v>>step ) {
            v>>=step;
            res+=step;
            //printf("log2(%ld) step %d res %d v>>step %ld\n",x,step,res,v);
        }
        step /= 2;
    }
    if( (x>(1L<<res)) ) ++res;
    printf("log2(x%ld)=r%ld\n",x,res);
    return(res);
}

And because you will want to test these,

int main()
{
    unsigned long int x = 3;
    for( x=2; x<1000000000; x*=2 ) {
        //printf("x %ld, x+1 %ld, log2(x+1) %d\n",x,x+1,Log2(x+1));
        printf("x %ld, x+1 %ld, log2_a(x+1) %d\n",x,x+1,Log2_a(x+1));
        printf("x %ld, x+1 %ld, log2_b(x+1) %d\n",x,x+1,Log2_b(x+1));
        printf("x %ld, x+1 %ld, log2_c(x+1) %d\n",x,x+1,Log2_c(x+1));
    }
    return(0);
}
Hals answered 17/10, 2015 at 17:13 Comment(0)
P
1

I assume your question is for an integer (called v below) and not an unsigned integer.

int v = 612635685; // whatever value you wish

unsigned int get_msb(int v)
{
    int r = 31;                         // maximum number of iteration until integer has been totally left shifted out, considering that first bit is index 0. Also we could use (sizeof(int)) << 3 - 1 instead of 31 to make it work on any platform.

    while (!(v & 0x80000000) && r--) {   // mask of the highest bit
        v <<= 1;                        // multiply integer by 2.
    }
    return r;                           // will even return -1 if no bit was set, allowing error catch
}

If you want to make it work without taking into account the sign you can add an extra 'v <<= 1;' before the loop (and change r value to 30 accordingly). Please let me know if I forgot anything. I haven't tested it but it should work just fine.

Patronymic answered 24/1, 2018 at 15:55 Comment(3)
v <<= 1 is undefined behavior (UB) when v < 0.Headlight
0x8000000 , maybe you mean an extra 0 there .Gamogenesis
Note that testing if bit 31 of an int32_t variable is 1 can simply use v < 0. No need for a "complicated" v & 0x80000000.Calyx
R
0

My humble method is very simple:

MSB(x) = INT[Log(x) / Log(2)]

Translation: The MSB of x is the integer value of (Log of Base x divided by the Log of Base 2).

This can easily and quickly be adapted to any programming language. Try it on your calculator to see for yourself that it works.

Rattlehead answered 15/6, 2019 at 22:20 Comment(2)
That works if all you're interested is developer efficiency. If you want runtime efficiency, you need alternative algorithm.Antitrades
This can fail due to roundoff error. For example, in CPython 2 and 3, int(math.log((1 << 48) - 1) / math.log(2)) is 48.Lobeline
G
0

Here is a fast solution for C that works in GCC and Clang; ready to be copied and pasted.

#include <limits.h>

unsigned int fls(const unsigned int value)
{
    return (unsigned int)1 << ((sizeof(unsigned int) * CHAR_BIT) - __builtin_clz(value) - 1);
}

unsigned long flsl(const unsigned long value)
{
    return (unsigned long)1 << ((sizeof(unsigned long) * CHAR_BIT) - __builtin_clzl(value) - 1);
}

unsigned long long flsll(const unsigned long long value)
{
    return (unsigned long long)1 << ((sizeof(unsigned long long) * CHAR_BIT) - __builtin_clzll(value) - 1);
}

And a little improved version for C++.

#include <climits>

constexpr unsigned int fls(const unsigned int value)
{
    return (unsigned int)1 << ((sizeof(unsigned int) * CHAR_BIT) - __builtin_clz(value) - 1);
}

constexpr unsigned long fls(const unsigned long value)
{
    return (unsigned long)1 << ((sizeof(unsigned long) * CHAR_BIT) - __builtin_clzl(value) - 1);
}

constexpr unsigned long long fls(const unsigned long long value)
{
    return (unsigned long long)1 << ((sizeof(unsigned long long) * CHAR_BIT) - __builtin_clzll(value) - 1);
}

The code assumes that value won't be 0. If you want to allow 0, you need to modify it.

Grogan answered 14/12, 2019 at 11:53 Comment(0)
T
0

Not too happy with the answers on here, I thought I'd give this one a go myself.

Returns 0 if 0 is passed in or index of the bit + 1, for all other values. Handles signed and unsigned values. Is template for completeness.

template<typename T>
inline T ones_mask(unsigned cnt) {
    return ~(~static_cast<T>(0) << cnt);
}
template<typename T>
unsigned most_significant_one_bit_position(T val) {
    //returns 1's index of most significant one bit position, zero returned for zero
    //finds the closest fitting half interval, near optimal half interval search!!!
    if(val == 0) return 0;
    //do half interval search
    unsigned rslt = sizeof(T) * 8;
    unsigned interval = (sizeof(T) * 8) >> 1;
    unsigned mask_size = (sizeof(T) * 8) >> 1;

    //while halfing the interval is non zero
    do {
        bool found = (ones_mask<T>(mask_size) & val) == val;
        //if we have found a matching mask, record this size as the result
        rslt = (found ? mask_size : rslt);
        //half the interval here
        interval >>= 1;
        //increase or decrease the mask size depending on if we found a match
        mask_size += (found ? -interval : interval);
    } while(interval > 0);
    return rslt;
}

A slightly more optimal version, that doesn't make a ones mask, can be found below (still not C code).

#include <type_traits>

template<typename T>
unsigned most_significant_one_bit_position(T val) {
    using unsigned_type = std::make_unsigned_t<T>;
    //returns 1's index of most significant one bit position, zero returned for zero
    //finds the closest fitting half interval, near optimal half interval search!!!
    if(val == 0) return 0;
    auto nval = static_cast<unsigned_type>(val);
    //do half interval search
    unsigned rslt = sizeof(T) * 8;
    unsigned interval = (sizeof(T) * 8) >> 1;
    unsigned mask_size = (sizeof(T) * 8) >> 1;

    //while halfing the interval is non zero
    do {
        bool found = (nval >> mask_size) == 0;
        //if we have found a matching mask, record this size as the result
        rslt = (found ? mask_size : rslt);
        //half the interval here
        interval >>= 1;
        //increase or decrease the mask size depending on if we found a match
        mask_size += (found ? -interval : interval);
    } while(interval > 0);
    return rslt;
}

For the sake of completeness, I decided to make a C version:

unsigned most_significant_one_bit_position_inner(unsigned long long nval, unsigned szeof) {
    //returns 1's index of most significant one bit position, zero returned for zero
    //finds the closest fitting half interval, near optimal half interval search!!!
    if(nval == 0) return 0;
    //do half interval search
    unsigned rslt = szeof * 8;
    unsigned interval = (szeof * 8) >> 1;
    unsigned mask_size = (szeof * 8) >> 1;

    //while halfing the interval is non zero
    do {
        char found = (nval >> mask_size) == 0;
        //if we have found a matching mask, record this size as the result
        rslt = (found != 0 ? mask_size : rslt);
        //half the interval here
        interval >>= 1;
        //increase or decrease the mask size depending on if we found a match
        mask_size += (found != 0 ? -interval : interval);
    } while(interval > 0);
    return rslt;
}

unsigned most_significant_one_bit_position_uchar(unsigned char val) {
    return most_significant_one_bit_position_inner(val, sizeof(unsigned char));
}
unsigned most_significant_one_bit_position_ushort(unsigned short val) {
    return most_significant_one_bit_position_inner(val, sizeof(unsigned short));
}
unsigned most_significant_one_bit_position_uint(unsigned val) {
    return most_significant_one_bit_position_inner(val, sizeof(unsigned));
}
unsigned most_significant_one_bit_position_ulong(unsigned long val) {
    return most_significant_one_bit_position_inner(val, sizeof(unsigned long));
}
unsigned most_significant_one_bit_position_ulonglong(unsigned long long val) {
    return most_significant_one_bit_position_inner(val, sizeof(unsigned long long));
}

unsigned most_significant_one_bit_position_schar(signed char val) {
    unsigned char nval = (unsigned char)val;
    return most_significant_one_bit_position_inner(nval, sizeof(signed char));
}
unsigned most_significant_one_bit_position_short(short val) {
    unsigned short nval = (unsigned short)val;
    return most_significant_one_bit_position_inner(nval, sizeof(short));
}
unsigned most_significant_one_bit_position_int(int val) {
    unsigned int nval = (unsigned int)val;
    return most_significant_one_bit_position_inner(nval, sizeof(int));
}
unsigned most_significant_one_bit_position_long(long val) {
    unsigned long nval = (unsigned long)val;
    return most_significant_one_bit_position_inner(nval, sizeof(long));
}
unsigned most_significant_one_bit_position_longlong(long long val) {
    unsigned long long nval = (unsigned long long)val;
    return most_significant_one_bit_position_inner(nval, sizeof(long long));
}

NOTE haven't tested (any)!!

Tercet answered 21/3, 2023 at 17:51 Comment(6)
I cannot compile this with my C compiler :)Mikkanen
Fair point will need to convert from C++ to C, if you need. Just replace "T" with the desired unsigned type. Or for a bigger challenge use macros instead (on second thoughts don't use macros). :)Tercet
Handles signed and unsigned values: it should be noted that for signed types at least as large as int, ~(~static_cast<T>(0) << cnt); has undefined behavior until C++20 and the equivalent c code ~(~(T)0 << cnt) has undefined behavior even for C23.Mikkanen
Is that so? Compiler says nothing.Tercet
gcc 12.2 gives this warning for the C translation with typedef int T: <source>: In function 'ones_mask': <source>:4:20: warning: left shift of negative value [-Wshift-negative-value] 4 | return ~(~(T)0 << cnt); | ^~ Mikkanen
C doesn't have function overloading.Bellow
F
0

I know I am massively late to the party, but I just came across this question and I wanted to share my template solution for getting the highest bit that is set in the argument as a value and not the exponent, i.e. not the number 4 but the value 2^4 = 16 if bit 4 is the highest bit that is set. This function can be used for any numeric type:

template<typename T>                                                            
T top_bit(T n)                                                                  
{                                                                               
    T ret;                                                                      
    do {                                                                        
        ret = n;                                                                
        n &= n - 1;                                                             
    } while (n);                                                                
    return ret;                                                                 
}  

It works by remembering the last value and eliminating the lowest bit until no bit is left. Thus, the last value is the one with the highest bit set. An argument of zero will return 0.

Filia answered 27/7, 2023 at 16:4 Comment(0)
R
0

For a 4 bit group n, the value can be computed in 3 instructions using a 32bit unsigned value as a lookup table with 16 entries of two bits each:

(0xffffaa50 >> (n << 1)) & 3

This can be used to trim the end of the divide and conquer methods.

size_t blog32(uint32_t v) {
    size_t r = 0, t;
    t = (0 != (v >> 16)) << 4; v >>= t; r |= t;
    t = (0 != (v >>  8)) << 3; v >>= t; r |= t;
    t = (0 != (v >>  4)) << 2; v >>= t; r |= t;
    return r + ((0xffffaa50 >> (v << 1)) & 3);
}

For 64bit, add one more level:

size_t blog64(uint64_t v) {
    size_t r = 0, t;
    t = (0 != (v >> 32)) << 5; v >>= t; r |= t;
    t = (0 != (v >> 16)) << 4; v >>= t; r |= t;
    t = (0 != (v >>  8)) << 3; v >>= t; r |= t;
    t = (0 != (v >>  4)) << 2; v >>= t; r |= t;
    return r + ((0xffffaa50 >> (v << 1)) & 3);
}
Rigveda answered 8/10, 2023 at 18:56 Comment(0)
L
0

C++ 20: std::bit_width()

std::cout << std::bit_width(10u); // prints 4 as of 00001010b

Note that it accepts only unsigned types. Deduct 1 to get the zero based position to answer your question.

Leonilaleonine answered 26/3 at 4:1 Comment(1)
this question is tagged CApis

© 2022 - 2024 — McMap. All rights reserved.