Find most significant bit (left-most) that is set in a bit array

Asked 6/4, 2010 at 23:42 Answered 30/5, 2019 at 18:55

I have a bit array implementation where the 0th index is the MSB of the first byte in an array, the 8th index is the MSB of the second byte, etc...

What's a fast way to find the first bit that is set in this bit array? All the related solutions I have looked up find the first least significant bit, but I need the first most significant one. So, given 0x00A1, I want 8 (since it's the 9th bit from the left).

Monopoly answered 6/4, 2010 at 23:42 Comment(7)

Isn't bit 7 the most significant bit set in 0x00a1 (assuming the lsb is bit 0)? – Academia 6/4, 2010 at 23:50

Is your bit array of arbitrary length, or does it fit into a machine word? – Academia 6/4, 2010 at 23:51

I was counting from the left. In binary I get "0000|0000|1010|0001", so that's the 9th bit, with index 8. i did make a mistake though, it should be 8, not 9. – Monopoly 6/4, 2010 at 23:53

What interface do you have to your bit array? What are the operations you can perform on it? – Terceira 6/4, 2010 at 23:53

@jemfinch: it's a C array of chars – Monopoly 7/4, 2010 at 1:26

There's another page with details already... stackoverflow.com/questions/671815/… – Fiche 8/7, 2011 at 14:24

Finding the most significant set bit is equivalent to integer binary logarithm – Sheep 7/4, 2018 at 15:6

GCC has __builtin_clz that translates to BSR on x86/x64, CLZ on ARM, etc. and emulates the instruction if the hardware does not implement it.
Visual C++ 2005 and up has _BitScanReverse.

Goatsucker answered 7/4, 2010 at 0:19 Comment(9)

Watch out for undefined behavior when the argument is 0. – Brusquerie 21/6, 2017 at 23:35

Yes. And in this case, "undefined behavior" means "returns a nondeterministically random number." – Friction 24/11, 2017 at 22:4

@Friction Or it may enter infinite loop, scanning for the nonexistent 1. Nothing prevents the compiler from doing anything when its spec/manual says "undefined behavior". – Wickliffe 3/10, 2018 at 21:44

@minmaxavg: __builtin_clz with an input of 0 is not C/C++ "Undefined Behaviour". The documentation says "the result is undefined", not the behaviour. Knowing how GCC works, and the x86 reason why that caveat is present, I'm sure they don't mean UB. Specifically on x86, it's whatever value was in the destination register before the instruction ran. (The asm instruction leaves the destination unmodified for input=0. Intel documents it as an undefined value.) see: VS: unexpected optimization behavior with _BitScanReverse64 intrinsic for details. – Cognizant 19/2, 2019 at 5:6

As you can see from my example below, __builtin_clz(0) returns 0x9ab07060 on my machine. That means that any practical use of __builtin_clz(0) requires a comparison against 0 as a sanity check on the inputs. That in turn means that any practical use of __builtin_clz cannot be branchless. – Friction 13/6, 2019 at 1:24

__builtin_clz(x) & -!!x is branchless – Hatchway 31/10, 2019 at 20:26

Functions __builtin_clz and __builtin_clzl appear to be available in Clang (checking on v9.0.1). – Unthrone 24/3, 2020 at 14:21

@johnwbyrd: sometimes you can do __builtin_clz(x | 1) to unconditionally set the last bit that clz might look at. But you're right that you can't usefully branch on the integer result of clz, only on the input, if 0 is possible. The intrinsic doesn't expose the boolean FLAGS result of the x86 asm bsr instruction for you to check that way. (Unlike most instructions, BSR sets ZF based on its input being zero. This might be part of why it's slower than tzcnt on AMD). – Cognizant 12/6, 2020 at 16:38

phernost: Surprisingly, __builtin_clz(x) & -!!x is not guaranteed to be branchless. To understand why, you have to see how LLVM and other compilers generate code for the ! operator. The formal specification for the ! operator says: "The result of the logical negation operator ! is 0 if the value of its operand compares unequal to 0, 1 if the value of its operand compares equal to 0. The result has type int. The expression !E is equivalent to (0==E)." Try entering the expression -!!x into godbolt.org for various compilers, and you will see that the generated code is not always branchless. – Friction 7/8, 2020 at 19:30

tl:dr; For 32 bits, use de Bruijn multiplication.

It's the "fastest" portable algorithm. It is substantially faster and more correct than all the other portable 32-bit MSB algorithms in this thread.

The de Bruijn algorithm also returns a correct result when the input is zero. The __builtin_clz and _BitScanReverse instructions return incorrect results when the input is zero.

On Windows x86-64, de Bruijn multiplication runs at a speed comparable to the equivalent (flawed) Windows function, with a performance difference of only around 3%.

Here's the code.

u32 msbDeBruijn32( u32 v )
{
    static const int MultiplyDeBruijnBitPosition[32] =
    {
        0, 9, 1, 10, 13, 21, 2, 29, 11, 14, 16, 18, 22, 25, 3, 30,
        8, 12, 20, 28, 15, 17, 24, 7, 19, 27, 23, 6, 26, 5, 4, 31
    };

    v |= v >> 1; // first round down to one less than a power of 2
    v |= v >> 2;
    v |= v >> 4;
    v |= v >> 8;
    v |= v >> 16;

    return MultiplyDeBruijnBitPosition[( u32 )( v * 0x07C4ACDDU ) >> 27];
}

All the other answers in this thread either run much more poorly than their authors suggest, or don't calculate the result correctly, or both. Let's benchmark them all, and let's verify that they do what they claim to do.

Here's a simple C++11 harness to test all these implementations. It compiles clean on Visual Studio but should work on all modern compilers. It allows you to run the benchmark in performance mode (bVerifyResults = false) and in checking mode (bVerifyResults = true).

Here are the results in verification mode:

Verification failed for msbNative64: input was 0; output was 818af060; expected 0
Verification failed for msbFfs: input was 22df; output was 0; expected d
Verification failed for msbPerformanceJunkie32: input was 0; output was ffffffff; expected 0
Verification failed for msbNative32: input was 0; output was 9ab07060; expected 0

The "performance junkie" and the Microsoft native implementations do different things when the input is zero. msbPerformanceJunkie32 produces -1, and Microsoft's _BitScanReverse produces a random number, consistent with the underlying hardware instruction. Also the msbPerformanceJunkie32 implementation produces a result that is off by one from all the other answers.

Here are the results in performance mode, running on my i7-4600 laptop, compiled in release mode:

msbLoop64 took 2.56751 seconds               
msbNative64 took 0.222197 seconds            

msbLoop32 took 1.43456 seconds               
msbFfs took 0.525097 seconds                 
msbPerformanceJunkie32 took 1.07939 seconds  
msbDeBruijn32 took 0.224947 seconds          
msbNative32 took 0.218275 seconds

The de Bruijn version beats the other implementations soundly because it is branchless, and therefore it runs well against inputs that produce an evenly distributed set of outputs. All the other versions are slower against arbitrary inputs because of the penalties of branch misprediction on modern CPUs. The smbFfs function produces incorrect results so it can be ignored.

Some of the implementations work on 32 bit inputs, and some work on 64 bit inputs. A template will help us compare apples to apples, regardless of the input size.

Here's the code. Download and run the benchmarks yourself if you like.

#include <iostream>
#include <chrono>
#include <random>
#include <cassert>
#include <string>
#include <limits>

#ifdef _MSC_VER
#define MICROSOFT_COMPILER 1
#include <intrin.h>
#endif // _MSC_VER

const int iterations = 100000000;
bool bVerifyResults = false;
std::random_device rd;
std::default_random_engine re(rd());
typedef unsigned int u32;
typedef unsigned long long u64;

class Timer
{
public:
    Timer() : beg_(clock_::now()) {}
    void reset() {
        beg_ = clock_::now();
    }
    double elapsed() const {
        return std::chrono::duration_cast<second_>
            (clock_::now() - beg_).count();
    }

private:
    typedef std::chrono::high_resolution_clock clock_;
    typedef std::chrono::duration<double, std::ratio<1> > second_;
    std::chrono::time_point<clock_> beg_;
};

unsigned int msbPerformanceJunkie32(u32 x)
{
    static const unsigned int bval[] =
    { 0,1,2,2,3,3,3,3,4,4,4,4,4,4,4,4 };
    unsigned int r = 0;
    if (x & 0xFFFF0000) {
        r += 16 / 1;
        x >>= 16 / 1;
    }
    if (x & 0x0000FF00) {
        r += 16 / 2;
        x >>= 16 / 2;
    }
    if (x & 0x000000F0) {
        r += 16 / 4;
        x >>= 16 / 4;
    }
    return r + bval[x];
}

#define FFS(t)  \
{ \
register int n = 0; \
if (!(0xffff & t)) \
n += 16; \
if (!((0xff << n) & t)) \
n += 8; \
if (!((0xf << n) & t)) \
n += 4; \
if (!((0x3 << n) & t)) \
n += 2; \
if (!((0x1 << n) & t)) \
n += 1; \
return n; \
}

unsigned int msbFfs32(u32 x)
{
    FFS(x);
}

unsigned int msbLoop32(u32 x)
{
    int r = 0;
    if (x < 1) return 0;
    while (x >>= 1) r++;
    return r;
}

unsigned int msbLoop64(u64 x)
{
    int r = 0;
    if (x < 1) return 0;
    while (x >>= 1) r++;
    return r;
}

u32 msbDeBruijn32(u32 v)
{
    static const int MultiplyDeBruijnBitPosition[32] =
    {
        0, 9, 1, 10, 13, 21, 2, 29, 11, 14, 16, 18, 22, 25, 3, 30,
        8, 12, 20, 28, 15, 17, 24, 7, 19, 27, 23, 6, 26, 5, 4, 31
    };

    v |= v >> 1; // first round down to one less than a power of 2
    v |= v >> 2;
    v |= v >> 4;
    v |= v >> 8;
    v |= v >> 16;

    return MultiplyDeBruijnBitPosition[(u32)(v * 0x07C4ACDDU) >> 27];
}

#ifdef MICROSOFT_COMPILER
u32 msbNative32(u32 val)
{
    unsigned long result;
    _BitScanReverse(&result, val);
    return result;
}
u32 msbNative64(u64 val)
{
    unsigned long result;
    _BitScanReverse64(&result, val);
    return result;
}
#endif // MICROSOFT_COMPILER

template <typename InputType>
void test(unsigned int msbFunc(InputType),
    const std::string &name,
    const std::vector< InputType > &inputs,
    std::vector< unsigned int > &results,
    bool bIsReference = false
)
{
    if (bIsReference)
    {
        int i = 0;
        for (int i = 0; i < iterations; i++)
            results[i] = msbFunc(inputs[i]);
    }
    InputType result;
    if (bVerifyResults)
    {
        bool bNotified = false;
        for (int i = 0; i < iterations; i++)
        {
            result = msbFunc(inputs[i]);
            if ((result != results[i]) && !bNotified)
            {
                std::cout << "Verification failed for " << name << ": "
                    << "input was " << std::hex << inputs[i]
                    << "; output was " << result
                    << "; expected " << results[i]
                    << std::endl;
                bNotified = true;
            }
        }
    }
    else
    {
        Timer t;
        for (int i = 0; i < iterations; i++)
        {
            result = msbFunc(inputs[i]);
        }
        double elapsed = t.elapsed();
        if ( !bIsReference )
            std::cout << name << " took " << elapsed << " seconds" << std::endl;
        if (result == -1.0f)
            std::cout << "this comparison only exists to keep the compiler from " <<
            "optimizing out the benchmark; this branch will never be called";
    }
}

void main()
{
    std::uniform_int_distribution <u64> dist64(0,
        std::numeric_limits< u64 >::max());
    std::uniform_int_distribution <u32> shift64(0, 63);
    std::vector< u64 > inputs64;
    for (int i = 0; i < iterations; i++)
    {
        inputs64.push_back(dist64(re) >> shift64(re));
    }
    std::vector< u32 > results64;
    results64.resize(iterations);

    test< u64 >(msbLoop64, "msbLoop64", inputs64, results64, true);
    test< u64 >(msbLoop64, "msbLoop64", inputs64, results64, false);
#ifdef MICROSOFT_COMPILER
    test< u64 >(msbNative64, "msbNative64", inputs64, results64, false);
#endif // MICROSOFT_COMPILER
    std::cout << std::endl;

    std::uniform_int_distribution <u32> dist32(0,
        std::numeric_limits< u32 >::max());
    std::uniform_int_distribution <u32> shift32(0, 31);
    std::vector< u32 > inputs32;
    for (int i = 0; i < iterations; i++)
        inputs32.push_back(dist32(re) >> shift32(re));
    std::vector< u32 > results32;
    results32.resize(iterations);


    test< u32 >(msbLoop32, "msbLoop32", inputs32, results32, true);

    test< u32 >(msbLoop32, "msbLoop32", inputs32, results32, false);
    test< u32 >(msbFfs32, "msbFfs", inputs32, results32, false);
    test< u32 >(msbPerformanceJunkie32, "msbPerformanceJunkie32",
        inputs32, results32, false);
    test< u32 >(msbDeBruijn32, "msbDeBruijn32", inputs32, results32, false);
#ifdef MICROSOFT_COMPILER
    test< u32 >(msbNative32, "msbNative32", inputs32, results32, false);
#endif // MICROSOFT_COMPILER
}

Friction answered 30/7, 2015 at 7:52 Comment(16)

Nice work, but you're currently including the initialisation work done by msbLoop32 in its timing, meaning it appears twice as slow as it really is. – Fennec 22/11, 2017 at 18:39

I'm also interested to know how you can get away with multiplying by a v that is one less than a power of 2. The linked PDF only explains why the multiplication corresponds to a shift when it is a power of 2, so I would have thought adding 1 would be necessary. – Fennec 22/11, 2017 at 18:44

What initialization work do you perceive msbLoop32 to be doing? – Friction 24/11, 2017 at 20:25

The test() instantiation for msbLoop32 (and also for msbLoop64, I notice now) are each called with bIsReference set to true, and Timer t is defined before this initialisation step (which begins with if (bIsReference), so it includes this initialisation in the duration it measures. – Fennec 24/11, 2017 at 20:34

Thanks for those comments. I've changed the code so that reference comparisons are no longer benchmarked, and the timer is now started and stopped more correctly. The benchmarks changed trivially but the high-level results remain the same; updated benchmarks are above. Feel free to improve the answer further. – Friction 24/11, 2017 at 21:10

Thanks. I'm still interested to know how you got away with using one less than a power of 2 in the multiplication with the magic constant! – Fennec 24/11, 2017 at 21:36

stackoverflow.com/questions/7365562/… – Friction 24/11, 2017 at 22:1

It may help to think of all those shifts as "fill all the bits lower than the most significant bit with ones." – Friction 26/11, 2017 at 1:26

Thanks for the link! The top-voted answer there just does exhaustive search to find the "modified de Bruijn" constant; this might take centuries for, say, 64-bit integers. In contrast, "ordinary de Bruijn" constants can be found in milliseconds by looking for an Eulerian cycle in the de Bruijn graph of dimension n-1. (I've left a comment on the answer there.) So it remains a mystery as to whether the original author of the "modified de Bruijn" code you presented here used exhaustive search to find that constant, or some clever insight... – Fennec 26/11, 2017 at 15:26

I assumed it was a brute force search. That would explain why there is no comparable perfect de Bruijn number for a 64-bit integer MSB search (as of this writing). I postulate without proof that it may be possible to find some 8, 9, or 10 bit de Bruijn tables that would do the job for the 64-bit MSB case. – Friction 28/11, 2017 at 3:7

This benchmark doesn't pass the smell test: doesn't it seem weird that 11 shifts + OR operations, a multiply and a table lookup is almost exactly the same speed (within measurement error, pretty much) as the "native" solutions that compile down to a single bsr or lzcnt instruction? What happens is that as long as the tested routine is "fast enough" the benchmark just tests the performance of a dense loop of indirect/call branch. If you fix the benchmark, you'll find that the native solutions are, in their raw form, 3 to 10 times faster than the deBruijn solutions. – Lordan 5/10, 2018 at 17:10

Yes, the native solutions may give the wrong answers for an all-zero input, but if this is possible in your use case you can fix it with an additional check or a bit of math, which still leaves it much faster. The problem with the "all zeros" input was based on the actual x86 bsr and bsf instructions, which have an undefined result for zero input, but newer CPUs have tzcnt and lzcnt instructions which solve this (although for lzcnt the sense of the answer is reversed so it's not a drop-in replacement). It can be annoying to get the compiler to emit these, however. – Lordan 5/10, 2018 at 17:14

This is great, thanks! I noticed (after some months of using the code!) that the comment that says "round down to one less than a power of 2" should say "round up to one less than a power of 2": it turns on bits (by ORing) so the value increases. – Isometropia 30/11, 2018 at 18:47

BeeOnRope: Way too much armchair benchmarking in this thread. Show us your code. – Friction 2/12, 2018 at 23:11

Why should in the input of zero give an output of zero? Bit 0 isn't set. Asking for the least significant bit when the number is zero doesn't make sense so a method isn't wrong if it gives something else for zero. – Farrahfarrand 20/8, 2019 at 10:33

me': by that logic, dividing by zero should return a random number. – Friction 23/8, 2019 at 0:33

As a performance junkie I have tried a ton of variations for MSB set, the following is the fastest I have come across,

unsigned int msb32(unsigned int x)
{
    static const unsigned int bval[] =
    {0,1,2,2,3,3,3,3,4,4,4,4,4,4,4,4};

    unsigned int r = 0;
    if (x & 0xFFFF0000) { r += 16/1; x >>= 16/1; }
    if (x & 0x0000FF00) { r += 16/2; x >>= 16/2; }
    if (x & 0x000000F0) { r += 16/4; x >>= 16/4; }
    return r + bval[x];
}

Kindhearted answered 23/4, 2012 at 1:14 Comment(2)

This code is about four times slower than de Bruijn multiplication, across randomly distributed inputs. Additionally, this code produces a result that is off by one from the other answers; namely, msb( 1 ) == 1, unlike the other definitions, for which msb( 1 ) == 0. – Friction 29/5, 2017 at 20:49

That's one of the defects of StackOverflow and other "most popular answer wins" type sites. The top answer is always the answer that Everyman thinks is right. But Everyman is not always right. Crowd wisdom is no substitute for benchmarking. – Friction 2/8, 2017 at 1:30

There are multiple ways to do this, and the relative performance of the different implementations is somewhat machine-dependent (I happen to have benchmarked this to some extent for a similar purpose). On some machines there's even a built-in instruction for this (use one if available and portability can be dealt with).

Check out some implementations here (under “integer log base 2”). If you are using GCC, check out the functions __builtin_clz and __builtin_clzl (which do this for non-zero unsigned ints and unsigned longs, respectively). The “clz” stands for “count leading zeros”, which is yet another way to describe the same problem.

Of course, if your bit array does not fit into a suitable machine word, you need to iterate over words in the array to find the first non-zero word and then perform this calculation only on that word.

Berke answered 6/4, 2010 at 23:59 Comment(1)

+1 for pointing out that __builtin_clz and __builtin_clzl are undefined for 0 inputs (as backed up by the GCC documentation). – Fennec 22/11, 2017 at 17:49

Look up the BSR (Bit scan reverse) x86 asm instruction for the fastest way to do this. From Intel's doc: Searches the source operand (second operand) for the most significant set bit (1 bit). If a most significant 1 bit is found, its bit index is stored in the destination operand (first operand).

Montiel answered 6/4, 2010 at 23:48 Comment(0)

http://graphics.stanford.edu/~seander/bithacks.html#IntegerLogObvious

Palanquin answered 6/4, 2010 at 23:58 Comment(1)

Heh, I have the exact same URL, #IntegerLogObvious included, in my answer. – Berke 7/4, 2010 at 0:4

I have worked with a number of functions to get the most significant bit, but problems generally arise moving between 32 and 64 bit numbers or moving between x86_64 and x86 boxes. The functions __builtin_clz, __builtin_clzl and __builtin_clzll work well for 32/64 bit numbers and across x86_64 and x86 machines. However, three functions are required. I have found a simple MSB that relies on right-shift that will handle all cases for positive numbers. At least for the use I make of it, it has succeeded where others have failed:

int
getmsb (unsigned long long x)
{
    int r = 0;
    if (x < 1) return 0;
    while (x >>= 1) r++;
    return r;
}

By designating input as unsigned long long it can handle all number classes from unsigned char to unsigned long long and given the standard definition, it is compatible across x86_64 and x86 builds. The case for 0 is defined to return 0, but can be changed as required. A simple test and output are:

int
main (int argc, char *argv[]) {

    unsigned char c0 = 0;
    unsigned char c = 216;
    unsigned short s = 1021;
    unsigned int ui = 32768;
    unsigned long ul = 3297381253;
    unsigned long long ull = 323543844043;

    int i = 32767;

    printf ("  %16u  MSB : %d\n", c0, getmsb (c0));
    printf ("  %16u  MSB : %d\n", c, getmsb (c));
    printf ("  %16u  MSB : %d\n", s, getmsb (s));
    printf ("  %16u  MSB : %d\n", i, getmsb (i));
    printf ("  %16u  MSB : %d\n", ui, getmsb (ui));
    printf ("  %16lu  MSB : %d\n", ul, getmsb (ul));
    printf ("  %16llu  MSB : %d\n", ull, getmsb (ull));

    return 0;
}

Output:

             0  MSB : 0
           216  MSB : 7
          1021  MSB : 9
         32767  MSB : 14
         32768  MSB : 15
    3297381253  MSB : 31
  323543844043  MSB : 38

NOTE: for speed considerations, using a single function to accomplish the same thing centered around __builtin_clzll is still faster by a factor of about 6.

Hobbes answered 2/6, 2014 at 8:25 Comment(0)

x86 has a BSR instruction that returns a bit-index (rather than the count of leading zeros above it).

But unfortunately there's no portable intrinsic that efficiently exposes it for all compilers. GNU C provides __builtin_clz, but unsigned bitidx = 31 - __builtin_clz(x); doesn't optimize back to just BSR with current GCC and ICC. (It does with clang, which proves that the expression is equivalent so it could).

The following defines BSR32() and BSR64() macros or functions that compile efficiently to just a bsr instruction on x86. (Producing a garbage result if the input was zero. There's no way with intrinsics to take advantage of the asm instruction's behaviour of leaving the destination unmodified for input=0.)

Portability to non-x86 would take some additional #ifdef e.g. to fall back to 31-__builtin_clz. Most non-x86 ISAs, if they have a leading-zero bitscan at all, count leading zeros instead of giving you the bit-index. That's why GNU C defines __builtin_clz as the portable builtin. (If there's no HW support on the target system, the builtin will compile to software emulation, usually calling a libgcc helper function.)

#include <stdint.h>

// define BSR32() and BSR64()
#if defined(_MSC_VER) || defined(__INTEL_COMPILER)
    #ifdef __INTEL_COMPILER
        typedef unsigned int bsr_idx_t;
    #else
        #include <intrin.h>   // MSVC
        typedef unsigned long bsr_idx_t;
    #endif

    static inline
    unsigned BSR32(unsigned long x){
        bsr_idx_t idx;
        _BitScanReverse(&idx, x); // ignore bool retval
        return idx;
    }
    static inline
    unsigned BSR64(uint64_t x) {
        bsr_idx_t idx;
        _BitScanReverse64(&idx, x); // ignore bool retval
        return idx;
    }
#elif defined(__GNUC__)

  #ifdef __clang__
    static inline unsigned BSR64(uint64_t x) {
        return 63-__builtin_clzll(x);
      // gcc/ICC can't optimize this back to just BSR, but clang can and doesn't provide alternate intrinsics
    }
  #else
    #define BSR64 __builtin_ia32_bsrdi
  #endif

    #include <x86intrin.h>
    #define BSR32(x) _bit_scan_reverse(x)

#endif

bsf probably doesn't need as much help for compilers, because the builtin matches the asm instruction's behaviour of returning the bit-index of the LSB, i.e. the count of trailing zeros.

A test caller unsigned test32(unsigned x) { return BSR32(x); } inlines it to 1 instruction on all the major x86 compilers, on the Godbolt compiler explorer. BSR64 inlines the same way, to a 64-bit operand-size version. See also Is there an x86/x86_64 instruction which zeros all bits below the Most Significant Bit? for example use-cases.

;; x64 MSVC 19.16 -O2
unsigned int test32(unsigned int) PROC                                    ; test32, COMDAT
        bsr     eax, ecx
        ret     0
unsigned int test32(unsigned int) ENDP                                    ; test32

# clang -O3 -march=haswell   is too "smart?" for its own good:
test32(unsigned int):
        lzcnt   eax, edi
        xor     eax, 31
        ret

# gcc8.2 -O3 -march=haswell
test32(unsigned int):
        bsr     eax, edi
        ret

# ICC19 -O3 -march=haswell
test32(unsigned int):
        bsr       eax, edi                                      #15.9
        ret                                                     #41.12

The point of this is to avoid slow code from the portable (to non-MSVC) version:

#ifdef __GNUC__
unsigned badgcc(uint64_t x) {
    return 63 - __builtin_clzll(x);
}
#endif

Without -march=haswell we get just BSR from clang, but:

# gcc8.2 -O3
badgcc(unsigned long):
        bsr     rdi, rdi
        mov     eax, 63
        xor     rdi, 63
        sub     eax, edi
        ret

# ICC19.0.1 -O3
badgcc(unsigned long):
        mov       rax, -1                                       #46.17
        bsr       rdx, rdi                                      #46.17
        cmove     rdx, rax                                      #46.17
        neg       rdx                                           #46.17
        add       rdx, 63                                       #46.17
        neg       edx                                           #46.17
        add       edx, 63                                       #46.17
        mov       eax, edx                                      #46.17
        ret                                                     #46.17

That's just nasty. (Interesting to see that ICC is doing a CMOV to produce -1 if the input is zero. BSR sets ZF according to its input, unlike most instructions which set flags according to the result.)

With -march=haswell (or otherwise enabling use of BMI1 instructions), it's not as bad, but still not as good as just BSR. Modulo output dependencies, which compilers mostly work to avoid for lzcnt but strangely not for BSR. (Where the output dependency is a true dependency, because of the input=0 behaviour.) Why does breaking the "output dependency" of LZCNT matter?

Cognizant answered 19/2, 2019 at 6:36 Comment(1)

Update on this: clang8.0 seems to have a regression here, not optimizing away the XOR flipping for 63 - __builtin_clzll() – Cognizant 25/8, 2019 at 1:41

If you're using x86, you can beat practically any byte-by-byte or word-by-word solution using the SSE2 operations, combined with the find-first-bit instructions, which (in the gcc world) are pronounced "ffs" for the lowest bit and "fls" for the highest bit. Pardon me for having trouble (!@#$%^) formatting "C" code in an answer; check out: http://mischasan.wordpress.com/2011/11/03/sse2-bit-trick-ffsfls-for-xmm-registers/

Antimonic answered 5/11, 2011 at 0:48 Comment(0)

Two best ways I know to do this in pure C:

First linear-search the byte/word array to find the first byte/word that's nonzero, then do an unrolled binary-search of the byte/word you find.

if (b>=0x10)
  if (b>=0x40)
    if (b>=0x80) return 0;
    else return 1;
  else
    if (b>=0x20) return 2;
    else return 3;
else
  if (b>=0x4)
    if (b>=0x8) return 4;
    else return 5;
  else
    if (b>=0x2) return 6;
    else return 7;

3 (BTW that's log2(8)) conditional jumps to get the answer. On modern x86 machines the last one will be optimized to a conditional mov.

Alternatively, use a lookup table to map the byte to the index of the first bit that's set.

A related topic you might want to look up is integer log2 functions. If I recall, ffmpeg has a nice implementation.

Edit: You can actually make the above binary search into a branchless binary search, but I'm not sure if it would be more efficient in this case...

Surprint answered 8/7, 2010 at 6:42 Comment(0)

Not the fastest, but it works...

//// C program
#include <math.h>

#define POS_OF_HIGHESTBIT(a) /* 0th position is the Least-Signif-Bit */    \
((unsigned) log2(a))         /* thus: do not use if a <= 0 */  

#define NUM_OF_HIGHESTBIT(a) ((!(a))          \
        ? 0 /* no msb set*/                   \
        : (1 << POS_OF_HIGHESTBIT(a) ))
// could be changed and optimized, if it is known that the following NEVER holds: a <= 0



int main()
{
  unsigned a = 5; // 0b101
  unsigned b = NUM_OF_HIGHESTBIT(a); // 4 since 4 = 0b100
  return 0; 
}

Albin answered 28/6, 2011 at 15:41 Comment(0)

Here's a code snippet explaining __builtin_clz()

////// go.c ////////
#include <stdio.h>

unsigned NUM_BITS_U = ((sizeof(unsigned) << 3) - 1);
#define POS_OF_HIGHESTBITclz(a) (NUM_BITS_U - __builtin_clz(a)) /* only works for a != 0 */

#define NUM_OF_HIGHESTBITclz(a) ((a)                                \
                             ? (1U << POS_OF_HIGHESTBITclz(a))      \
                             : 0)


int main()
{
  unsigned ui;

  for (ui = 0U; ui < 18U; ++ui)
    printf("%i \t %i\n", ui, NUM_OF_HIGHESTBITclz(ui));

  return 0;
}

Casing answered 7/7, 2011 at 22:22 Comment(0)

I'll add one!

typedef unsigned long long u64;
typedef unsigned int       u32;
typedef unsigned char      u8;


u8 findMostSignificantBit (u64 u64Val)
{
  u8 u8Shift;
  u8 u8Bit = 0;

  assert (u64Val != 0ULL);

  for (u8Shift = 32 ; u8Shift != 0 ; u8Shift >>= 1)
  {
    u64 u64Temp = u64Val >> u8Shift;
    if (u64Temp)
    {
      u8Bit |= u8Shift; // notice not using +=
      u64Val = u64Temp;
    }
  }

  return u8Bit;
}

Of course, this is working on a 64 bit number (unsigned long long), and not an array. Also, plenty of people have pointed to inbuilt g++ functions I was not aware of. How interesting.

Anyhow, this finds the most significant bit in 6 iterations and gives an assert if you passed 0 to the function. Not the best function to use if you have access to an instruction of the chipset.

I also am also using |= instead of += because these are always powers of two, and OR is (classically) faster than addition. Since I'm only adding unique powers of 2 together, I never have roll over.

This is a binary search which means it always finds the result in 6 iterations.

Again, this is better:

u8 findMostSignificantBit2 (u64 u64Val)
{
  assert (u64Val != 0ULL);

  return (u8) (__builtin_ctzll(u64Val));
}

Gorski answered 18/7, 2017 at 7:1 Comment(0)

Here's a simple, brute force algorithm for an arbitrary-sized array of bytes:

int msb( unsigned char x);  // prototype for function that returns 
                            //  most significant bit set

unsigned char* p;

for (p = arr + num_elements; p != arr;) {
    --p;
    if (*p != 0) break;
}

// p is with pointing to the last byte that has a bit set, or
//  it's pointing to the first byte in the array

if (*p) {
    return ((p - arr) * 8) + msb( *p);
}

// what do you want to return if no bits are set?
return -1;

I'll leave it as a an exercise for the reader to come up with an appropriate msb() function as well as the optimization to work on int or long long sized chinks of data.

Academia answered 7/4, 2010 at 0:3 Comment(0)

Um, your tag indicates 32bit but it looks like the values that you're using are 16 bit. If you did mean 32 bit, then I think the answer for 0x00a1 ought to be 24 and not 8.

Assuming that you are looking for the MSB bit index from the left hand side and you know that you will only be dealing with uint32_t's, here's the obvious, simple-minded algorithm:

#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>

int main()
{
    uint32_t test_value = 0x00a1;
    int i;

    for (i=0; i<32; ++i)
    {
        if (test_value & (0x80000000 >> i))
        {
            printf("i = %d\n", i);
            exit(0);
        }
    }

    return 0;
}

Reclaim answered 7/4, 2010 at 0:9 Comment(0)

For java I use this:

static public final int msb(int n) {
    n |= n >>> 1;  
    n |= n >>> 2; 
    n |= n >>> 4; 
    n |= n >>> 8; 
    n |= n >>> 16; 
    n >>>= 1;
    n += 1; 
    return n;
}

And:

static public final int msb_index(int n) {

    final int[] multiply_de_bruijn_bit_position = {
        0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8, 
        31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9
    };
    return multiply_de_bruijn_bit_position[(msb(n) * 0x077CB531) >>> 27];
}

Preachy answered 30/5, 2019 at 18:55 Comment(0)

-3

#define FFS(t)  \
({ \
register int n = 0; \
            \ 
if (!(0xffff & t)) \
    n += 16; \
         \
if (!((0xff << n) & t)) \
    n += 8; \
        \
if (!((0xf << n) & t)) \
    n += 4; \
        \
if (!((0x3 << n) & t)) \
    n += 2; \
        \
if (!((0x1 << n) & t)) \
    n += 1; \
        \
n; \
})

Assassinate answered 27/3, 2013 at 3:16 Comment(3)

t should probably be in parentheses here if it's a macro. or better yet put it in a local variable also so it doesn't always get computed. – Monopoly 27/3, 2013 at 3:47

it just uses binary search, I agree with your comments Claudiu, but I think there should be a more efficient way to get the result, and without use clz bsr similar instructions – Assassinate 27/3, 2013 at 7:16

This is a random number generator, not a binary search. – Friction 20/2, 2017 at 1:25

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Recommended topics

Hot tags