How to count leading zeros in a 32 bit unsigned integer [closed]
Asked Answered
C

3

22

Could anyone tell me please what is an efficient algorithm to count the number of leading zeroes in a 32-bit unsigned integer in C programming?

Claar answered 25/5, 2014 at 14:57 Comment(20)
Count the trailing non-0s and substract the result from the maximum possible digits of a 32bit integer.Jalopy
@Jalopy Try 0...0101. 29 leading zeros, 1 trailing one, 29 != 32-1.Gulgee
What's the problem? @delnan? Okok, my proposal what worded unprecise.Jalopy
Count the number of trailing digits with and to the right of the most left non-0 digit and substract ...Jalopy
Or just do abs(log10(INT_MAX)) - abs(log10(x)).Jalopy
What means "best"? Do you means "fastest"?Bellaude
@Jalopy Are you working under the assumption that we're talking about decimal zeros? Usually this problem is posed, partly because it doesn't make a lot of sense in any other base and partly because that information isn't useful (whereas the number of leading bits can be used for some smart tricks).Gulgee
The abs in my above comment should have been (int) .... S-)Jalopy
See graphics.stanford.edu/~seander/bithacks.html for a collection of tricks. In your particular case, remember that the number of leading zeroes and the position of the leftmost one can be easily computed from each other.Integrated
There are already answers in this question. The most common solution is still in Stanford's Bit Twiddling Hacks pageShepley
Define "best". What are the criteria?Clathrate
What about int leadingZeros = tableOfLeadingZeroCounts[theNumber];? Just one operation (not counting the operations needed to build the 4g entry array).Clathrate
@HotLicks Aside from taking ridiculous amounts of memory, it's also probably slower than most solutions here since most accesses will be a cache miss which takes a hundred cycles or longer. Even a naive bit-by-bit loop will take fewer cycles.Gulgee
another duplicate #672315Shepley
@delnan - It all depends on your definition of "best".Clathrate
@HotLicks It's not even simpler to implement, since you need one of the other solutions to populate the table. So under what definition of "best" is it a decent approach? For most purposes, whether it is best under some criteria is also irrelevant because using 4 GiB of memory immediately disqualifies it even if memory use is of no real concern.Gulgee
@delnan - You don't need one of the other solutions -- you just manually examine each possible value and initialize the table. (And you realize, of course, I was being facetious -- just illustrating that unless "best" is somehow defined the question is meaningless.)Clathrate
This is a non-trivial task. As an example, see GMP's code to do it at longlong.h. It consists of 2200 lines of code with multiple branches depending on the cpu and its characteristics. The file does not use one line of inline ASM.Crookback
@Crookback Re: "The file does not use one line of inline ASM". I see that it has many __asm__.Salto
@Crookback - No. You are misunderstanding what is going on in this source. This is a portability header for accomplishing this task for quad-word sized integers on a variety of systems that do not support that integer width. It has little or nothing to do with OP's question. (We're not accepting answers here any more, but the best way to do this if you are using gcc is __builtin_clz(x) - it will compile on most modern processors to a single instruction for all integer widths.)Worriment
E
26

This discussion assumes that your compiler either doesn't support the operation or that it doesn't produce good enough assembly. Note that both of these are unlikely nowadays so I'd recommend just using __builtin_clz for gcc or equivalent on your compiler.

Note that determining which is the "best" clz algo can only be done by you. Modern processors are complicated beasts and the performance of these algorithm will heavily depend on the platform you run it on, the data you throw at it and the code that uses it. The only way to be sure is to measure, measure and measure some more. If you can't tell the difference then you're probably not looking at your bottleneck and your time will be better spent elsewhere.

Now that the boring disclaimers are out of the way, let's take a look at what Hacker's Delight has to say about the problem. A quick survey shows that all algorithms rely on a binary search of some description. Here's a straightforward example:

int n = 32;
unsigned y;

y = x >>16; if (y != 0) { n = n -16; x = y; }
y = x >> 8; if (y != 0) { n = n - 8; x = y; }
y = x >> 4; if (y != 0) { n = n - 4; x = y; }
y = x >> 2; if (y != 0) { n = n - 2; x = y; }
y = x >> 1; if (y != 0) return n - 2;
return n - x;

Note that this works on 32 ints and that it can also be transformed into an iterative version if needed. Unfortunately that solution doesn't have a whole lot of instruction-level parallelism and has quite a few branches which doesn't make for a very good bit twiddling algo. Note that a branch free version of the above code exists but it's much more verbose so I won't reproduce here.

So let's improve on the solution by using the pop instruction (counts the number of bits):

x = x | (x >> 1);
x = x | (x >> 2);
x = x | (x >> 4);
x = x | (x >> 8);
x = x | (x >>16);
return pop(~x);

So how does this work? The key is the pop(~x) instruction at the end which counts the numbers of zeros in x. For the count of zeros to be meaningful we first need to get rid of all 0s which aren't leading. We do this by right propagating the 1s using a binary algorithm. While we still don't have much instruction-level parallelism, we did get rid of all the branches and it uses fewer cycles then the previous solution. Much better.

So how about that pop instruction, isn't that cheating? Most architecture have a 1 cycle pop instruction which can be accessed via compiler builtins (eg. gcc's __builtin_pop). Otherwise there exist table based solutions but care must be taken when trading off cycles for cache accesses even if the table is kept entirely in the L1 cache.

Finally, as is usual for hacker's delight, we start wandering in strange territories. Let's count us some leading zeroes using floating point numbers:

union {
    unsigned asInt[2];
    double asDouble;
};
asDouble = (double)k + 0.5;
return 1054 - (asInt[LE] >> 20);

First off, a little warning: DON'T USE THIS ALGORITHM. This triggers undefined behaviour as far as the standard is concerned. This was reproduce for the fun factor more then any practical uses. Use at your own peril.

Now that the disclaimers are out of the way, how does it work? It first converts the int into a double and goes on to extract the exponent component of the double. Neat stuff. The LE constant should be 1 if executed on a little-endian machine and 0 on a big-endian machine.

This should give you a brief survey of various bit twiddling algorithms for this problem. Note that the book has several variations of these which makes various tradeoffs but I'll let you discover those on your own.

Ers answered 25/5, 2014 at 15:43 Comment(2)
With this one ,you can ignore endian of the machine. int clz(uint32_t x) { union { double ddd; int64_t uu; } u; u.ddd = x + 0.5; return 1054 - (int)(u.uu >> 52); }Lackadaisical
Sadly, hackersdelight.org seems to be no more, and the domain was taken over by spammers. Googling did find some pdf copies on line.Felty
C
24

This is probably the optimal way to do it in pure C:

int clz(uint32_t x)
{
    static const char debruijn32[32] = {
        0, 31, 9, 30, 3, 8, 13, 29, 2, 5, 7, 21, 12, 24, 28, 19,
        1, 10, 4, 14, 6, 22, 25, 20, 11, 15, 23, 26, 16, 27, 17, 18
    };
    x |= x>>1;
    x |= x>>2;
    x |= x>>4;
    x |= x>>8;
    x |= x>>16;
    x++;
    return debruijn32[x*0x076be629>>27];
}

One limitation: as written, it does not support an input of zero (where the result should be 32). If all your inputs are less than 0x80000000, you can support zero with no additional cost by changing the first value in the table to 32. Otherwise, just add a line at the beginning:

    if (!x) return 32;
Campanology answered 26/5, 2014 at 3:2 Comment(4)
For the record, Hacker's Delight also contains this algorithm along with an explanation of how and why it works. I was just too lazy to reproduce the entire table :)Ers
Is my table the same as theirs? I produced it manually by inverting the table I use for ctz to function as clz instead.Campanology
Actually there's 2 of em. The first is Harley's which uses a bigger table size (64), doesn't have the increment and uses a different multiplier (0x06EB14F9) and shift op (26). The second, is Goryavsky who actually derived several variations that have various trade-offs (smaller table size, better ILP, etc).Ers
Legal aspect: do you allow to use your clz in commercial software? Asking because the content contributed on or after 2018-05-02 (UTC) is distributed under the terms of CC BY-SA 4.0 (link). And CC BY-SA 4.0 may have (compatibility) issues with licenses for commercial / proprietary software. If yes, then under what conditions?Salto
P
-3

Let's count the number of digits which are not leading zeros. After that we just do (32 - n). First, if the number is zero, n is zero. Otherwise:

n = 1 + floor(log2(x))

That is, we use the base-two logarithm to find out in what position is the most significant non-zero bit. We can do this efficiently on x86 using the FYL2X instruction, which computes log2.

But now that we're talking about x86 instructions, we may as well look at what's really available. Here it is! http://en.wikipedia.org/wiki/Find_first_set - you can see there that there are plenty of instructions that directly do what you want--if you are willing to write assembly or at least confirm that your optimizing compiler generates these instructions for you given some carefully written C code.

Pb answered 25/5, 2014 at 15:11 Comment(5)
OP specifically asked for the best algorithm in C, not in x86 asm.Campanology
"Efficiently" and "fyl2x" don't fit in the same sentence. It's one of the slowest instructions by far.Firkin
Why would you go for something this arcane (and slow - x87) when you could make use of bsr or lzcnt on newer architectures?Patrick
@BrettHale: I linked to the Wikipedia page about bsr. Of course that's what should be used. I discuss this in my last paragraph.Pb
@John Zwinck, thank you so much for all the information. we can also use ceil(log2(x+1)) to find the number of binary digits in x. right ? now, does it take constant time to compute ? in case of signed integer, what should i do to count the number of binary digits for a given input ? thank you so much again for your cooperation :)Claar

© 2022 - 2024 — McMap. All rights reserved.