Could anyone tell me please what is an efficient algorithm to count the number of leading zeroes in a 32-bit unsigned integer in C programming?
This discussion assumes that your compiler either doesn't support the operation or that it doesn't produce good enough assembly. Note that both of these are unlikely nowadays so I'd recommend just using __builtin_clz
for gcc or equivalent on your compiler.
Note that determining which is the "best" clz algo can only be done by you. Modern processors are complicated beasts and the performance of these algorithm will heavily depend on the platform you run it on, the data you throw at it and the code that uses it. The only way to be sure is to measure, measure and measure some more. If you can't tell the difference then you're probably not looking at your bottleneck and your time will be better spent elsewhere.
Now that the boring disclaimers are out of the way, let's take a look at what Hacker's Delight has to say about the problem. A quick survey shows that all algorithms rely on a binary search of some description. Here's a straightforward example:
int n = 32;
unsigned y;
y = x >>16; if (y != 0) { n = n -16; x = y; }
y = x >> 8; if (y != 0) { n = n - 8; x = y; }
y = x >> 4; if (y != 0) { n = n - 4; x = y; }
y = x >> 2; if (y != 0) { n = n - 2; x = y; }
y = x >> 1; if (y != 0) return n - 2;
return n - x;
Note that this works on 32 ints and that it can also be transformed into an iterative version if needed. Unfortunately that solution doesn't have a whole lot of instruction-level parallelism and has quite a few branches which doesn't make for a very good bit twiddling algo. Note that a branch free version of the above code exists but it's much more verbose so I won't reproduce here.
So let's improve on the solution by using the pop instruction (counts the number of bits):
x = x | (x >> 1);
x = x | (x >> 2);
x = x | (x >> 4);
x = x | (x >> 8);
x = x | (x >>16);
return pop(~x);
So how does this work? The key is the pop(~x)
instruction at the end which counts the numbers of zeros in x
. For the count of zeros to be meaningful we first need to get rid of all 0s which aren't leading. We do this by right propagating the 1s using a binary algorithm. While we still don't have much instruction-level parallelism, we did get rid of all the branches and it uses fewer cycles then the previous solution. Much better.
So how about that pop instruction, isn't that cheating? Most architecture have a 1 cycle pop instruction which can be accessed via compiler builtins (eg. gcc's __builtin_pop
). Otherwise there exist table based solutions but care must be taken when trading off cycles for cache accesses even if the table is kept entirely in the L1 cache.
Finally, as is usual for hacker's delight, we start wandering in strange territories. Let's count us some leading zeroes using floating point numbers:
union {
unsigned asInt[2];
double asDouble;
};
asDouble = (double)k + 0.5;
return 1054 - (asInt[LE] >> 20);
First off, a little warning: DON'T USE THIS ALGORITHM. This triggers undefined behaviour as far as the standard is concerned. This was reproduce for the fun factor more then any practical uses. Use at your own peril.
Now that the disclaimers are out of the way, how does it work? It first converts the int into a double and goes on to extract the exponent component of the double. Neat stuff. The LE constant should be 1 if executed on a little-endian machine and 0 on a big-endian machine.
This should give you a brief survey of various bit twiddling algorithms for this problem. Note that the book has several variations of these which makes various tradeoffs but I'll let you discover those on your own.
This is probably the optimal way to do it in pure C:
int clz(uint32_t x)
{
static const char debruijn32[32] = {
0, 31, 9, 30, 3, 8, 13, 29, 2, 5, 7, 21, 12, 24, 28, 19,
1, 10, 4, 14, 6, 22, 25, 20, 11, 15, 23, 26, 16, 27, 17, 18
};
x |= x>>1;
x |= x>>2;
x |= x>>4;
x |= x>>8;
x |= x>>16;
x++;
return debruijn32[x*0x076be629>>27];
}
One limitation: as written, it does not support an input of zero (where the result should be 32). If all your inputs are less than 0x80000000
, you can support zero with no additional cost by changing the first value in the table to 32. Otherwise, just add a line at the beginning:
if (!x) return 32;
clz
in commercial software? Asking because the content contributed on or after 2018-05-02 (UTC) is distributed under the terms of CC BY-SA 4.0 (link). And CC BY-SA 4.0 may have (compatibility) issues with licenses for commercial / proprietary software. If yes, then under what conditions? –
Salto Let's count the number of digits which are not leading zeros. After that we just do (32 - n). First, if the number is zero, n is zero. Otherwise:
n = 1 + floor(log2(x))
That is, we use the base-two logarithm to find out in what position is the most significant non-zero bit. We can do this efficiently on x86 using the FYL2X instruction, which computes log2.
But now that we're talking about x86 instructions, we may as well look at what's really available. Here it is! http://en.wikipedia.org/wiki/Find_first_set - you can see there that there are plenty of instructions that directly do what you want--if you are willing to write assembly or at least confirm that your optimizing compiler generates these instructions for you given some carefully written C code.
bsr
or lzcnt
on newer architectures? –
Patrick © 2022 - 2024 — McMap. All rights reserved.
0
s and substract the result from the maximum possible digits of a 32bit integer. – Jalopy0...0101
. 29 leading zeros, 1 trailing one, 29 != 32-1. – Gulgee0
digit and substract ... – Jalopyabs(log10(INT_MAX)) - abs(log10(x))
. – Jalopyabs
in my above comment should have been(int)
.... S-) – Jalopyint leadingZeros = tableOfLeadingZeroCounts[theNumber];
? Just one operation (not counting the operations needed to build the 4g entry array). – Clathratelonglong.h
. It consists of 2200 lines of code with multiple branches depending on the cpu and its characteristics. The file does not use one line of inline ASM. – Crookback__asm__
. – Salto