Set the leading zero bits in any size integer C++
Asked Answered
B

4

26

I want to set the leading zero bits in any size integer to 1 in standard C++.

eg.

0001 0011 0101 1111 -> 1111 0011 0101 1111

All the algorithms I've found to do this require a rather expensive leading zero count. That's odd, though, because there are very fast and easy ways to do other kinds of bit manipulation, such as:

 int y = -x & x; //Extracts lowest set bit, 1110 0101 -> 0000 0001

 int y = (x + 1) & x; //Will clear the trailing ones, 1110 0101 -> 1110 0100

 int y = (x - 1) | x; //Will set the trailing zeros, 0110 0100 -> 0110 0111

So that makes me think there must be a way to set the leading zeros of an integer in one simple line of code consisting of basic bitwise operators. Please tell me there's hope, because right now I'm settling for reversing the bit order of my integer, using the fast trick for setting trailing zeros, and then reversing the integer again to get the leading zeros set to ones. That is actually significantly faster than using a leading zero count, but still quite slow compared with the other operations above.

 template<typename T>
 inline constexpr void reverse(T& x)
 {
    T rev = 0;
    size_t s = sizeof(T) * CHAR_BIT;

    while(s > 0)
    {
        rev = (rev << 1) | (x & 0x01);
        x >>= 1;
        s -= 1uz;
    }//End while

    x = rev;
 }

 
 template<typename T>
 inline constexpr void set_leading_zeros(T& x)
 {

     reverse(x);

     x = (x - 1) | x;//Set trailing 0s to 1s
     
     reverse(x);
 }

Edit

Because some asked: I'm working with MS-DOS running on CPUs ranging from early X86 to a 486DX installed in older CNC machines. Fun times. :D

Bulbiferous answered 18/9, 2022 at 6:3 Comment(4)
There are faster ways to reverse a number.Yahiya
@Bulbiferous whatever solution you end up with, you can use it to clear the leading ones as well: clear_leading_ones(x) = ~set_leading_zeroes(~x)Marris
"rather expensive leading zero count": all modern architectures have a very cheap instruction to get the leading zero countAutolycus
you talked about "standard C++" but which standard? C++11/14/17/20 or 23...? It's also kinda weird because I don't see any modern C++ compilers for MS-DOSAutolycus
M
32

The leading zeroes can be set without counting them, while also avoiding reversing the integer. For convenience I won't do it for a generic integer type T, but likely it can be adapted, or you could use template specialization.

First calculate the mask of all the bits that aren't the leading zeroes, by "spreading" the bits downwards:

uint64_t m = x | (x >> 1);
m |= m >> 2;
m |= m >> 4;
m |= m >> 8;
m |= m >> 16;
m |= m >> 32;

Then set all the bits that the mask doesn't cover:

return x | ~m;

Bonus: this automatically works even when x = 0 and when x has all bits set. In a count-leading-zeros approach, one of those two cases usually leads to an overly large shift amount (which one depends on the details, but one of them is almost always troublesome, since there are 65 distinct counts but only 64 valid shift amounts, if we're talking about uint64_t).
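
The answer mentions this can likely be adapted to a generic integer type. A minimal sketch of one such adaptation (mine, not from the answer; it assumes C++17 for std::is_unsigned_v), looping over power-of-two shift amounts so the same code covers any unsigned width:

#include <climits>
#include <type_traits>

template<typename T>
constexpr T set_leading_zeros(T x)
{
    static_assert(std::is_unsigned_v<T>, "use an unsigned type");
    T m = x;
    for (unsigned s = 1; s < sizeof(T) * CHAR_BIT; s *= 2)
        m |= m >> s;   // spread the highest set bit into every lower position
    return x | ~m;     // set everything the mask doesn't cover, i.e. the leading zeros
}

A compiler can fully unroll the loop since the trip count depends only on sizeof(T), so for uint64_t this compiles down to the six shift/OR steps shown above.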

Marris answered 18/9, 2022 at 7:6 Comment(8)
@dave_thenerd: For comparison, www2.math.uni-wuppertal.de/~fpf/Uebungen/GdR-SS02/opcode_i.html shows instruction timings for Pentium and earlier. (It doesn't take into account instruction-fetch costs, which is the main bottleneck for 8086 and especially 8088, not execution. See this answer for more about 8086 / 8088 performance.) bsr r32,r32 on 386 costs 10+3n cycles, so it's not fixed latency!!! Still slow even on Pentium, 7 to 71 cycles, depending on number of leading zeros I guess. BSF is faster than BSR on 486, but still 6 to 42c.Redletter
@dave_thenerd: Given a leading-zero count, shl/sar reg, cl or imm take 3 cycles each on 386 and 486, but it looks like the bithack is faster. Probably even if you need to do extended-precision shifting (like shrd/shr) for uint64_t on 32-bit CPUs; shrd reg, reg, imm or cl is only 3 cycles on 386 and 486. (Actually only 2 cycles for shrd reg, reg, imm on 486. On Pentium, it's 4 cycles, vs. 1 cycle for regular shifts, pairable in the U pipe.)Redletter
It would be nice to be able to see what m looks like at every step.Monodrama
How do you know when to stop shifting m? For the given example 0001 0011 0101 1111, after m |= m >> 2 I had the correct bitmask for return x | ~m to return the correct value.Bring
@Bring you could detect that m is one less than a power of two (add one, then use one of the famous tricks to test whether an integer is a power of two) and then stop, but I expect that won't be worth doing. That's a significant amount of computation (relative to how cheap a couple of shifts and bitwise ORs are) and then a potentially hard to predict branch.Marris
I guess I was confused since there were five shifts back to back, and I took that literally (i.e. execute all 6 lines of code regardless of the value of x and m).Bring
@Bring actually I intended it to be used like that: executing all 6 steps unconditionally (or 5 steps, for 32-bit numbers). It's possible to early-exit but it does not look worthwhile to me.Marris
Oh, I see, it still works even if you do all 6 steps, but the rigamarole of figuring out whether to exit early would be more difficult than just running all 6 steps anyway so there's no point.Bring
C
5

You could count leading zeroes using std::countl_zero and create a bitmask that your bitwise OR with the original value:

#include <bit>
#include <climits>
#include <type_traits>

template<class T>
requires std::is_unsigned_v<T>
T leading_ones(T v) {
    auto lz = std::countl_zero(v);
    return lz ? v | ~T{} << (CHAR_BIT * sizeof v - lz) : v;
}

If you have a std::uint16_t, like

0b0001001101011111

then ~T{} is 0b1111111111111111, CHAR_BIT * sizeof v is 16 and countl_zero(v) is 3. Left shift 0b1111111111111111 16-3 steps:

0b1110000000000000

Bitwise OR with the original:

  0b0001001101011111
| 0b1110000000000000
--------------------
= 0b1111001101011111
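
For reference, a quick usage sketch (mine, not part of the answer) that reproduces the example above; it needs a C++20 compiler since std::countl_zero and the requires-clause are C++20 features:

#include <bit>
#include <bitset>
#include <climits>
#include <cstdint>
#include <iostream>
#include <type_traits>

template<class T>
requires std::is_unsigned_v<T>
T leading_ones(T v) {                     // exactly as defined in the answer
    auto lz = std::countl_zero(v);
    return lz ? v | ~T{} << (CHAR_BIT * sizeof v - lz) : v;
}

int main() {
    std::uint16_t v = 0b0001'0011'0101'1111;
    std::cout << std::bitset<16>(leading_ones(v)) << '\n';   // prints 1111001101011111
}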
Cowage answered 18/9, 2022 at 6:21 Comment(13)
Thanks, but uhh, I said in the question that I know about leading zero count. I'm trying to avoid it as it is very slow.Bulbiferous
@Bulbiferous Did you try countl_zero? I expect it to be pretty quick.Cowage
You may need to specify the architecture to your compiler; the lzcnt instruction may not be enabled by default. (-march=native with gcc for instance)Yahiya
@MarcGlisse Yes, for an optimal implementation. With -march=native the countl_zero function produces a lzcnt instruction on x86_64. Without -march=native, one bsr + an xor. I expect even the non-native one to be pretty quick.Cowage
I'm working with MS-DOS running on a CNC machine... the latest CPU we use is a 486DX... good times! :DBulbiferous
@Bulbiferous What assembler code does it produce for the 486DX? I still expect it to be rather quick.Cowage
486 does have the bitscan instructions, but as a microcoded bit-by-bit loop, taking time proportional to how many bits need to be scannedMarris
@harold Ok. Then I expect that countl_zero will use that if there isn't a quicker way to do it.Cowage
@TedLyngmo: Yes, a modern compiler would use bsr instead of a bithack (graphics.stanford.edu/~seander/bithacks.html#IntegerLog), because bsr is fast on modern x86 (1 cycle latency & throughput on P6 and later, not bad on AMD; but surprisingly still slow in P5 Pentium (www2.math.uni-wuppertal.de/~fpf/Uebungen/GdR-SS02/opcode_i.html) with BSR surprisingly even slower than BSF). Harold's answer generating a mask directly (instead of a count) is even more efficient than the bithack which is somewhat equivalent to a binary search for the top bit position.Redletter
@PeterCordes Nice! Good thing OP settled for Harold's answer :-)Cowage
With efficient shifts (BMI2, or non-Intel such as AMD), it's maybe better to do (x << lz) >> lz to sign-extend. Except if lz is the type width; if you need to handle that, generating a mask is probably more efficient. Unfortunately shl/sar reg, cl costs 3 uops on Sandybridge-family (because of x86 legacy baggage where shifts don't set FLAGS if the count happens to be zero), so you need BMI2 shlx / sarx for it to be better than bsr ecx, dsr / mov tmp, -1 / not ecx / shl tmp, cl / or dst,reg... or whatever that has to be.Redletter
godbolt.org/z/hKohn8W8a has that idea, which indeed is great if we don't need to handle x==0. Also an idea with BMI2 bzhi, if we're considering what's efficient with BMI2 available. Like x | ~ _bzhi_u32(-1, 32-lz); Unfortunately requires two inversions, the 32-lzcnt and the ~. We have BMI1 andn, but not an equivalent orn. And we can't just use neg because bzhi doesn't mask the count; that's the whole point, it has unique behaviour for 33 different inputs. Will probably post these as an answer tomorrow.Redletter
@PeterCordes Very nice! Looking forward to it in an answer too!Cowage
A
3

Your reverse is extremely slow! With an N-bit int you need N iterations to reverse, each at least 6 instructions, then at least 2 instructions to set the trailing bits, and finally N more iterations to reverse the value again. OTOH even the simplest leading zero count needs only N iterations, after which you can set the leading bits directly:

#include <climits>
#include <cstddef>
#include <type_traits>

template<typename T>
inline constexpr T trivial_ilog2(T x) // Slow, don't use this
{
    if (x == 0) return 0;

    std::size_t c{};
    while(x) // counts the significant bits, i.e. position of the highest set bit + 1
    {
        x >>= 1;
        c += 1u;
    }

    return c;
}

template<typename T>
inline constexpr T set_leading_zeros(T x)
{
    if (std::make_unsigned_t<T>(x) >> (sizeof(T) * CHAR_BIT - 1)) // top bit is set
        return x;
    return x | (-T(1) << trivial_ilog2(x)); // set every bit above the highest set bit
}

x = set_leading_zeros(x);

There are many other ways to count leading zeros / get the integer logarithm much faster. One of them works in power-of-two steps, just like the mask creation in harold's answer.

But since you're targeting a specific platform rather than writing something cross-platform, and you want to squeeze out every bit of performance, there's little reason to stick to pure standard features: these use cases typically need platform-specific code. If intrinsics are available you should use them; modern C++ has std::countl_zero, but each compiler also has intrinsics that map to the best instruction sequence for its target, for example _BitScanReverse or __builtin_clz.
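
As a rough sketch of what such a wrapper could look like (the function names and the dispatch are mine; each branch uses only the compiler's documented intrinsic):

#include <climits>
#include <cstdint>
#if defined(_MSC_VER)
#include <intrin.h>
#endif

// Leading-zero count of a nonzero 32-bit value, using whichever intrinsic is available.
inline unsigned clz32_nonzero(std::uint32_t x)
{
#if defined(__GNUC__) || defined(__clang__)
    return static_cast<unsigned>(__builtin_clz(x));       // undefined behaviour for x == 0
#elif defined(_MSC_VER)
    unsigned long idx;
    _BitScanReverse(&idx, x);                             // idx = position of the highest set bit
    return 31u - idx;
#else
    unsigned n = 0;                                       // portable fallback
    for (std::uint32_t m = 0x80000000u; m && !(x & m); m >>= 1) ++n;
    return n;
#endif
}

inline std::uint32_t set_leading_zeros32(std::uint32_t x)
{
    if (x == 0) return 0xFFFFFFFFu;                       // every bit is a leading zero
    unsigned lz = clz32_nonzero(x);
    return lz ? x | (0xFFFFFFFFu << (32u - lz)) : x;      // avoid a shift by 32 when lz == 0
}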

If intrinsics aren't available, or if the performance is still not enough, then try a lookup table. For example, here's a solution with a 256-element log table:

static const char LogTable256[256] = 
{
#define LT(n) n, n, n, n, n, n, n, n, n, n, n, n, n, n, n, n
    -1, 0, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3,
    LT(4), LT(5), LT(5), LT(6), LT(6), LT(6), LT(6),
    LT(7), LT(7), LT(7), LT(7), LT(7), LT(7), LT(7), LT(7)
};

uint16_t lut_ilog2_16(uint16_t x)
{
    uint8_t h = x >> 8;
    if (h) return LogTable256[h] + 8;
    else return LogTable256[x & 0xFF];
}

In set_leading_zeros, just call lut_ilog2_16 instead of trivial_ilog2, as in the sketch below.
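
For concreteness, a minimal sketch of that combination (my wiring, not from the answer; it assumes the LogTable256 / lut_ilog2_16 definitions above are in scope). Note that lut_ilog2_16 returns the position of the highest set bit, so OR-ing a mask that starts at that position is harmless because that bit is already set in x; the x == 0 case needs its own check since the table holds -1 at index 0:

#include <cstdint>

std::uint16_t set_leading_zeros_lut16(std::uint16_t x)
{
    if (x == 0)
        return 0xFFFFu;                                               // every bit is a leading zero
    return x | static_cast<std::uint16_t>(0xFFFFu << lut_ilog2_16(x));
}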

An even better solution than a log table is a mask table, so that you can look up the mask directly instead of deriving it from LogTable256[x]:

static const unsigned char MaskTable256[256] =
{
    0xFF, 0xFE, 0xFC...
};
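
If you don't want to type out all 256 entries by hand, such a table can also be generated at compile time. A sketch (names and approach are mine, assuming C++17), followed by a 16-bit set_leading_zeros built on it:

#include <array>
#include <cstdint>

// Entry i is the 8-bit leading-zeros mask of i: 0xFF for 0, 0xFE for 1, 0xFC for 2-3, 0xF8 for 4-7, ...
constexpr std::array<std::uint8_t, 256> make_mask_table()
{
    std::array<std::uint8_t, 256> t{};
    for (unsigned i = 0; i < 256; ++i)
    {
        unsigned m = i;
        m |= m >> 1; m |= m >> 2; m |= m >> 4;        // spread the highest set bit downwards
        t[i] = static_cast<std::uint8_t>(~m);         // complement = the leading-zero positions
    }
    return t;
}

constexpr auto MaskTable256 = make_mask_table();      // replaces the hand-written table above

std::uint16_t set_leading_zeros_mask16(std::uint16_t x)
{
    std::uint8_t h = static_cast<std::uint8_t>(x >> 8);
    if (h) return static_cast<std::uint16_t>(x | (MaskTable256[h] << 8));       // leading zeros are in the high byte only
    return static_cast<std::uint16_t>(x | 0xFF00u | MaskTable256[x & 0xFF]);    // whole high byte plus the low byte's mask
}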

Some other notes:

  • 1uz isn't a valid suffix in C++ prior to C++23
  • Don't use references for small types that fit in a single integer register. That's not necessary and is usually slower when the call isn't inlined. Just return the result from the function and assign it back.
Autolycus answered 18/9, 2022 at 11:50 Comment(1)
@TedLyngmo that's even weirder. I've never seen a C++11 compiler for DOS, let alone C++23Autolycus
R
3

(Work in progress, power just went out here; posting now to save my work.)

Crusty old x86 CPUs have very slow C++20 std::countl_zero / GNU C __builtin_clz (386 bsr = Bit Scan Reverse actually finds the position of the highest set bit, i.e. 31-clz, and is weird for an input of 0, so you need to branch on that). For CPUs before Pentium Pro / Pentium II, Harold's answer is your best bet, generating a mask directly instead of a count.

(Before 386, shifting by large counts might be better done with partial register shenanigans like mov al, ah / mov ah, 0 instead of shr ax, 8, since 286 and earlier didn't have a barrel shifter for constant-time shifts. But in C++, that's something for the compiler to figure out. Shift by 16 is free since a 32-bit integer can only be kept in a pair of 16-bit registers on 286 or earlier.)

  • 8086 to 286 - no instruction available.

  • 386: bsf/bsr: 10+3n cycles. Worst-case: 10+3*31 = 103c

  • 486: bsf (16 or 32-bit registers): 6-42 cycles; bsr 7-104 cycles (1 cycle less for 16-bit regs).

  • P5 Pentium: bsf: 6-42 cycles (6-34 for 16-bit); bsr 7-71 cycles. (or 7-39 for 16-bit). Non-pairable.

  • Intel P6 and later: bsf/bsr: 1 uop with 1 cycle throughput, 3 cycle latency. (PPro / PII and later).

  • AMD K7/K8/K10/Bulldozer/Zen: bsf/bsr are slowish for a modern CPU. e.g. K10 3 cycle throughput, 4 cycle latency, 6 / 7 m-ops respectively.

  • Intel Haswell / AMD K10 : lzcnt introduced (as part of BMI1 for Intel, or with its own feature bit for AMD, before tzcnt and the rest of BMI1).
    For an input of 0, they return the operand-size, so they fully implement C++20 std::countl_zero / countr_zero respectively, unlike bsr/bsf. (Which leave the destination unmodified on input=0. AMD documents this, Intel implements it in practice on current CPUs at least, but documents the destination register as "undefined" contents. Perhaps some older Intel CPUs are different, otherwise it's just annoying that they don't document the behaviour so software can take advantage.)

    On AMD, they're fast, single uop for lzcnt, with tzcnt taking one more (probably a bit-reverse to feed the lzcnt execution unit), so a nice win vs. bsf/bsr. This is why compilers typically use rep bsf for countr_zero / __builtin_ctz, so it will run as tzcnt on CPUs that support it, but as bsf on older CPUs. They produce the same results for non-zero inputs, unlike bsr/lzcnt.

    On Intel, same fast performance as bsf/bsr, even including the output dependency until Skylake fixed that; it's a true dependency for bsf/bsr, but false dependency for tzcnt/lzcnt and popcnt.


Fast algorithm with a bit-scan building block

But on P6 (Pentium Pro) and later, a bit-scan for the highest set bit is likely to be a useful building block for an even faster strategy than log2(width) shift/or operations, especially for uint64_t on a 64-bit machine. (Or maybe even more so for uint64_t on a 32-bit machine, where each shift would require shifting bits across the gap.)

Cycle counts from https://www2.math.uni-wuppertal.de/~fpf/Uebungen/GdR-SS02/opcode_i.html which has instructions timings for 8088 through Pentium. (But not counting the instruction-fetch bottleneck which usually dominates 8086 and especially 8088 performance.)

bsr (index of highest set bit) is fast on modern x86: 1 cycle throughput on P6 and later, not bad on AMD. On even more recent x86, BMI1 lzcnt is 1 cycle on AMD as well, and avoids an output dependency (on Skylake and newer). Also it works for an input of 0 (producing the type width aka operand size), unlike bsr which leaves the destination register unmodified.

I think the best version of this (if BMI2 is available) is one inspired by Ted Lyngmo's answer, but changed to shift left / right instead of generating a mask. ISO C++ doesn't guarantee that >> is an arithmetic right shift on signed integer types, but all sane compilers choose that as their implementation-defined behaviour. (For example, GNU C documents it.)

https://godbolt.org/z/hKohn8W8a has that idea, which indeed is great if we don't need to handle x==0.

Also an idea with BMI2 bzhi, if we're considering what's efficient with BMI2 available: x | ~_bzhi_u32(-1, 32-lz). Unfortunately that requires two inversions, the 32-lzcnt and the ~. We have BMI1 andn, but not an equivalent orn. And we can't just use neg because bzhi doesn't mask the count; that's the whole point, it has unique behaviour for 33 different inputs.


#include <climits>
#include <cstdint>
#include <immintrin.h>

int set_leading_zeros(int x){
    int lz = __builtin_clz(x|1);                // clamp the lzcount to 31 at most
    int tmp = (x<<lz);                          // shift out leading zeros, leaving a 1 (or 0 if x==0)
    tmp |= 1ULL<<(CHAR_BIT * sizeof(tmp) - 1);  // set the MSB in case x==0
    return tmp>>lz;                             // sign-extend with an arithmetic right shift.
}

uint32_t set_leading_zeros_bmi2(uint32_t x){
    int32_t lz = _lzcnt_u32(x);                 // returns 0 to 32
    uint32_t mask = _bzhi_u32(-1, 32-lz);       // handles all 33 possible values, producing 0 for lz=32
    return x | ~mask;
}

On x86-64 you can combine this with BMI2 shlx / sarx for single-uop variable-count shifts even on Intel CPUs.

With efficient shifts (BMI2, or non-Intel such as AMD), it's maybe better to do (x << lz) >> lz to sign-extend. Except if lz is the type width; if you need to handle that, generating a mask is probably more efficient.

Unfortunately shl/sar reg, cl costs 3 uops on Sandybridge-family (because of x86 legacy baggage where shifts don't set FLAGS if the count happens to be zero), so you need BMI2 shlx / sarx for it to be better than bsr ecx, dsr / mov tmp, -1 / not ecx / shl tmp, cl / or dst,reg.

Redletter answered 24/9, 2022 at 1:9 Comment(2)
todo, finish this and link chessprogramming.org/BitScan#Bsf.2FBsr_x86-64_TimingsRedletter
I just stumbled on this "old" question and noticed that you did come back to write an answer as you hinted you might do. :-) Great!Cowage
