Find nth SET bit in an int
J

16

31

Instead of just the lowest set bit, I want to find the position of the nth lowest set bit. (I'm NOT talking about the value at the nth bit position.)

For example, say I have:
0000 1101 1000 0100 1100 1000 1010 0000

And I want to find the 4th bit that is set. Then I want it to return:
0000 0000 0000 0000 0100 0000 0000 0000

If popcnt(v) < n, it would make sense if this function returned 0, but any behavior for this case is acceptable for me.

I'm looking for something faster than a loop if possible.

Jazzy answered 5/10, 2011 at 23:51 Comment(11)
Are you asking for a general method which can be applied to give you a way to calculate the nth lowest bit for any constant n, or do you need it to work for any n given at runtime? Based on the mask-reduce pattern of these kind of hacks, I seriously doubt there's an elegant way to do the latter without a looping construct.Lettuce
Yeah, you supply both v and n at runtime. I also couldn't think of any way to do it without looping. It is hard to divide up the problem, but I'm not convinced it's impossible to beat a loop.Jazzy
There is a solution to the opposite problem offered at the Bit Twiddling Hacks page. Scroll down to "Select the bit position (from the most-significant bit) with the given count (rank)" section. You should be able to re-work it for counting bits in the opposite direction.Byrann
@dasblinkenlight Neato! Though, TBH, I don't really understand how the algorithm works. Let's see if I can figure it out.Opuntia
@PSkocik: they are zero-based, so it's actually the 5th set bit.Trocar
How about a popcount binary search? With a mask for limiting the set of bits considered?Buttonhook
My initial idea was to split the 32-bit number into 4 single-byte chunks, use a 256-byte lookup table to quickly get number of set bits for each byte, and then use these intermediate steps to locate the required position. So I believe the idea from the Bit Twiddling Hacks is (more or less) the same, but without the lookup table and branchless.Trocar
@Groo Thank you.V
@PSkocik Sad that you deleted your comment. It would have been very helpful to others with the same misunderstanding.Opuntia
Adding to Groo's comment: Using a nibble lookup table and VPSHUFB, you can actually do multiple lookups in parallelTacy
probably a little bit related: What is the fastest way to return the positions of all set bits in a 64-bit integer?Jacqulynjactation
J
11

It turns out that it is indeed possible to do this with no loops. It is fastest to precompute the (at least) 8-bit version of this problem. Of course, these tables use up cache space, but there should still be a net speedup in virtually all modern PC scenarios. In this code, n=0 returns the least significant set bit, n=1 the second-to-least, etc.

Solution with __popcnt

There is a solution using the __popcnt intrinsic (you need __popcnt to be extremely fast, or any perf gains over a simple loop solution will be moot; fortunately, most SSE4-era and later processors support it).

// lookup table for sub-problem: 8-bit v
byte PRECOMP[256][8] = { .... } // PRECOMP[v][n] for v < 256 and n < 8

ulong nthSetBit(ulong v, ulong n) {
    ulong p = __popcnt(v & 0xFFFF);
    ulong shift = 0;
    if (p <= n) {
        v >>= 16;
        shift += 16;
        n -= p;
    }
    p = __popcnt(v & 0xFF);
    if (p <= n) {
        shift += 8;
        v >>= 8;
        n -= p;
    }

    if (n >= 8) return 0; // optional safety, in case n > # of set bits
    return PRECOMP[v & 0xFF][n] << shift;
}

This illustrates how the divide and conquer approach works.

General Solution

There is also a solution for "general" architectures, without __popcnt. It can be done by processing in 8-bit chunks. You need one more lookup table that tells you the popcount of a byte:

byte PRECOMP[256][8] = { .... } // PRECOMP[v][n] for v<256 and n < 8
byte POPCNT[256] = { ... } // POPCNT[v] is the number of set bits in v. (v < 256)

ulong nthSetBit(ulong v, ulong n) {
    ulong p = POPCNT[v & 0xFF];
    ulong shift = 0;
    if (p <= n) {
        n -= p;
        v >>= 8;
        shift += 8;
        p = POPCNT[v & 0xFF];
        if (p <= n) {
            n -= p;
            shift += 8;
            v >>= 8;
            p = POPCNT[v & 0xFF];
            if (p <= n) {
                n -= p;
                shift += 8;
                v >>= 8;
            }
        }
    }

    if (n >= 8) return 0; // optional safety, in case n > # of set bits
    return PRECOMP[v & 0xFF][n] << shift;
}

This could, of course, be done with a loop, but the unrolled form is faster and the unusual form of the loop would make it unlikely that the compiler could automatically unroll it for you.
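For reference, the tables are only sketched as { .... } above; one way to fill them at startup (a minimal sketch, not part of the original answer, with PRECOMP[v][n] holding the n-th lowest set bit of the byte v as an isolated bit value, or 0 if v has fewer than n+1 set bits) is:

#include <cstdint>

static uint8_t PRECOMP[256][8];   // PRECOMP[v][n]: n-th lowest set bit of byte v, isolated (0 if absent)
static uint8_t POPCNT[256];       // POPCNT[v]: number of set bits in byte v

static void initTables() {
    for (unsigned v = 0; v < 256; ++v) {
        unsigned count = 0;
        for (unsigned bit = 0; bit < 8; ++bit)
            if (v & (1u << bit))
                PRECOMP[v][count++] = (uint8_t)(1u << bit);
        POPCNT[v] = (uint8_t)count;
        while (count < 8)
            PRECOMP[v][count++] = 0;  // out-of-range n maps to 0
    }
}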

Jazzy answered 6/10, 2011 at 7:54 Comment(4)
Well done! Do you feel like posting some runtime statistics of the different methods in this thread?Mceachern
Is it possible to do this without any conditions?Kayceekaye
A slightly faster algorithm is implemented in it.unimi.dsi.bits.Fast.select.Prussiate
Interestingly, my benchmark indicates the ffsn loop-based answer below is faster than the table-based nthSetBit with __popcnt in this answer.Myrlemyrlene
S
43

Nowadays this is very easy with PDEP from the BMI2 instruction set. Here is a 64-bit version with some examples:

#include <cassert>
#include <cstdint>
#include <x86intrin.h>

inline uint64_t nthset(uint64_t x, unsigned n) {
    return _pdep_u64(1ULL << n, x);
}

int main() {
    assert(nthset(0b0000'1101'1000'0100'1100'1000'1010'0000, 0) ==
                  0b0000'0000'0000'0000'0000'0000'0010'0000);
    assert(nthset(0b0000'1101'1000'0100'1100'1000'1010'0000, 1) ==
                  0b0000'0000'0000'0000'0000'0000'1000'0000);
    assert(nthset(0b0000'1101'1000'0100'1100'1000'1010'0000, 3) ==
                  0b0000'0000'0000'0000'0100'0000'0000'0000);
    assert(nthset(0b0000'1101'1000'0100'1100'1000'1010'0000, 9) ==
                  0b0000'1000'0000'0000'0000'0000'0000'0000);
    assert(nthset(0b0000'1101'1000'0100'1100'1000'1010'0000, 10) ==
                  0b0000'0000'0000'0000'0000'0000'0000'0000);
}

If you just want the (zero-based) index of the nth set bit, add a trailing zero count.

inline unsigned nthset(uint64_t x, unsigned n) {
    return _tzcnt_u64(_pdep_u64(1ULL << n, x));
}
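Note that _pdep_u64 requires BMI2 (compile with -mbmi2 or an -march that includes it). If the code must also build where BMI2 is not available, a common pattern (a minimal sketch of my own, not part of this answer) is to guard on the __BMI2__ macro and fall back to a small loop:

#include <cstdint>
#ifdef __BMI2__
#include <immintrin.h>
#endif

inline uint64_t nthset_portable(uint64_t x, unsigned n) {
#ifdef __BMI2__
    return _pdep_u64(1ULL << n, x);   // one PDEP when BMI2 is available
#else
    for (unsigned i = 0; i < n; ++i)  // otherwise clear the n lowest set bits...
        x &= x - 1;
    return x & (~x + 1);              // ...and isolate the next one (0 if none left)
#endif
}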
Scorecard answered 12/12, 2014 at 23:5 Comment(5)
Cool, so you use the original bitstring for the deposit indices, and then supply a bitstring where exactly the only the nth lower order bit is set.Mceachern
Thanks a lot for this. It's amazing! How did you come by this idea?Zita
Very cool! Perhaps it would be worth while to mention that the correctness of the solution becomes much more apparent when the visual description of PDEP/PEXT is consulted (Can be seen in intel documentation)Celeski
Truly brilliant!Buckler
Note that AMD Ryzen CPUs before Zen 3 support PDEP and PEXT, but emulated in microcode, and it's very slow. Zen 3 has a fast hardware implementation of PDEP and PEXT, as do Intel CPUs.Stung
P
10

v-1 has a zero where v has its least significant "one" bit (and ones in all the bit positions below it), while all more significant bits are the same; so v & (v-1) clears the least significant set bit. This leads to the following function:

int ffsn(unsigned int v, int n) {
   for (int i=0; i<n-1; i++) {
      v &= v-1; // remove the least significant bit
   }
   return v & ~(v-1); // extract the least significant bit
}
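For example, with ffsn as above, the question's sample value checks out (a hypothetical driver of my own, with n counted from 1):

#include <cassert>
#include <cstdio>

// int ffsn(unsigned int v, int n);  // as defined above

int main() {
    unsigned int v = 0x0D84C8A0u;      // 0000 1101 1000 0100 1100 1000 1010 0000
    assert(ffsn(v, 4) == 0x00004000);  // 4th lowest set bit, as in the question
    std::printf("0x%08x\n", (unsigned)ffsn(v, 4));
    return 0;
}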
Postmeridian answered 6/10, 2011 at 0:49 Comment(1)
(what do the fs in ffsn() stand for?) (yours may well be more readable than while (0 < --n))Campagna
B
6

The version from bit-twiddling hacks adapted to this case is, for example,

unsigned int nth_bit_set(uint32_t value, unsigned int n)
{
    const uint32_t  pop2  = (value & 0x55555555u) + ((value >> 1) & 0x55555555u);
    const uint32_t  pop4  = (pop2  & 0x33333333u) + ((pop2  >> 2) & 0x33333333u);
    const uint32_t  pop8  = (pop4  & 0x0f0f0f0fu) + ((pop4  >> 4) & 0x0f0f0f0fu);
    const uint32_t  pop16 = (pop8  & 0x00ff00ffu) + ((pop8  >> 8) & 0x00ff00ffu);
    const uint32_t  pop32 = (pop16 & 0x000000ffu) + ((pop16 >>16) & 0x000000ffu);
    unsigned int    rank  = 0;
    unsigned int    temp;

    if (n++ >= pop32)
        return 32;

    temp = pop16 & 0xffu;
    /* if (n > temp) { n -= temp; rank += 16; } */
    rank += ((temp - n) & 256) >> 4;
    n -= temp & ((temp - n) >> 8);

    temp = (pop8 >> rank) & 0xffu;
    /* if (n > temp) { n -= temp; rank += 8; } */
    rank += ((temp - n) & 256) >> 5;
    n -= temp & ((temp - n) >> 8);

    temp = (pop4 >> rank) & 0x0fu;
    /* if (n > temp) { n -= temp; rank += 4; } */
    rank += ((temp - n) & 256) >> 6;
    n -= temp & ((temp - n) >> 8);

    temp = (pop2 >> rank) & 0x03u;
    /* if (n > temp) { n -= temp; rank += 2; } */
    rank += ((temp - n) & 256) >> 7;
    n -= temp & ((temp - n) >> 8);

    temp = (value >> rank) & 0x01u;
    /* if (n > temp) rank += 1; */
    rank += ((temp - n) & 256) >> 8;

    return rank;
}

which, when compiled in a separate compilation unit, on gcc-5.4.0 using -Wall -O3 -march=native -mtune=native on Intel Core i5-4200u, yields

00400a40 <nth_bit_set>:
  400a40: 89 f9                   mov    %edi,%ecx
  400a42: 89 f8                   mov    %edi,%eax
  400a44: 55                      push   %rbp
  400a45: 40 0f b6 f6             movzbl %sil,%esi
  400a49: d1 e9                   shr    %ecx
  400a4b: 25 55 55 55 55          and    $0x55555555,%eax
  400a50: 53                      push   %rbx
  400a51: 81 e1 55 55 55 55       and    $0x55555555,%ecx
  400a57: 01 c1                   add    %eax,%ecx
  400a59: 41 89 c8                mov    %ecx,%r8d
  400a5c: 89 c8                   mov    %ecx,%eax
  400a5e: 41 c1 e8 02             shr    $0x2,%r8d
  400a62: 25 33 33 33 33          and    $0x33333333,%eax
  400a67: 41 81 e0 33 33 33 33    and    $0x33333333,%r8d
  400a6e: 41 01 c0                add    %eax,%r8d
  400a71: 45 89 c1                mov    %r8d,%r9d
  400a74: 44 89 c0                mov    %r8d,%eax
  400a77: 41 c1 e9 04             shr    $0x4,%r9d
  400a7b: 25 0f 0f 0f 0f          and    $0xf0f0f0f,%eax
  400a80: 41 81 e1 0f 0f 0f 0f    and    $0xf0f0f0f,%r9d
  400a87: 41 01 c1                add    %eax,%r9d
  400a8a: 44 89 c8                mov    %r9d,%eax
  400a8d: 44 89 ca                mov    %r9d,%edx
  400a90: c1 e8 08                shr    $0x8,%eax
  400a93: 81 e2 ff 00 ff 00       and    $0xff00ff,%edx
  400a99: 25 ff 00 ff 00          and    $0xff00ff,%eax
  400a9e: 01 d0                   add    %edx,%eax
  400aa0: 0f b6 d8                movzbl %al,%ebx
  400aa3: c1 e8 10                shr    $0x10,%eax
  400aa6: 0f b6 d0                movzbl %al,%edx
  400aa9: b8 20 00 00 00          mov    $0x20,%eax
  400aae: 01 da                   add    %ebx,%edx
  400ab0: 39 f2                   cmp    %esi,%edx
  400ab2: 77 0c                   ja     400ac0 <nth_bit_set+0x80>
  400ab4: 5b                      pop    %rbx
  400ab5: 5d                      pop    %rbp
  400ab6: c3                      retq   

  400ac0: 83 c6 01                add    $0x1,%esi
  400ac3: 89 dd                   mov    %ebx,%ebp
  400ac5: 29 f5                   sub    %esi,%ebp
  400ac7: 41 89 ea                mov    %ebp,%r10d
  400aca: c1 ed 08                shr    $0x8,%ebp
  400acd: 41 81 e2 00 01 00 00    and    $0x100,%r10d
  400ad4: 21 eb                   and    %ebp,%ebx
  400ad6: 41 c1 ea 04             shr    $0x4,%r10d
  400ada: 29 de                   sub    %ebx,%esi
  400adc: c4 42 2b f7 c9          shrx   %r10d,%r9d,%r9d
  400ae1: 41 0f b6 d9             movzbl %r9b,%ebx
  400ae5: 89 dd                   mov    %ebx,%ebp
  400ae7: 29 f5                   sub    %esi,%ebp
  400ae9: 41 89 e9                mov    %ebp,%r9d
  400aec: 41 81 e1 00 01 00 00    and    $0x100,%r9d
  400af3: 41 c1 e9 05             shr    $0x5,%r9d
  400af7: 47 8d 14 11             lea    (%r9,%r10,1),%r10d
  400afb: 41 89 e9                mov    %ebp,%r9d
  400afe: 41 c1 e9 08             shr    $0x8,%r9d
  400b02: c4 42 2b f7 c0          shrx   %r10d,%r8d,%r8d
  400b07: 41 83 e0 0f             and    $0xf,%r8d
  400b0b: 44 21 cb                and    %r9d,%ebx
  400b0e: 45 89 c3                mov    %r8d,%r11d
  400b11: 29 de                   sub    %ebx,%esi
  400b13: 5b                      pop    %rbx
  400b14: 41 29 f3                sub    %esi,%r11d
  400b17: 5d                      pop    %rbp
  400b18: 44 89 da                mov    %r11d,%edx
  400b1b: 41 c1 eb 08             shr    $0x8,%r11d
  400b1f: 81 e2 00 01 00 00       and    $0x100,%edx
  400b25: 45 21 d8                and    %r11d,%r8d
  400b28: c1 ea 06                shr    $0x6,%edx
  400b2b: 44 29 c6                sub    %r8d,%esi
  400b2e: 46 8d 0c 12             lea    (%rdx,%r10,1),%r9d
  400b32: c4 e2 33 f7 c9          shrx   %r9d,%ecx,%ecx
  400b37: 83 e1 03                and    $0x3,%ecx
  400b3a: 41 89 c8                mov    %ecx,%r8d
  400b3d: 41 29 f0                sub    %esi,%r8d
  400b40: 44 89 c0                mov    %r8d,%eax
  400b43: 41 c1 e8 08             shr    $0x8,%r8d
  400b47: 25 00 01 00 00          and    $0x100,%eax
  400b4c: 44 21 c1                and    %r8d,%ecx
  400b4f: c1 e8 07                shr    $0x7,%eax
  400b52: 29 ce                   sub    %ecx,%esi
  400b54: 42 8d 14 08             lea    (%rax,%r9,1),%edx
  400b58: c4 e2 6b f7 c7          shrx   %edx,%edi,%eax
  400b5d: 83 e0 01                and    $0x1,%eax
  400b60: 29 f0                   sub    %esi,%eax
  400b62: 25 00 01 00 00          and    $0x100,%eax
  400b67: c1 e8 08                shr    $0x8,%eax
  400b6a: 01 d0                   add    %edx,%eax
  400b6c: c3                      retq

When compiled as a separate compilation unit, timing on this machine is difficult, because the actual operation is as fast as calling a do-nothing function (also compiled in a separate compilation unit); essentially, the calculation is done during the latencies associated with the function call.

It seems to be slightly faster than my suggestion of a binary search,

unsigned int nth_bit_set(uint32_t value, unsigned int n)
{
    uint32_t      mask = 0x0000FFFFu;
    unsigned int  size = 16u;
    unsigned int  base = 0u;

    if (n++ >= __builtin_popcount(value))
        return 32;

    while (size > 0) {
        const unsigned int  count = __builtin_popcount(value & mask);
        if (n > count) {
            base += size;
            size >>= 1;
            mask |= mask << size;
        } else {
            size >>= 1;
            mask >>= size;
        }
    }

    return base;
}

where the loop is executed exactly five times, compiling to

00400ba0 <nth_bit_set>:
  400ba0: 83 c6 01                add    $0x1,%esi
  400ba3: 31 c0                   xor    %eax,%eax
  400ba5: b9 10 00 00 00          mov    $0x10,%ecx
  400baa: ba ff ff 00 00          mov    $0xffff,%edx
  400baf: 45 31 db                xor    %r11d,%r11d
  400bb2: 66 0f 1f 44 00 00       nopw   0x0(%rax,%rax,1)
  400bb8: 41 89 c9                mov    %ecx,%r9d
  400bbb: 41 89 f8                mov    %edi,%r8d
  400bbe: 41 d0 e9                shr    %r9b
  400bc1: 41 21 d0                and    %edx,%r8d
  400bc4: c4 62 31 f7 d2          shlx   %r9d,%edx,%r10d
  400bc9: f3 45 0f b8 c0          popcnt %r8d,%r8d
  400bce: 41 09 d2                or     %edx,%r10d
  400bd1: 44 38 c6                cmp    %r8b,%sil
  400bd4: 41 0f 46 cb             cmovbe %r11d,%ecx
  400bd8: c4 e2 33 f7 d2          shrx   %r9d,%edx,%edx
  400bdd: 41 0f 47 d2             cmova  %r10d,%edx
  400be1: 01 c8                   add    %ecx,%eax
  400be3: 44 89 c9                mov    %r9d,%ecx
  400be6: 45 84 c9                test   %r9b,%r9b
  400be9: 75 cd                   jne    400bb8 <nth_bit_set+0x18>
  400beb: c3                      retq   

as in, not more than 31 cycles in 95% of calls to the binary search version, compared to not more than 28 cycles in 95% of calls to the bit-hack version; both run within 28 cycles in 50% of the cases. (The loop version takes up to 56 cycles in 95% of calls, up to 37 cycles median.)

To determine which one is better in actual real-world code, one would have to do a proper benchmark within the real-world task; at least with current x86-64 architecture processors, the work done is easily hidden in latencies incurred elsewhere (like function calls).

Buttonhook answered 3/8, 2017 at 14:38 Comment(0)
T
2

My answer is mostly based on this implementation of a 64-bit word select method (hint: look only at the MARISA_USE_POPCNT, MARISA_X64, MARISA_USE_SSE3 code paths).

It works in two steps, first selecting the byte containing the n-th set bit and then using a lookup table inside the byte:

  • Extract the lower and higher nibbles for every byte (bitmasks 0xF, 0xF0, shift the higher nibbles down)
  • Replace the nibble values by their popcount (_mm_shuffle_epi8 with A000120)
  • Sum the popcounts of the lower and upper nibbles (Normal SSE addition) to get byte popcounts
  • Compute the prefix sum over all byte popcounts (multiplication with 0x01010101...)
  • Propagate the position n to all bytes (SSE broadcast or again multiplication with 0x01010101...)
  • Do a bytewise comparison (_mm_cmpgt_epi8 leaves 0xFF in every byte smaller than n)
  • Compute the byte offset by doing a popcount on the result

Now we know which byte contains the bit and a simple byte lookup table like in grek40's answer suffices to get the result.
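Here is a rough sketch of those steps using SSE intrinsics (my own illustration, not the original library code; it assumes SSSE3 and the POPCNT instruction, and does the final in-byte step with a small loop instead of a lookup table):

#include <cstdint>
#include <immintrin.h>

// Returns the 0-based index of the n-th (0-based) set bit; requires n < popcount(x).
inline unsigned select64_ssse3(uint64_t x, unsigned n) {
    const __m128i nib_pop  = _mm_setr_epi8(            // popcounts of nibbles 0..15 (A000120)
        0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4);
    const __m128i low_mask = _mm_set1_epi8(0x0F);

    __m128i v   = _mm_cvtsi64_si128((long long)x);
    __m128i lo  = _mm_and_si128(v, low_mask);                      // lower nibbles
    __m128i hi  = _mm_and_si128(_mm_srli_epi64(v, 4), low_mask);   // upper nibbles
    __m128i cnt = _mm_add_epi8(_mm_shuffle_epi8(nib_pop, lo),
                               _mm_shuffle_epi8(nib_pop, hi));     // per-byte popcounts

    // Prefix sums over the 8 byte counts (the "multiplication with 0x0101..." step).
    uint64_t counts = (uint64_t)_mm_cvtsi128_si64(cnt);
    uint64_t prefix = counts * 0x0101010101010101ULL;

    // Broadcast n and compare byte-wise; the popcount of the mask is the byte index.
    __m128i pfx = _mm_cvtsi64_si128((long long)prefix);
    __m128i nn  = _mm_set1_epi8((char)(n + 1));
    unsigned gt = (unsigned)_mm_movemask_epi8(_mm_cmpgt_epi8(nn, pfx)) & 0xFFu;
    unsigned byte_idx = (unsigned)_mm_popcnt_u32(gt);
    if (byte_idx == 8) return 64;                                  // n >= popcount(x)

    unsigned before = byte_idx ? (unsigned)((prefix >> (8 * (byte_idx - 1))) & 0xFF) : 0;
    unsigned byte   = (unsigned)((x >> (8 * byte_idx)) & 0xFF);

    // In-byte step: a byte lookup table (as in grek40's answer) would go here; a tiny loop for brevity.
    for (unsigned i = 0, m = n - before; i < 8; ++i)
        if (((byte >> i) & 1) && m-- == 0) return 8 * byte_idx + i;
    return 64;
}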

Note, however, that I have not really benchmarked this against other implementations; I have only seen it to be quite efficient (and branchless).

Tacy answered 3/8, 2017 at 12:54 Comment(5)
Interesting. SSE isn't really portable though, my code should also run on ARM and possibly SPARC.Opuntia
Note that the code itself is portable if no SSE #defines are presentTacy
Yeah, but then it's not fast.Opuntia
Actually, I'm not sure about that - It uses no branches and only a medium number of elementary operations plus a lookup. Like in most cases, a benchmark could probably give more conclusive results ;)Tacy
pshufb / _mm_shuffle_epi8 is SSSE3, not SSE3.Trimetrogon
B
1

I can't see a method without a loop; what springs to mind would be:

int set = 0;
int pos = -1;
while(set < n) {
   pos++;
   if((bits & 0x01) == 1) set++;
   bits = bits >> 1;
}

after which, pos would hold the position of the nth lowest-value set bit.

The only other thing that I can think of would be a divide and conquer approach, which might yield O(log(n)) rather than O(n)...but probably not.

Edit: you said any behaviour, so non-termination is ok, right? :P

Brucite answered 6/10, 2011 at 0:3 Comment(0)
F
1
def bitN (l: Long, i: Int) : Long = {
  def bitI (l: Long, i: Int) : Long = 
    if (i == 0) 1L else 
    2 * { 
      if (l % 2 == 0) bitI (l / 2, i) else bitI (l /2, i-1) 
    }
  bitI (l, i) / 2
}

A recursive method (in Scala). Decrement i, the position, if l modulo 2 is 1. While returning, multiply by 2. Since the multiplication is invoked as the last operation, it is not tail recursive, but since Longs are of a known size in advance, the maximum stack depth is not too big.

scala> n.toBinaryString.replaceAll ("(.{8})", "$1 ")
res117: java.lang.String = 10110011 11101110 01011110 01111110 00111101 11100101 11101011 011000

scala> bitN (n, 40) .toBinaryString.replaceAll ("(.{8})", "$1 ")
res118: java.lang.String = 10000000 00000000 00000000 00000000 00000000 00000000 00000000 000000
Fao answered 6/10, 2011 at 6:20 Comment(0)
B
1

Edit

After giving it some thought and using the __builtin_popcount function, I figured it might be better to decide on the relevant byte and then compute the whole result instead of incrementally adding/subtracting numbers. Here is an updated version:

int GetBitAtPosition(unsigned i, unsigned n)
{
    unsigned bitCount;

    bitCount = __builtin_popcount(i & 0x00ffffff);
    if (bitCount <= n)
    {
        return (24 + LUT_BitPosition[i >> 24][n - bitCount]);
    }

    bitCount = __builtin_popcount(i & 0x0000ffff);
    if (bitCount <= n)
    {
        return (16 + LUT_BitPosition[(i >> 16) & 0xff][n - bitCount]);
    }

    bitCount = __builtin_popcount(i & 0x000000ff);
    if (bitCount <= n)
    {
        return (8 + LUT_BitPosition[(i >> 8) & 0xff][n - bitCount]);
    }

    return LUT_BitPosition[i & 0xff][n];
}

I felt like creating a LUT-based solution where the number is inspected in byte-sized chunks; however, the LUT for the n-th bit position grew quite large (256*8), and the LUT-free version that was discussed in the comments might be better.

Generally the algorithm would look like this:

unsigned i = 0x000006B5;
unsigned n = 4;
unsigned result = 0;
unsigned bitCount;
while (i)
{
    bitCount = LUT_BitCount[i & 0xff];
    if (n < bitCount)
    {
        result += LUT_BitPosition[i & 0xff][n];
        break; // found
    }
    else
    {
        n -= bitCount;
        result += 8;
        i >>= 8;
    }
}

It might be worth unrolling the loop into its at most 4 iterations to get the best performance on 32-bit numbers.

The LUT for bitcount (could be replaced by __builtin_popcount):

unsigned LUT_BitCount[] = {
    0, 1, 1, 2, 1, 2, 2, 3, // 0-7

    1, 2, 2, 3, 2, 3, 3, 4, // 8-15

    1, 2, 2, 3, 2, 3, 3, 4, // 16-23
    2, 3, 3, 4, 3, 4, 4, 5, // 24-31

    1, 2, 2, 3, 2, 3, 3, 4, // 32-39
    2, 3, 3, 4, 3, 4, 4, 5, // 40-47
    2, 3, 3, 4, 3, 4, 4, 5, // 48-55
    3, 4, 4, 5, 4, 5, 5, 6, // 56-63

    1, 2, 2, 3, 2, 3, 3, 4, // 64-71
    2, 3, 3, 4, 3, 4, 4, 5, // 72-79
    2, 3, 3, 4, 3, 4, 4, 5, // 80-87
    3, 4, 4, 5, 4, 5, 5, 6, // 88-95
    2, 3, 3, 4, 3, 4, 4, 5, // 96-103
    3, 4, 4, 5, 4, 5, 5, 6, // 104-111
    3, 4, 4, 5, 4, 5, 5, 6, // 112-119
    4, 5, 5, 6, 5, 6, 6, 7, // 120-127

    1, 2, 2, 3, 2, 3, 3, 4, // 128
    2, 3, 3, 4, 3, 4, 4, 5, // 136
    2, 3, 3, 4, 3, 4, 4, 5, // 144
    3, 4, 4, 5, 4, 5, 5, 6, // 152
    2, 3, 3, 4, 3, 4, 4, 5, // 160
    3, 4, 4, 5, 4, 5, 5, 6, // 168
    3, 4, 4, 5, 4, 5, 5, 6, // 176
    4, 5, 5, 6, 5, 6, 6, 7, // 184
    2, 3, 3, 4, 3, 4, 4, 5, // 192
    3, 4, 4, 5, 4, 5, 5, 6, // 200
    3, 4, 4, 5, 4, 5, 5, 6, // 208
    4, 5, 5, 6, 5, 6, 6, 7, // 216
    3, 4, 4, 5, 4, 5, 5, 6, // 224
    4, 5, 5, 6, 5, 6, 6, 7, // 232
    4, 5, 5, 6, 5, 6, 6, 7, // 240
    5, 6, 6, 7, 6, 7, 7, 8, // 248-255
};

The LUT for bit position within a byte:

unsigned LUT_BitPosition[][8] = {
    // 0-7
    {UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {0,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {1,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {0,1,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {2,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {0,2,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {1,2,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {0,1,2,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},

    // 8-15
    {3,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {0,3,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {1,3,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {0,1,3,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {2,3,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {0,2,3,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {1,2,3,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {0,1,2,3,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},

    // 16-31
    {4,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {0,4,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {1,4,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {0,1,4,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {2,4,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {0,2,4,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {1,2,4,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {0,1,2,4,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {3,4,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {0,3,4,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {1,3,4,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {0,1,3,4,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {2,3,4,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {0,2,3,4,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {1,2,3,4,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {0,1,2,3,4,UINT_MAX,UINT_MAX,UINT_MAX},

    // 32-63
    {5,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {0,5,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {1,5,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {0,1,5,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {2,5,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {0,2,5,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {1,2,5,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {0,1,2,5,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {3,5,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {0,3,5,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {1,3,5,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {0,1,3,5,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {2,3,5,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {0,2,3,5,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {1,2,3,5,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {0,1,2,3,5,UINT_MAX,UINT_MAX,UINT_MAX},
    {4,5,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {0,4,5,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {1,4,5,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {0,1,4,5,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {2,4,5,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {0,2,4,5,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {1,2,4,5,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {0,1,2,4,5,UINT_MAX,UINT_MAX,UINT_MAX},
    {3,4,5,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {0,3,4,5,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {1,3,4,5,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {0,1,3,4,5,UINT_MAX,UINT_MAX,UINT_MAX},
    {2,3,4,5,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {0,2,3,4,5,UINT_MAX,UINT_MAX,UINT_MAX},
    {1,2,3,4,5,UINT_MAX,UINT_MAX,UINT_MAX},
    {0,1,2,3,4,5,UINT_MAX,UINT_MAX},

    // 64-127
    {6,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {0,6,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {1,6,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {0,1,6,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {2,6,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {0,2,6,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {1,2,6,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {0,1,2,6,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {3,6,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {0,3,6,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {1,3,6,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {0,1,3,6,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {2,3,6,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {0,2,3,6,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {1,2,3,6,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {0,1,2,3,6,UINT_MAX,UINT_MAX,UINT_MAX},
    {4,6,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {0,4,6,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {1,4,6,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {0,1,4,6,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {2,4,6,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {0,2,4,6,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {1,2,4,6,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {0,1,2,4,6,UINT_MAX,UINT_MAX,UINT_MAX},
    {3,4,6,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {0,3,4,6,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {1,3,4,6,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {0,1,3,4,6,UINT_MAX,UINT_MAX,UINT_MAX},
    {2,3,4,6,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {0,2,3,4,6,UINT_MAX,UINT_MAX,UINT_MAX},
    {1,2,3,4,6,UINT_MAX,UINT_MAX,UINT_MAX},
    {0,1,2,3,4,6,UINT_MAX,UINT_MAX},
    {5,6,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {0,5,6,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {1,5,6,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {0,1,5,6,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {2,5,6,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {0,2,5,6,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {1,2,5,6,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {0,1,2,5,6,UINT_MAX,UINT_MAX,UINT_MAX},
    {3,5,6,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {0,3,5,6,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {1,3,5,6,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {0,1,3,5,6,UINT_MAX,UINT_MAX,UINT_MAX},
    {2,3,5,6,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {0,2,3,5,6,UINT_MAX,UINT_MAX,UINT_MAX},
    {1,2,3,5,6,UINT_MAX,UINT_MAX,UINT_MAX},
    {0,1,2,3,5,6,UINT_MAX,UINT_MAX},
    {4,5,6,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {0,4,5,6,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {1,4,5,6,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {0,1,4,5,6,UINT_MAX,UINT_MAX,UINT_MAX},
    {2,4,5,6,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {0,2,4,5,6,UINT_MAX,UINT_MAX,UINT_MAX},
    {1,2,4,5,6,UINT_MAX,UINT_MAX,UINT_MAX},
    {0,1,2,4,5,6,UINT_MAX,UINT_MAX},
    {3,4,5,6,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {0,3,4,5,6,UINT_MAX,UINT_MAX,UINT_MAX},
    {1,3,4,5,6,UINT_MAX,UINT_MAX,UINT_MAX},
    {0,1,3,4,5,6,UINT_MAX,UINT_MAX},
    {2,3,4,5,6,UINT_MAX,UINT_MAX,UINT_MAX},
    {0,2,3,4,5,6,UINT_MAX,UINT_MAX},
    {1,2,3,4,5,6,UINT_MAX,UINT_MAX},
    {0,1,2,3,4,5,6,UINT_MAX},

    // 128-255
    {7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {0,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {1,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {0,1,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {2,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {0,2,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {1,2,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {0,1,2,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {3,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {0,3,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {1,3,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {0,1,3,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {2,3,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {0,2,3,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {1,2,3,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {0,1,2,3,7,UINT_MAX,UINT_MAX,UINT_MAX},
    {4,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {0,4,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {1,4,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {0,1,4,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {2,4,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {0,2,4,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {1,2,4,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {0,1,2,4,7,UINT_MAX,UINT_MAX,UINT_MAX},
    {3,4,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {0,3,4,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {1,3,4,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {0,1,3,4,7,UINT_MAX,UINT_MAX,UINT_MAX},
    {2,3,4,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {0,2,3,4,7,UINT_MAX,UINT_MAX,UINT_MAX},
    {1,2,3,4,7,UINT_MAX,UINT_MAX,UINT_MAX},
    {0,1,2,3,4,7,UINT_MAX,UINT_MAX},
    {5,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {0,5,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {1,5,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {0,1,5,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {2,5,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {0,2,5,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {1,2,5,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {0,1,2,5,7,UINT_MAX,UINT_MAX,UINT_MAX},
    {3,5,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {0,3,5,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {1,3,5,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {0,1,3,5,7,UINT_MAX,UINT_MAX,UINT_MAX},
    {2,3,5,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {0,2,3,5,7,UINT_MAX,UINT_MAX,UINT_MAX},
    {1,2,3,5,7,UINT_MAX,UINT_MAX,UINT_MAX},
    {0,1,2,3,5,7,UINT_MAX,UINT_MAX},
    {4,5,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {0,4,5,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {1,4,5,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {0,1,4,5,7,UINT_MAX,UINT_MAX,UINT_MAX},
    {2,4,5,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {0,2,4,5,7,UINT_MAX,UINT_MAX,UINT_MAX},
    {1,2,4,5,7,UINT_MAX,UINT_MAX,UINT_MAX},
    {0,1,2,4,5,7,UINT_MAX,UINT_MAX},
    {3,4,5,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {0,3,4,5,7,UINT_MAX,UINT_MAX,UINT_MAX},
    {1,3,4,5,7,UINT_MAX,UINT_MAX,UINT_MAX},
    {0,1,3,4,5,7,UINT_MAX,UINT_MAX},
    {2,3,4,5,7,UINT_MAX,UINT_MAX,UINT_MAX},
    {0,2,3,4,5,7,UINT_MAX,UINT_MAX},
    {1,2,3,4,5,7,UINT_MAX,UINT_MAX},
    {0,1,2,3,4,5,7,UINT_MAX},
    {6,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {0,6,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {1,6,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {0,1,6,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {2,6,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {0,2,6,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {1,2,6,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {0,1,2,6,7,UINT_MAX,UINT_MAX,UINT_MAX},
    {3,6,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {0,3,6,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {1,3,6,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {0,1,3,6,7,UINT_MAX,UINT_MAX,UINT_MAX},
    {2,3,6,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {0,2,3,6,7,UINT_MAX,UINT_MAX,UINT_MAX},
    {1,2,3,6,7,UINT_MAX,UINT_MAX,UINT_MAX},
    {0,1,2,3,6,7,UINT_MAX,UINT_MAX},
    {4,6,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {0,4,6,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {1,4,6,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {0,1,4,6,7,UINT_MAX,UINT_MAX,UINT_MAX},
    {2,4,6,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {0,2,4,6,7,UINT_MAX,UINT_MAX,UINT_MAX},
    {1,2,4,6,7,UINT_MAX,UINT_MAX,UINT_MAX},
    {0,1,2,4,6,7,UINT_MAX,UINT_MAX},
    {3,4,6,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {0,3,4,6,7,UINT_MAX,UINT_MAX,UINT_MAX},
    {1,3,4,6,7,UINT_MAX,UINT_MAX,UINT_MAX},
    {0,1,3,4,6,7,UINT_MAX,UINT_MAX},
    {2,3,4,6,7,UINT_MAX,UINT_MAX,UINT_MAX},
    {0,2,3,4,6,7,UINT_MAX,UINT_MAX},
    {1,2,3,4,6,7,UINT_MAX,UINT_MAX},
    {0,1,2,3,4,6,7,UINT_MAX},
    {5,6,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {0,5,6,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {1,5,6,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {0,1,5,6,7,UINT_MAX,UINT_MAX,UINT_MAX},
    {2,5,6,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {0,2,5,6,7,UINT_MAX,UINT_MAX,UINT_MAX},
    {1,2,5,6,7,UINT_MAX,UINT_MAX,UINT_MAX},
    {0,1,2,5,6,7,UINT_MAX,UINT_MAX},
    {3,5,6,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {0,3,5,6,7,UINT_MAX,UINT_MAX,UINT_MAX},
    {1,3,5,6,7,UINT_MAX,UINT_MAX,UINT_MAX},
    {0,1,3,5,6,7,UINT_MAX,UINT_MAX},
    {2,3,5,6,7,UINT_MAX,UINT_MAX,UINT_MAX},
    {0,2,3,5,6,7,UINT_MAX,UINT_MAX},
    {1,2,3,5,6,7,UINT_MAX,UINT_MAX},
    {0,1,2,3,5,6,7,UINT_MAX},
    {4,5,6,7,UINT_MAX,UINT_MAX,UINT_MAX,UINT_MAX},
    {0,4,5,6,7,UINT_MAX,UINT_MAX,UINT_MAX},
    {1,4,5,6,7,UINT_MAX,UINT_MAX,UINT_MAX},
    {0,1,4,5,6,7,UINT_MAX,UINT_MAX},
    {2,4,5,6,7,UINT_MAX,UINT_MAX,UINT_MAX},
    {0,2,4,5,6,7,UINT_MAX,UINT_MAX},
    {1,2,4,5,6,7,UINT_MAX,UINT_MAX},
    {0,1,2,4,5,6,7,UINT_MAX},
    {3,4,5,6,7,UINT_MAX,UINT_MAX,UINT_MAX},
    {0,3,4,5,6,7,UINT_MAX,UINT_MAX},
    {1,3,4,5,6,7,UINT_MAX,UINT_MAX},
    {0,1,3,4,5,6,7,UINT_MAX},
    {2,3,4,5,6,7,UINT_MAX,UINT_MAX},
    {0,2,3,4,5,6,7,UINT_MAX},
    {1,2,3,4,5,6,7,UINT_MAX},
    {0,1,2,3,4,5,6,7},
};
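These tables do not have to be written out by hand; a short initializer (a minimal sketch of my own, not from the original answer) can fill both at program start:

#include <climits>

// unsigned LUT_BitCount[256];          // as declared above
// unsigned LUT_BitPosition[256][8];    // as declared above

static void initLUTs() {
    for (unsigned v = 0; v < 256; ++v) {
        unsigned count = 0;
        for (unsigned b = 0; b < 8; ++b)
            LUT_BitPosition[v][b] = UINT_MAX;      // "no such bit" marker
        for (unsigned b = 0; b < 8; ++b)
            if (v & (1u << b))
                LUT_BitPosition[v][count++] = b;   // index of the count-th set bit
        LUT_BitCount[v] = count;
    }
}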
Beckon answered 3/8, 2017 at 12:49 Comment(0)
E
1

My approach is to calculate the population count for each 8-bit quarter of the 32-bit integer in parallel, then find which quarter contains the nth bit. The population counts of the quarters below the one found can be summed up as the starting value for the later calculation.

After that, count set bits one by one until n is reached. Without branches, and using an incomplete implementation of the population count algorithm, my example is the following:

#include <stdio.h>
#include <stdint.h>

int main() {
    uint32_t n = 10, test = 3124375902u; /* 10111010001110100011000101011110 */
    uint32_t index, popcnt, quarter = 0, q_popcnt;

    /* count set bits of each quarter of 32-bit integer in parallel */
    q_popcnt = test - ((test >> 1) & 0x55555555);
    q_popcnt = (q_popcnt & 0x33333333) + ((q_popcnt >> 2) & 0x33333333);
    q_popcnt = (q_popcnt + (q_popcnt >> 4)) & 0x0F0F0F0F;

    popcnt = q_popcnt;

    /* find which quarters can be summarized and summarize them */
    quarter += (n + 1 >= (q_popcnt & 0xff));
    quarter += (n + 1 >= ((q_popcnt += q_popcnt >> 8) & 0xff));
    quarter += (n + 1 >= ((q_popcnt += q_popcnt >> 16) & 0xff));
    quarter += (n + 1 >= ((q_popcnt += q_popcnt >> 24) & 0xff));

    popcnt &= (UINT32_MAX >> (8 * quarter));
    popcnt = (popcnt * 0x01010101) >> 24;

    /* find the index of nth bit in quarter where it should be */
    index = 8 * quarter;
    index += ((popcnt += (test >> index) & 1) <= n);
    index += ((popcnt += (test >> index) & 1) <= n);
    index += ((popcnt += (test >> index) & 1) <= n);
    index += ((popcnt += (test >> index) & 1) <= n);
    index += ((popcnt += (test >> index) & 1) <= n);
    index += ((popcnt += (test >> index) & 1) <= n);
    index += ((popcnt += (test >> index) & 1) <= n);
    index += ((popcnt += (test >> index) & 1) <= n);

    printf("index = %u\n", index);
    return 0;
}

A simple approach which uses loops and conditionals can be the following as well:

#include <stdio.h>
#include <stdint.h>

int main() {
    uint32_t n = 11, test = 3124375902u; /* 10111010001110100011000101011110 */
    uint32_t popcnt = 0, index = 0;
    while(popcnt += ((test >> index) & 1), popcnt <= n && ++index < 32);

    printf("index = %u\n", index);
    return 0;
}
Estellaestelle answered 3/8, 2017 at 14:14 Comment(0)
L
1

I know the question asks for something faster than a loop, but a complicated loop-less answer is likely to take longer than a quick loop.

If the computer has 32-bit ints and v is a random value, then it might have, for example, 16 ones, and if we are looking for a random place among those 16 ones, we will typically be looking for something like the 8th one. Seven or eight times round a loop with just a couple of statements isn't too bad.

int findNthBit(unsigned int n, int v)
{
    int next;
    if (n > __builtin_popcount(v)) return 0;
    while (next = v&v-1, --n) 
    {
        v = next;
    }
    return v ^ next;
}

The loop works by removing the lowest set bit (n-1) times. The n'th one bit that would be removed is the one bit we were looking for.

If anybody wants to test this ....

#include "stdio.h"
#include "assert.h"

// function here

int main() {
    assert(findNthBit(1, 0)==0);
    assert(findNthBit(1, 0xf0f)==1<<0);
    assert(findNthBit(2, 0xf0f)==1<<1);
    assert(findNthBit(3, 0xf0f)==1<<2);
    assert(findNthBit(4, 0xf0f)==1<<3);
    assert(findNthBit(5, 0xf0f)==1<<8);
    assert(findNthBit(6, 0xf0f)==1<<9);
    assert(findNthBit(7, 0xf0f)==1<<10);
    assert(findNthBit(8, 0xf0f)==1<<11);
    assert(findNthBit(9, 0xf0f)==0);
    printf("looks good\n");
}

If there are concerns about the number of times the loop is executed, for example if the function is regularly called with large values of n, it's simple to add an extra line or two of the following form:

if (n > 8) return findNthBit(n-__builtin_popcount(v&0xff), v>>8)  << 8;

or

 if (n > 12) return findNthBit(n - __builtin_popcount(v&0xfff),  v>>12) << 12;

The idea here is that the n'th one will never be located in the bottom n-1 bits. A better version clears not only the bottom 8 or 12 bits, but all the bottom (n-1) bits when n is large-ish and we don't want to loop that many times.

if (n > 7) return findNthBit(n - __builtin_popcount(v & ((1<<(n-1))-1)), v>>(n-1)) << (n-1);

I tested this with findNthBit(20, 0xaf5faf5f) and after clearing out the bottom 19 bits because the answer wasn't to be found there, it looked for the 5th bit in the remaining bits by looping 4 times to remove 4 ones.

So an improved version is

int findNthBit(unsigned int n, int v)
{
    int next;
    if (n > __builtin_popcount(v)) return 0;
    if (n > 7) return findNthBit(n - __builtin_popcount(v & ((1<<(n-1))-1)), v>>(n-1)) << (n-1);
    while (next = v&v-1, --n) 
    {
        v = next;
    }
    return v ^ next;
}

The value 7, which limits the looping, is chosen fairly arbitrarily as a compromise between limiting looping and limiting recursion. The function could be further improved by removing recursion and keeping track of a shift amount instead. I may try this if I get some peace from home schooling my daughter!

Here is a final version with the recursion removed by keeping track of the number of low order bits shifted out from the bottom of the bits being searched.

Final version

int findNthBit(unsigned int n, int v)
{
    int shifted = 0; // running total
    int nBits;       // value for this iteration

    // handle no solution
    if (n > __builtin_popcount(v)) return 0;

    while (n > 7) 
    {
        // for large n shift out lower n-1 bits from v. 
        nBits = n-1;
        n -= __builtin_popcount(v & ((1<<nBits)-1));
        v >>= nBits;
        shifted += nBits;
    }

    int next;
    // n is now small, clear out n-1 bits and return the next bit
    // v&(v-1): a well known software trick to remove the lowest set bit.
    while (next = v&(v-1), --n) 
    {
        v = next;
    }
    return (v ^ next) << shifted;
}
Leger answered 26/2, 2021 at 16:40 Comment(0)
K
0

Building on the answer given by Jukka Suomela, which uses a machine-specific instruction that may not necessarily be available, it is also possible to write a function that does exactly the same thing as _pdep_u64 without any machine dependencies. It must loop over the set bits in one of the arguments, but can still be described as a constexpr function for C++11.

constexpr inline uint64_t deposit_bits(uint64_t x, uint64_t mask, uint64_t b, uint64_t res) {
    return mask != 0 ? deposit_bits(x, mask & (mask - 1), b << 1, ((x & b) ? (res | (mask & (-mask))) : res)) : res;
}

constexpr inline uint64_t nthset(uint64_t x, unsigned n)  {
    return deposit_bits(1ULL << n, x, 1, 0);
}
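Since both functions are constexpr, the result can even be checked at compile time; for instance (my own check, using the example from the question and the same C++14 binary literals as the PDEP answer):

static_assert(nthset(0b0000'1101'1000'0100'1100'1000'1010'0000, 3) ==
              0b0000'0000'0000'0000'0100'0000'0000'0000, "4th set bit");
static_assert(nthset(0b0000'1101'1000'0100'1100'1000'1010'0000, 9) ==
              0b0000'1000'0000'0000'0000'0000'0000'0000, "10th set bit");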
Kayceekaye answered 13/8, 2016 at 6:22 Comment(0)
O
0

Based on a method by Juha Järvi published in the famous Bit Twiddling Hacks, I tested this implementation where n and i are used as in the question:

    uint32_t a, b, c, t;   /* intermediate bit counts */
    uint32_t r, s;         /* r: remaining rank, s: resulting bit position */

    a = i - (i >> 1 & 0x55555555);
    b = (a & 0x33333333) + (a >> 2 & 0x33333333);
    c = b + (b >> 4) & 0x0f0f0f0f;

    r = n + 1;
    s = 0;
    t = c + (c >> 8) & 0xff;

    if (r > t) {
        s += 16;
        r -= t;
    }

    t = c >> s & 0xf;

    if (r > t) {
        s += 8;
        r -= t;
    }

    t = b >> s & 0x7;

    if (r > t) {
        s += 4;
        r -= t;
    }

    t = a >> s & 0x3;

    if (r > t) {
        s += 2;
        r -= t;
    }

    t = i >> s & 0x1;

    if (r > t)
        s++;

    return (s);

Based on my own tests, this is about as fast as the loop on x86, whereas it is 20% faster on arm64 and probably a lot faster on arm due to the fast conditional instructions, but I can't test this right now.

Opuntia answered 3/8, 2017 at 12:49 Comment(11)
The conditionals are slow on x86 and x86-64; see my version for the faster (no conditionals) version. For uniform random 32-bit inputs, it is measurably faster than the loop version. Also note that both my versions return 32 for the case where n is at least the number of set bits in the value; both versions are sped up a little bit more, if you know you are only interested in cases where n is less than the number of bits set.Buttonhook
@NominalAnimal clang compiles the conditionals into very fast conditional moves, which is why I'm not sure if bit fiddling is actually an advantage. Let me do some testing in this regard.Opuntia
A quick microbenchmark I did (just doing the operation in a loop to a million arguments) is problematic, because on the architecture I have access to (Intel Core i5), latencies dominate. It would be much more realistic and informative to combine the lookup with some sort of realistic computation using the result for each operation.Buttonhook
@NominalAnimal Try my permcode microbenchmark. The implementation "bitcount" contains the code from this answer.Opuntia
I get baseline 0.0571s 0.0551s 0.1122s 5.7085ns 5.5112ns 11.2197ns count 2.2034s 2.0861s 4.2895s 220.3420ns 208.6121ns 428.9541ns bitcount 2.0746s 0.2668s 2.3414s 207.4622ns 26.6761ns 234.1383ns bitcount2 1.9177s 0.2677s 2.1854s 191.7716ns 26.7723ns 218.5439ns popcount 1.6486s 0.0985s 1.7471s 164.8589ns 9.8484ns 174.7073ns where bitcount2 replaces the conditionals with bit hacks. (gcc-5.4.0, -O3 -march=native -mtune=native, on Intel Core i5-4200U.) Care to describe the columns (in the README)?Buttonhook
@NominalAnimal The first three columns are "decode/encode/decode+encode" total, the last three are per operation. Baseline exists to measure the function call overhead.Opuntia
@NominalAnimal Interesting! So neither of them beat the “slow” reference implementation I gave on your machine.Opuntia
Fastest was bmi2 0.1173s 0.0942s 0.2114s 11.7251ns 9.4196ns 21.1447ns, followed by shuffle 0.2871s 0.4735s 0.7606s 28.7095ns 47.3469ns 76.0564ns, vector 0.4970s 0.3811s 0.8781s 49.6953ns 38.1108ns 87.8061ns, and popcount 1.6787s 0.0978s 1.7766s 167.8728ns 9.7839ns 177.6567ns. On this run, baseline 0.0546s 0.0571s 0.1117s 5.4553ns 5.7139ns 11.1692ns, bitcount2 1.9105s 0.2658s 2.1762s 191.0461ns 26.5771ns 217.6232ns, and bitcount 2.0595s 0.2657s 2.3252s 205.9494ns 26.5750ns 232.5244ns.Buttonhook
@Nominal Note that shuffle and vector do not apply to this question.Opuntia
Yet, you included it in the benchmark. Harrumph. :) Actually, I do believe your benchmark is also "suffering" from lack of meaningful other work done with the results. While the microbenchmark in itself is quite correct, it may not reflect the timing behaviour of the functions in real world application code. (Especially on complex architectures like x86-64.) In other words, I am not sure if the microbenchmark results are reliable when considering which implementation to use in a real world application. My suggestion: use a compile-time switch (#ifdef .. #endif) in your actual application.Buttonhook
@NominalAnimal That's because the benchmark actually tests algorithms for some other problem which can be solved by repeated solution of this question but can also be solved with other approaches. I round trip the data through an algorithm and an inverse algorithm, so there is actually reasonable work being done.Opuntia
E
0

The PDEP solution is great, but some languages, like Java, do not expose this intrinsic yet, even though they are efficient at the other low-level operations. So I came up with the following fallback for such cases: a branchless binary search.

// n must be using 0-based indexing.
// This method produces correct results only if n is smaller
// than the number of set bits.
public static int getNthSetBit(long mask64, int n) {
    // Binary search without branching
    int base = 0;
    final int low32 = (int) mask64;
    final int high32n = n - Integer.bitCount(low32);
    final int inLow32 = high32n >>> 31;
    final int inHigh32 = inLow32 ^ 1;
    final int shift32 = inHigh32 << 5;
    final int mask32 = (int) (mask64 >>> shift32);
    n = ((-inLow32) & n) | ((-inHigh32) & high32n);
    base += shift32;

    final int low16 = mask32 & 0xffff;
    final int high16n = n - Integer.bitCount(low16);
    final int inLow16 = high16n >>> 31;
    final int inHigh16 = inLow16 ^ 1;
    final int shift16 = inHigh16 << 4;
    final int mask16 = (mask32 >>> shift16) & 0xffff;
    n = ((-inLow16) & n) | ((-inHigh16) & high16n);
    base += shift16;

    final int low8 = mask16 & 0xff;
    final int high8n = n - Integer.bitCount(low8);
    final int inLow8 = high8n >>> 31;
    final int inHigh8 = inLow8 ^ 1;
    final int shift8 = inHigh8 << 3;
    final int mask8 = (mask16 >>> shift8) & 0xff;
    n = ((-inLow8) & n) | ((-inHigh8) & high8n);
    base += shift8;

    final int low4 = mask8 & 0xf;
    final int high4n = n - Integer.bitCount(low4);
    final int inLow4 = high4n >>> 31;
    final int inHigh4 = inLow4 ^ 1;
    final int shift4 = inHigh4 << 2;
    final int mask4 = (mask8 >>> shift4) & 0xf;
    n = ((-inLow4) & n) | ((-inHigh4) & high4n);
    base += shift4;

    final int low2 = mask4 & 3;
    final int high2n = n - (low2 >> 1) - (low2 & 1);
    final int inLow2 = high2n >>> 31;
    final int inHigh2 = inLow2 ^ 1;
    final int shift2 = inHigh2 << 1;
    final int mask2 = (mask4 >>> shift2) & 3;
    n = ((-inLow2) & n) | ((-inHigh2) & high2n);
    base += shift2;

    // For the 2 bits remaining, we can take a shortcut
    return base + (n | ((mask2 ^ 1) & 1));
}
Extraterrestrial answered 30/9, 2022 at 1:31 Comment(0)
H
0

This may be the solution you were looking for, assuming you can't use BMI extensions.

#include <cstdint>

uint64_t clmul (uint64_t n, uint64_t m);   // forward declaration; defined below

uint64_t nth_set_fast (uint64_t m, int n) {

    // count set bits in every block of 7
    uint64_t pc = (m &~0xAA54A952A54A952A) + ((m &0xAA54A952A54A952A)>>1);
             pc = (pc&~0xCC993264C993264C) + ((pc&0xCC993264C993264C)>>2);
             pc = (pc&~0xF0E1C3870E1C3870) + ((pc&0xF0E1C3870E1C3870)>>4);

    // prefix scan partial sums
    pc *= 0x0102040810204081<<7;

    // copy n to all blocks
    uint64_t nn = uint64_t(n)* 0x0102040810204081;

    // subtract: nn - pc for each block, without carry between blocks
    uint64_t ss = nn + (~pc & ~(0x8102040810204081>>1)) + 0x8102040810204081;

    // find correct block
    uint64_t cc= ss & ~(ss>>7) & (0x8102040810204081>>1); cc>>=6;

    // block mask
    uint64_t bb = (cc<<8) -cc; 

    m &= bb; // zero all other blocks

    // xor-prefix scan; select odd/even depending on remainder bit
    uint64_t m0 = clmul(m ,0xFF) & m ; m0 ^=  m  & ( -(ss&cc));
    uint64_t m1 = clmul(m0,0xFF) & m0; m1 ^=  m0 & ( -((ss>>1)&cc));
    uint64_t m2 = clmul(m1,0xFF) & m1; m2 ^=  m1 & ( -((ss>>2)&cc));
    uint64_t m3 = clmul(m2,0xFF) & m2; m3 ^=  m2 & ( -((ss>>3)&cc)); // last step needed because of leftover bit at index 63

    return m3 & bb;
}

// carry-less multiplication; will compile to PCLMULQDQ on x86
// (the u64x2 vector type was not defined in the original; a GCC/Clang vector typedef is assumed)
typedef long long u64x2 __attribute__((vector_size(16)));

uint64_t clmul (uint64_t n, uint64_t m) {

    u64x2 a = {0, 0}, b = {0, 0};
    a[0] = n;
    b[0] = m;
    auto r = __builtin_ia32_pclmulqdq128(a, b, 0); // immediate 0: multiply the low 64-bit halves of a and b
    return r[0];
}

No loops, no conditional branches, no tables and no BMI intrinsics. With a bit of work it could be optimized so it runs fully on SIMD registers.

In case you don't have access to a carry-less multiply intrinsic, you can trivially implement it yourself, since the second parameter is fixed and only 8 bits long.
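For instance, a portable stand-in (a minimal sketch of my own, not part of the original answer) just XORs eight shifted copies, which matches the low 64 bits of clmul(n, 0xFF):

#include <cstdint>

// Carry-less multiply by the fixed constant 0xFF, keeping the low 64 bits:
// XOR together n shifted left by 0..7.
static inline uint64_t clmul_ff(uint64_t n) {
    uint64_t r = 0;
    for (int i = 0; i < 8; ++i)
        r ^= n << i;
    return r;
}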

Hoyle answered 23/9, 2023 at 23:50 Comment(0)
A
0

Since I was about to post a fairly portable answer to the related question "Is there a non-iterative way of finding the index of the Nth set bit?" before it got closed, I thought I'd post it here instead in case someone else finds it interesting or useful. I found it to be about 25% faster on average in a hot loop than the iterative bit-removing algorithm shown in the linked question, provided it was compiled with -msse4. It uses a look-up table, so there's the old argument about hot caches.

uint64_t pos_of_nth_bit2(uint64_t X, uint64_t bit) {
  // Requires that __builtin_popcountll(X) > bit.
  int32_t testx, pos, pop;
  int8_t lut[4][16] = {{0,0,1,0,2,0,1,0,3,0,1,0,2,0,1,0},
                       {0,0,0,1,0,2,2,1,0,3,3,1,3,2,2,1},
                       {0,0,0,0,0,0,0,2,0,0,0,3,0,3,3,2},
                       {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3}};
  _Bool test;
  pos = 0;
  pop = __builtin_popcount(X & 0xffffffffUL);
  test = pop <= bit;
  bit -= test*pop;
  testx = test*32;
  X >>= testx;
  pos += testx;
  pop = __builtin_popcount(X & 0xffffUL);
  test = pop <= bit;
  bit -= test*pop;
  testx = test*16;
  X >>= testx;
  pos += testx;
  pop = __builtin_popcount(X & 0xffUL);
  test = pop <= bit;
  bit -= test*pop;
  testx = test*8;
  X >>= testx;
  pos += testx;
  pop = __builtin_popcount(X & 0xfUL);
  test = pop <= bit;
  bit -= test*pop;
  testx = test*4;
  X >>= testx;
  pos += testx;
  return pos + lut[bit][X & 0xf];
}
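As a quick sanity check (a hypothetical driver of my own, not part of the original answer), it agrees with the question's example when bit indices are 0-based:

#include <cassert>
#include <cstdint>

// uint64_t pos_of_nth_bit2(uint64_t X, uint64_t bit);  // as defined above

int main() {
    uint64_t v = 0x0D84C8A0u;            // the 32-bit example from the question
    assert(pos_of_nth_bit2(v, 3) == 14); // 4th set bit (0-based n = 3) is at index 14
    return 0;
}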
Amsterdam answered 13/11, 2023 at 22:57 Comment(0)
N
0

I implemented a few of the ideas mentioned in this thread (in Zig). Here is my take on a carry-less multiply solution; see here for a diagram.

fn select(m: u64, n: u6) ?u6 {
    var x: @Vector(2, u64) = select_vec_helper(m);
    const y_vec: [2]@Vector(2, u64) = [2]@Vector(2, u64){
        [2]u64{ ~@as(u64, 0) ^ 0, ~@as(u64, 0) },
        [2]u64{ ~@as(u64, 0) ^ 1, ~@as(u64, 0) },
    };

    inline for (0..5) |i| {
        const y = y_vec[(n >> i) & 1];
        const prefix_xor = switch (builtin.cpu.arch) {
            .x86_64 => asm (
                \\ vpclmulqdq $0x00, %[x], %[y], %[out]
                : [out] "=x" (-> @Vector(2, u64)),
                : [x] "x" (x),
                  [y] "x" (y),
            ),

            .aarch64 => asm (
                \\ pmull %[out].1q, %[x].1d, %[y].1d
                : [out] "=w" (-> @Vector(2, u64)),
                : [x] "w" (x),
                  [y] "w" (y),
            ),

            else => unreachable,
        };

        x &= prefix_xor;
    }

    var z = x[0];
    // optimization: do last iteration without carryless multiply
    // it is not hard to unset the lowest set bit
    z &= (z & (~z +% 1)) ^ (~@as(u64, (n >> 5) & 1) +% 1);
    if (z == 0) return null;
    return @intCast(@ctz(z));
}

inline fn select_vec_helper(m: anytype) if (@inComptime() or !has_native_carryless_multiply or (@bitSizeOf(@TypeOf(m)) > 64 and @TypeOf(m) != @Vector(2, u64)))
    @TypeOf(m)
else
    @Vector(2, u64) {
    if (@inComptime() or !has_native_carryless_multiply or @bitSizeOf(@TypeOf(m)) > 64 or @TypeOf(m) == @Vector(2, u64)) return m;

    return switch (builtin.cpu.arch) {
        .x86_64 => [2]u64{ m, 0 },
        .aarch64 => [2]u64{ m, m },
        else => unreachable,
    };
}

This one was not that performant, unfortunately, so I implemented Tobias Ribizel's idea utilizing vectors.

fn select(m: u64, n: u6) ?u6 {
    // Could use some of the optimizations from the final implementation
    // but I don't think it would totally close the perf gap no matter what
    comptime var PRECOMP: [256][8]u6 = undefined;

    comptime {
        @setEvalBranchQuota(std.math.maxInt(u32));
        for (&PRECOMP, 0..) |*slot, i| {
            for (slot, 0..) |*sub_slot, j| {
                sub_slot.* = select1(i, j) orelse 0;
            }
        }
    }

    const v: @Vector(8, u8) = @bitCast(m);
    const prefix_sums: @Vector(8, u8) = @bitCast(@as(u64, @bitCast(@as(@Vector(8, u8), @popCount(v)))) *% 0x0101010101010101);
    const broadcasted = @as(@Vector(8, u8), @splat(@as(u8, n)));
    const mov_mask: u64 = @as(u8, @bitCast(prefix_sums <= broadcasted));
    if (~mov_mask == 0) return null;
    const byte_index: u6 = @intCast(@popCount(mov_mask));
    const prefix_sum = if (byte_index == 0) 0 else prefix_sums[byte_index - 1];
    return PRECOMP[v[byte_index]][n - prefix_sum] + (byte_index << 3);
}

Better, but the best implementation turned out to be the same idea using SWAR techniques. Gog & Petri had this idea many years ago; it was improved a little by Sebastiano Vigna, and Giuseppe Ottaviano made some further improvements when creating ot/succinct. I noticed a few minor optimizations on top of Ottaviano's implementation, here.

Here is the fastest implementation I settled on. (Here is a gist link I might update with a 32-bit fallback at some point)

const std = @import("std");
const builtin = @import("builtin");
const assert = std.debug.assert;

inline fn pdep(src: u64, mask: u64) u64 {
    return asm ("pdep %[mask], %[src], %[ret]"
        : [ret] "=r" (-> u64),
        : [src] "r" (src),
          [mask] "r" (mask),
    );
}

const USE_POPCNT = switch (builtin.cpu.arch) {
    .aarch64_32, .aarch64_be, .aarch64 => false,
    .mips, .mips64, .mips64el, .mipsel => std.Target.mips.featureSetHas(builtin.cpu.features, .cnmips),
    .powerpc, .powerpc64, .powerpc64le, .powerpcle => std.Target.powerpc.featureSetHas(builtin.cpu.features, .popcntd),
    .s390x => std.Target.s390x.featureSetHas(builtin.cpu.features, .miscellaneous_extensions_3),
    .ve => true,
    .avr => true,
    .msp430 => true,
    .riscv32, .riscv64 => std.Target.riscv.featureSetHas(builtin.cpu.features, .zbb),
    .sparc, .sparc64, .sparcel => std.Target.sparc.featureSetHas(builtin.cpu.features, .popc),
    .wasm32, .wasm64 => true,
    .x86, .x86_64 => std.Target.x86.featureSetHas(builtin.cpu.features, .popcnt),
    else => false,
};

/// Returns the position of the nth set bit of m (0-indexed; 64 is the error condition, i.e. when @popCount(m) <= n)
fn select(m: u64, n: u6) u8 {
    const USE_LOOKUP_TABLE = true;

    const cpu_name = builtin.cpu.model.llvm_name orelse builtin.cpu.model.name;
    if (!@inComptime() and comptime builtin.cpu.arch == .x86_64 and
        std.Target.x86.featureSetHas(builtin.cpu.features, .bmi2) and
        // pdep is microcoded (slow) on AMD architectures before Zen 3.
        (!std.mem.startsWith(u8, cpu_name, "znver") or cpu_name["znver".len] >= '3'))
    {
        return @ctz(pdep(@as(u64, 1) << n, m));
    }

    comptime var lookup_table: [256][8]u8 = undefined;

    comptime if (USE_LOOKUP_TABLE) {
        @setEvalBranchQuota(std.math.maxInt(u32));
        for (&lookup_table, 0..) |*slot, i| {
            for (slot, 0..) |*sub_slot, j| {
                sub_slot.* = selectByte(i, j);
            }
        }
    };

    const ones: u64 = 0x0101010101010101;

    var i = m;
    i -= (i >> 1) & 0x5555555555555555;
    i = (i & 0x3333333333333333) + ((i >> 2) & 0x3333333333333333);
    const prefix_sums = (((i + (i >> 4)) & 0x0F0F0F0F0F0F0F0F) *% ones);
    assert((prefix_sums & 0x8080808080808080) == 0);

    const broadcasted = ones * (@as(u64, n) | 0x80);
    const bit_isolate = ones * if (USE_POPCNT) 0x80 else 0x01;
    const mask = ((broadcasted - prefix_sums) >> if (USE_POPCNT) 0 else 7) & bit_isolate;

    // prove it is safe to optimize (x >> 56) << 3 to (x >> 53)
    const max_byte_index = @as(u64, ones) *% ones;
    assert(((max_byte_index >> 53) & 0b111) == 0);

    if (mask == bit_isolate) return 64;

    const byte_index: u6 = if (USE_POPCNT)
        @intCast(@popCount(mask) << 3)
    else
        @intCast((mask *% ones) >> 53);

    const prefix_sum: u6 = @truncate(prefix_sums << 8 >> byte_index);
    const target_byte: u8 = @truncate(m >> byte_index);
    const n_for_target_byte: u3 = @intCast(n - prefix_sum);

    return if (USE_LOOKUP_TABLE)
        lookup_table[target_byte][n_for_target_byte] + byte_index
    else
        selectByte(target_byte, @intCast(n_for_target_byte)) + byte_index;
}

fn selectByte(m: u8, n: u3) u4 {
    const ones: u64 = 0x0101010101010101;
    const unique_bytes: u64 = 0x8040_2010_0804_0201;
    const unique_bytes_diff_from_msb = (ones * 0x80) - unique_bytes;
    const prefix_sums = (((((m *% ones) & unique_bytes) + unique_bytes_diff_from_msb) >> 7) & ones) *% ones;

    const broadcasted = ones * (@as(u64, n) | 0x80);
    const bit_isolate = ones * if (USE_POPCNT) 0x80 else 0x01;
    const mask = (((broadcasted - prefix_sums) >> if (USE_POPCNT) 0 else 7) & bit_isolate);

    if (mask == bit_isolate) return 8;

    const bit_index: u3 = if (USE_POPCNT)
        @intCast(@popCount(mask))
    else
        @intCast((mask *% ones) >> 56);

    return bit_index;
}

If anyone would like to play around with my test-bench, it is available here. If you install Zig, you can do zig test ./path-to-file.zig to run the test or zig run ./path-to-file.zig -O ReleaseFast to run the main function with all the optimizations turned on. Here is a Godbolt link.

Nubbin answered 3/1, 2024 at 14:20 Comment(0)
