Bit twiddling: which bit is set?
Asked Answered
F

15

35

I have a 64-bit unsigned integer with exactly 1 bit set. I’d like to assign a value to each of the possible 64 values (in this case, the odd primes, so 0x1 corresponds to 3, 0x2 corresponds to 5, …, 0x8000000000000000 corresponds to 313).

It seems like the best way would be to convert 1 → 0, 2 → 1, 4 → 2, 8 → 3, …, 263 → 63 and look up the values in an array. But even if that’s so, I’m not sure what the fastest way to get at the binary exponent is. And there may be more efficient ways, still.

This operation will be used 1014 to 1016 times, so performance is a serious issue.

Foochow answered 12/8, 2010 at 6:3 Comment(5)
"This operation will be used 10^14 to 10^16 times, so performance is a serious issue." Neat! +1 just for that.Fugacity
This is essentially the same as #3272924Master
The fastest method may require CPU specific instructions.Fortunetelling
If you have a guarantee that exactly one bit is set, there may be a solution involving only a single (integer) multiply or multiply and table lookup. I'll have to think about it, or perhaps someone else will think about it first.Railroader
And I just found it, on the famous bit-twiddling page. See my newest answer below.Railroader
L
36

If performance is a serious issue, then you should use intrinsics/builtins to use CPU specific instructions, such as the ones found here for GCC:

http://gcc.gnu.org/onlinedocs/gcc-4.5.0/gcc/Other-Builtins.html

  • Built-in function int __builtin_ffs(unsigned int x).

    Returns one plus the index of the least significant 1-bit of x, or if x is zero, returns zero.

  • Built-in function int __builtin_clz(unsigned int x).

    Returns the number of leading 0-bits in x, starting at the most significant bit position. If x is 0, the result is undefined.

  • Built-in function int __builtin_ctz(unsigned int x).

    Returns the number of trailing 0-bits in x, starting at the least significant bit position. If x is 0, the result is undefined.

Things like this are the core of many O(1) algorithms, such as kernel schedulers which need to find the first non-empty queue signified by an array of bits.

Note: I’ve listed the unsigned int versions, but GCC has unsigned long long versions, as well.

Levana answered 12/8, 2010 at 6:11 Comment(4)
yup this is how you should do itJacquenette
The MSVC equivilents would be the BitScanForward64 and BitScanReverse64 intrinsics, there are also versions for systems not supported by either of the two here: chessprogramming.wikispaces.com/BitScanNoranorah
Is there a page somewhere that explains, for different chips, which intrinsics map to processor instructions and which use library calls?Foochow
Well this page (gcc.gnu.org/onlinedocs/gcc-4.5.0/gcc/…) has a listing of target specific intrinsics. I assume that the ones on the link I put in my response are available on all supported platforms since they don't specify the architecture. I think that for the most part, they won't rely on library calls (if ever), at worst I would expect that it would just put some inline code where it is used. Though you can't be sure until you test (or some docs say specifically).Levana
R
41

Finally an optimal solution. See the end of this section for what to do when the input is guaranteed to have exactly one non-zero bit: http://graphics.stanford.edu/~seander/bithacks.html#IntegerLogDeBruijn

Here's the code:

static const int MultiplyDeBruijnBitPosition2[32] = 
{
  0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8, 
  31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9
};
r = MultiplyDeBruijnBitPosition2[(uint32_t)(v * 0x077CB531U) >> 27];

You may be able to adapt this to a direct multiplication-based algorithm for 64-bit inputs; otherwise, simply add one conditional to see if the bit is in the upper 32 positions or the lower 32 positions, then use the 32-bit algorithm here.

Update: Here's at least one 64-bit version I just developed myself, but it uses division (actually modulo).

r = Table[v%67];

For each power of 2, v%67 has a distinct value, so just put your odd primes (or bit indices if you don't want the odd-prime thing) at the right positions in the table. 3 positions (0, 17, and 34) are not used, which might be convenient if you also want to accept all-bits-zero as an input.

Update 2: 64-bit version.

r = Table[(uint64_t)(val * 0x022fdd63cc95386dull) >> 58];

This is my original work, but I got the B(2,6) De Bruijn sequence from this chess site so I can't take credit for anything but figuring out what a De Bruijn sequence is and using Google. ;-)

Some additional remarks on how this works:

The magic number is a B(2,6) De Bruijn sequence. It has the property that, if you look at a 6-consecutive-bit window, you can obtain any six-bit value in that window by rotating the number appropriately, and that each possible six-bit value is obtained by exactly one rotation.

We fix the window in question to be the top 6 bit positions, and choose a De Bruijn sequence with 0's in the top 6 bits. This makes it so we never have to deal with bit rotations, only shifts, since 0's will come into the bottom bits naturally (and we could never end up looking at more than 5 bits from the bottom in the top-6-bits window).

Now, the input value of this function is a power of 2. So multiplying the De Bruijn sequence by the input value performs a bitshift by log2(value) bits. We now have in the upper 6 bits a number which uniquely determines how many bits we shifted by, and can use that as an index into a table to get the actual length of the shift.

This same approach can be used for arbitrarily-large or arbitrarily-small integers, as long as you're willing to implement the multiplication. You simply have to find a B(2,k) De Bruijn sequence where k is the number of bits. The chess wiki link I provided above has De Bruijn sequences for values of k ranging from 1 to 6, and some quick Googling shows there are a few papers on optimal algorithms for generating them in the general case.

Railroader answered 12/8, 2010 at 7:2 Comment(7)
nice! I'd just do (uint64_t)0x022fdd63cc95386dull, since, who knows, one day ull will be 128 bit.Illconditioned
@Jens: if you have c99 or c++0x, you could just do this and be safe: UINT64_C(0x022fdd63cc95386d)Levana
Technically omitting the suffix entirely is fine, but gcc gives annoying warnings when you do that unless you add -std=c99. As for the possibility that ull may be 128 bits, it doesn't matter. Any sane compiler will see that both operands fit in 64 bits, do a 64x64 multiply, and discard the upper 64 bits. I did the cast to uint64_t afterwards just in case ull is larger than 64 bits, to make sure the upper bits get discarded.Railroader
Excellent solution. It looks like a gcc builtin will do what I need, but this would be quite fast.Foochow
The builtin (if it uses asm, which I assume it does) is probably slightly faster. I wonder what kind of performance would be possible if Intel/AMD incorporated the De Bruijn sequence algorithm into their cpus, though.. Something to think about. Especially since this kind of computation is pretty important both in pseudo-O(1) kernel scheduling operations and in multimedia codecs.Railroader
Actually, one possible reason to try my answer: multiple computations can probably run in parallel on a single cpu core, either with SIMD or by performing several multiplications before you attempt to use the results of the first one.Railroader
I have some questions related to this, can you take a look on the post here? https://mcmap.net/q/428575/-de-bruijn-algorithm-binary-digit-count-64bits-c/1240074Vouvray
L
36

If performance is a serious issue, then you should use intrinsics/builtins to use CPU specific instructions, such as the ones found here for GCC:

http://gcc.gnu.org/onlinedocs/gcc-4.5.0/gcc/Other-Builtins.html

  • Built-in function int __builtin_ffs(unsigned int x).

    Returns one plus the index of the least significant 1-bit of x, or if x is zero, returns zero.

  • Built-in function int __builtin_clz(unsigned int x).

    Returns the number of leading 0-bits in x, starting at the most significant bit position. If x is 0, the result is undefined.

  • Built-in function int __builtin_ctz(unsigned int x).

    Returns the number of trailing 0-bits in x, starting at the least significant bit position. If x is 0, the result is undefined.

Things like this are the core of many O(1) algorithms, such as kernel schedulers which need to find the first non-empty queue signified by an array of bits.

Note: I’ve listed the unsigned int versions, but GCC has unsigned long long versions, as well.

Levana answered 12/8, 2010 at 6:11 Comment(4)
yup this is how you should do itJacquenette
The MSVC equivilents would be the BitScanForward64 and BitScanReverse64 intrinsics, there are also versions for systems not supported by either of the two here: chessprogramming.wikispaces.com/BitScanNoranorah
Is there a page somewhere that explains, for different chips, which intrinsics map to processor instructions and which use library calls?Foochow
Well this page (gcc.gnu.org/onlinedocs/gcc-4.5.0/gcc/…) has a listing of target specific intrinsics. I assume that the ones on the link I put in my response are available on all supported platforms since they don't specify the architecture. I think that for the most part, they won't rely on library calls (if ever), at worst I would expect that it would just put some inline code where it is used. Though you can't be sure until you test (or some docs say specifically).Levana
C
15

You could use a binary search technique:

int pos = 0;
if ((value & 0xffffffff) == 0) {
    pos += 32;
    value >>= 32;
}
if ((value & 0xffff) == 0) {
    pos += 16;
    value >>= 16;
}
if ((value & 0xff) == 0) {
    pos += 8;
    value >>= 8;
}
if ((value & 0xf) == 0) {
    pos += 4;
    value >>= 4;
}
if ((value & 0x3) == 0) {
    pos += 2;
    value >>= 2;
}
if ((value & 0x1) == 0) {
    pos += 1;
}

This has the advantage over loops that the loop is already unrolled. However, if this is really performance critical, you will want to test and measure every proposed solution.

Cavetto answered 12/8, 2010 at 6:11 Comment(1)
Nice clean implementation. You could also do it as a loop (with a fixed number of iterations) and let the compiler potentially unroll.Railroader
T
6

Some architectures (a suprising number, actually) have a single instruction that can do the calculation you want. On ARM it would be the CLZ (count leading zeroes) instruction. For intel, the BSF (bit-scan forward) or BSR (bit-scan reverse) instruction would help you out.

I guess this isn't really a C answer, but it will get you the speed you need!

Tc answered 12/8, 2010 at 6:10 Comment(0)
I
2
  • precalculate 1 << i (for i = 0..63) and store them in an array
  • use a binary search to find the index into the array of a given value
  • look up the prime number in another array using this index

Compared to the other answer I posted here, this should only take 6 steps to find the index (as opposed to a maximum of 64). But it's not clear to me whether one step of this answer is not more time consuming than just bit shifting and incrementing a counter. You may want to try out both though.

Impolicy answered 12/8, 2010 at 6:10 Comment(2)
+1 for binary search - and for considering that the solution may be slower than just shifting. Perhaps this problem calls for __asm {} for pure speed.Fugacity
+1 for binary search, but -1 for the ridiculous idea that you need an array to perform a binary search. You can do the binary search directly on the variable and it will actually be fast.Railroader
P
2

See http://graphics.stanford.edu/~seander/bithacks.html - specifically "Finding integer log base 2 of an integer (aka the position of the highest bit set)" - for some alternative algorithsm. (If you're really serious about speed, you might consider ditching C if your CPU has a dedicated instruction).

Paunchy answered 12/8, 2010 at 6:12 Comment(0)
B
2

Since speed, presumably not memory usage, is important, here's a crazy idea:

w1 = 1st 16 bits
w2 = 2nd 16 bits
w3 = 3rd 16 bits
w4 = 4th 16 bits

result = array1[w1] + array2[w2] + array3[w3] + array4[w4]

where array1..4 are sparsely populated 64K arrays that contain the actual prime values (and zero in the positions that don't correspond to bit positions)

Bolshevism answered 12/8, 2010 at 7:17 Comment(2)
Or better yet, result = array1[v&(1<<22)-1] + array2[v>>22&(1<<22)-1] + array3[v>>44]; At such sizes, the array will be sparse even in physical memory, i.e. the majority of the virtual memory will just be references to the zero page.Railroader
Also note that, since only 64 values of v ever occur, the entire table data set should fit in L2 cache and perhaps even L1. Moreover, since you don't care about any inputs but powers of 2, you could arrange it so that all 4/all 3 arrays occupy overlapping space.Railroader
T
2

@Rs solution is excellent this is just the 64 bit variant, with the table already calculated ...

static inline unsigned char bit_offset(unsigned long long self) {
    static const unsigned char mapping[64] = {
        [0]=0,   [1]=1,   [2]=2,   [4]=3,   [8]=4,   [17]=5,  [34]=6,  [5]=7,
        [11]=8,  [23]=9,  [47]=10, [31]=11, [63]=12, [62]=13, [61]=14, [59]=15,
        [55]=16, [46]=17, [29]=18, [58]=19, [53]=20, [43]=21, [22]=22, [44]=23,
        [24]=24, [49]=25, [35]=26, [7]=27,  [15]=28, [30]=29, [60]=30, [57]=31,
        [51]=32, [38]=33, [12]=34, [25]=35, [50]=36, [36]=37, [9]=38,  [18]=39,
        [37]=40, [10]=41, [21]=42, [42]=43, [20]=44, [41]=45, [19]=46, [39]=47,
        [14]=48, [28]=49, [56]=50, [48]=51, [33]=52, [3]=53,  [6]=54,  [13]=55,
        [27]=56, [54]=57, [45]=58, [26]=59, [52]=60, [40]=61, [16]=62, [32]=63
    };
    return mapping[((self & -self) * 0x022FDD63CC95386DULL) >> 58];
}

I built the table using the provided mask.

>>> ', '.join('[{0}]={1}'.format(((2**bit * 0x022fdd63cc95386d) % 2**64) >> 58, bit) for bit in xrange(64))
'[0]=0, [1]=1, [2]=2, [4]=3, [8]=4, [17]=5, [34]=6, [5]=7, [11]=8, [23]=9, [47]=10, [31]=11, [63]=12, [62]=13, [61]=14, [59]=15, [55]=16, [46]=17, [29]=18, [58]=19, [53]=20, [43]=21, [22]=22, [44]=23, [24]=24, [49]=25, [35]=26, [7]=27, [15]=28, [30]=29, [60]=30, [57]=31, [51]=32, [38]=33, [12]=34, [25]=35, [50]=36, [36]=37, [9]=38, [18]=39, [37]=40, [10]=41, [21]=42, [42]=43, [20]=44, [41]=45, [19]=46, [39]=47, [14]=48, [28]=49, [56]=50, [48]=51, [33]=52, [3]=53, [6]=54, [13]=55, [27]=56, [54]=57, [45]=58, [26]=59, [52]=60, [40]=61, [16]=62, [32]=63'

should the compiler complain:

>>> ', '.join(map(str, {((2**bit * 0x022fdd63cc95386d) % 2**64) >> 58: bit for bit in xrange(64)}.values()))
'0, 1, 2, 53, 3, 7, 54, 27, 4, 38, 41, 8, 34, 55, 48, 28, 62, 5, 39, 46, 44, 42, 22, 9, 24, 35, 59, 56, 49, 18, 29, 11, 63, 52, 6, 26, 37, 40, 33, 47, 61, 45, 43, 21, 23, 58, 17, 10, 51, 25, 36, 32, 60, 20, 57, 16, 50, 31, 19, 15, 30, 14, 13, 12'

^^^^ assumes that we iterate over sorted keys, this may not be the case in the future ...

unsigned char bit_offset(unsigned long long self) {
    static const unsigned char table[64] = {
        0, 1, 2, 53, 3, 7, 54, 27, 4, 38, 41, 8, 34, 55, 48,
        28, 62, 5, 39, 46, 44, 42, 22, 9, 24, 35, 59, 56, 49,
        18, 29, 11, 63, 52, 6, 26, 37, 40, 33, 47, 61, 45, 43,
        21, 23, 58, 17, 10, 51, 25, 36, 32, 60, 20, 57, 16, 50,
        31, 19, 15, 30, 14, 13, 12
    };
    return table[((self & -self) * 0x022FDD63CC95386DULL) >> 58];
}

simple test:

>>> table = {((2**bit * 0x022fdd63cc95386d) % 2**64) >> 58: bit for bit in xrange(64)}.values()
>>> assert all(i == table[(2**i * 0x022fdd63cc95386d % 2**64) >> 58] for i in xrange(64))
Trooper answered 21/7, 2015 at 23:53 Comment(0)
R
1

Short of using assembly or compiler-specific extensions to find the first/last bit that's set, the fastest algorithm is a binary search. First check if any of the first 32 bits are set. If so, check if any of the first 16 are set. If so, check if any of the first 8 are set. Etc. Your function to do this can directly return an odd prime at each leaf of the search, or it can return a bit index which you use as an array index into a table of odd primes.

Here's a loop implementation for the binary search, which the compiler could certainly unroll if that's deemed to be optimal:

uint32_t mask=0xffffffff;
int pos=0, shift=32, i;
for (i=6; i; i--) {
    if (!(val&mask)) {
        val>>=shift;
        pos+=shift;
    }
    shift>>=1;
    mask>>=shift;
}

val is assumed to be uint64_t, but to optimize this for 32-bit machines, you should special-case the first check, then perform the loop with a 32-bit val variable.

Railroader answered 12/8, 2010 at 6:12 Comment(0)
J
1

Call the GNU POSIX extension function ffsll, found in glibc. If the function isn't present, fall back on __builtin_ffsll. Both functions return the index + 1 of the first bit set, or zero. With Visual-C++, you can use _BitScanForward64.

Jacquenette answered 12/8, 2010 at 6:57 Comment(0)
I
0
unsigned bit_position = 0;
while ((value & 1) ==0)
{
   ++bit_position;
   value >>= 1;
}

Then look up the primes based on bit_position as you say.

Impolicy answered 12/8, 2010 at 6:6 Comment(5)
Too slow! The average position of a bit is about 10, that is, value = 1<<10. Your solution would take ~40 clocks or > 10ns. With the number of times I'm using this, that would take weeks or months.Foochow
@R...: Maybe. Then again, the code is small and would fit in the CPU's level 1 cache, which could make it very fast. Profiling is the only way to be sure.Lilla
Performing up to 10 times as many loop iterations as you should is not going to be fast. Optimal code can easily fit in L1 cache too.Railroader
Fine, but the crucial 'average bit position' information wasn't provided in the original question, so it would have been a bit hard to take that into account.Lilla
@Matthew: if the 64 possible numbers were chosen with equal probability this would take about 128 clocks, which is worse!Foochow
F
0

You may find that log(n) / log(2) gives you the 0, 1, 2, ... you're after in a reasonable timeframe. Otherwise, some form of hashtable based approach could be useful.

Fugacity answered 12/8, 2010 at 6:10 Comment(6)
You'd be converting a 64-bit integer to a floating point type, which I think is not a particularly cheap operation.Fortunetelling
True - perhaps one to save for a quantum computer.Fugacity
It's not just a question of how expensive it is, you need a compiler that supports long double, a double only has 53 bits of mantissa, so you cannot precisely convert a 64-bit to that type. I believe 80-bit long doubles have 64 bits of mantissa.Unbeaten
@Praetorian: actually, even plain float will work just fine. The argument is an exact power of 2, so zero bits of mantissa are needed. All that's needed is for the exponent to fit.Railroader
@R..: And a double has 11 bits of exponent, which is more than enough to fit values up to 64.Fortunetelling
@R..: Of course, you're right; I forgot about the exact power of 2 constraint.Unbeaten
R
0

Another answer assuming IEEE float:

int get_bit_index(uint64_t val)
{
    union { float f; uint32_t i; } u = { val };
    return (u.i>>23)-127;
}

It works as specified for the input values you asked for (exactly 1 bit set) and also has useful behavior for other values (try to figure out exactly what that behavior is). No idea if it's fast or slow; that probably depends on your machine and compiler.

Railroader answered 12/8, 2010 at 6:49 Comment(0)
F
0

From the GnuChess source:

unsigned char leadz (BitBoard b)
/**************************************************************************
 *
 *  Returns the leading bit in a bitboard.  Leftmost bit is 0 and
 *  rightmost bit is 63.  Thanks to Robert Hyatt for this algorithm.
 *
 ***************************************************************************/
{
  if (b >> 48) return lzArray[b >> 48];
  if (b >> 32) return lzArray[b >> 32] + 16;
  if (b >> 16) return lzArray[b >> 16] + 32;
  return lzArray[b] + 48;
}

Here lzArray is a pregenerated array of size 2^16. This'll save you 50% of the operations compared to a full binary search.

Furnishing answered 12/8, 2010 at 7:7 Comment(0)
H
0

This is for 32 bit, java, but it should be possible to adapt it to 64 bit. It assume this will be the fastest cause there is no branching involved.

static public final int msb(int n) {
    n |= n >>> 1;  
    n |= n >>> 2; 
    n |= n >>> 4; 
    n |= n >>> 8; 
    n |= n >>> 16; 
    n >>>= 1;
    n += 1; 
    return n;
}

static public final int msb_index(int n) {

    final int[] multiply_de_bruijn_bit_position = {
        0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8, 
        31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9
    };
    return multiply_de_bruijn_bit_position[(msb(n) * 0x077CB531) >>> 27];
}

Here is more information from: http://graphics.stanford.edu/~seander/bithacks.html#ZerosOnRightMultLookup

// Count the consecutive zero bits (trailing) on the right with multiply and lookup

unsigned int v;  // find the number of trailing zeros in 32-bit v 
int r;           // result goes here
static const int MultiplyDeBruijnBitPosition[32] = 
{
  0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8, 
  31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9
};
r = MultiplyDeBruijnBitPosition[((uint32_t)((v & -v) * 0x077CB531U)) >> 27];

// Converting bit vectors to indices of set bits is an example use for this. 
// It requires one more operation than the earlier one involving modulus 
// division, but the multiply may be faster. The expression (v & -v) extracts 
// the least significant 1 bit from v. The constant 0x077CB531UL is a de Bruijn 
// sequence, which produces a unique pattern of bits into the high 5 bits for 
// each possible bit position that it is multiplied against. When there are no 
// bits set, it returns 0. More information can be found by reading the paper 
// Using de Bruijn Sequences to Index 1 in a Computer Word by 
// Charles E. Leiserson, Harald Prokof, and Keith H. Randall. 

and as last: http://supertech.csail.mit.edu/papers/debruijn.pdf

Harrington answered 30/5, 2019 at 19:9 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.