De Bruijn algorithm binary digit count 64bits C#
Asked Answered
E

4

9

Im using the "De Bruijn" Algorithm to discover the number of digits in binary that a big number (up to 64bits) has.

For example:

  • 1022 has 10 digits in binary.
  • 130 has 8 digits in binary.

I found that using a table lookup based on De Bruijn give me the power to calculate this x100 times faster than conventional ways (power, square, ...).

According to this website, 2^6 has the table to calculate the 64 bits numbers. this would be the table exposed in c#

static readonly int[] MultiplyDeBruijnBitPosition2 = new int[64]
{
  0,1,2,4,8,17,34,5,11,23,47,31,63,62,61,59,
  55,46,29,58,53,43,22,44,24,49,35,7,15,30,60,57,
  51,38,12,25,50,36,9,18,37,10,21,42,20,41,19,39,
  14,28,56,48,33,3,6,13,27,54,45,26,52,40,16,32
};

(I dont know if i brought the table from that website correctly) Then, based on the R.. comment here. I should use this to use the table with the input uint64 number.

public static int GetLog2_DeBruijn(ulong v)
{
return MultiplyDeBruijnBitPosition2[(ulong)(v * 0x022fdd63cc95386dull) >> 58];
}

But the c# compiler doesnt allow me to use "0x022fdd63cc95386dull" because it overflows 64bits. And i have to use "0x022fdd63cc95386d" instead.

Using those codes. The problem is that i am not getting the correct result for the input given.

For example, doing 1.000.000 calculations of the number: 17012389719861204799 (64bits used) This is the result:

  • Using pow2 method i get the result 64 1 Million times in 1380ms.
  • Using DeBruijn method i get the result 40 1 Million times in 32ms. (Dont know why 40)

Im trying to understand how "De Bruijn" works, and how can i fix this and create a final code for c# to calculate up to 64bits numbers.

UDPATE and benchmarks of different solutions

I was looking for the fastest algorithm to get the number of digits in binary that a unsigned given number of 64bits has in c# (known as ulong).

For example:

  • 1024 has 11 binary digits. (2^10+1) or (log2[1024]+1)
  • 9223372036854775808 has 64 binary digits. (2^63+1) or (log2[2^63]+1)

The conventional power of 2 and square is extremely slow. and just for 10000 calculations it needs 1500ms to get the answer. (100M calculations needs hours).

Here, Niklas B., Jim Mischel, and Spender brought differents methods to make this faster.

  • SIMD and SWAR Techniques //Provided by Spender (Answer here)
  • De_Bruijn Splited 32bits //Provided by Jim Mischel (Answer here)
  • De_Bruijn 64bits version //Provided by Niklas B. (Answer here)
  • De_Bruijn 128bits version //Also provided by Niklas B. (Answer here)

Testing this Methods with a CPU Q6600 overclocked to 3Ghz using Windows 7 (64bits) Gives the following results.

Performance Test

As you can see, it takes just a few seconds to find correctly 100,000,000 of request given, being De_Bruijn 128bits version the fastest.

Thanks a lot to all of you, you help me a lot with this. I hope this helps you too.

Eyla answered 19/2, 2014 at 17:39 Comment(7)
0x022fdd63cc95386dull gives a compiler error. Wouldn't it be 0x022fdd63cc95386dUL? It doesn't overflow 64-bits here.Pincince
i tried 0x022fdd63cc95386dul before and it gives me the same 40 as result.Eyla
Your test is faulty. My 64bit version contains a branch, but you always test with the same number, so the branch predictor can predict perfectly. I guess if you test with random input, my branching version will be significantly slower. For your input the lookup table will not even be used in my branching version. Also, depending on your program the popcount-variant without a lookup table might actually be better since the table would get evicted from the cache in between calls if you only infrequently call the functionMultifid
I tested with 0, 1, 3, 17, 175, 1026, 123456789, 987654321, 2^45+11, 899809343(Prime big num), 2^63-2^20... The speed is rounded to the same always (+-200ms). But the answers are slightly different (for example using 0) And also, the 64bit version is faster when using >2^63 thanks to the branch.Eyla
@Ignacio the branch doesn't make anything faster, it makes it slower! But of course if you test the number multiple times bin a row the branch predictor kicks in. You have to test different numbers every time you call the function, otherwise those tests don't say anythingMultifid
thanks to you now im starting using it in the core system (it uses Billions of 64 bits numbers per hour, consider it random). Im using the 128bits version without branchs, but yesterday i used the 64bits version with the branch to prevent >2^63 making all the operations 2ms slower, and when is bigger than 2^63, is significantly faster (it needs 1/3 the normal time) so at the end, in this case, in big proportions, the branch compensate the lag of having it. But as i was saying, im using 128bits version that is always faster than any other. Thank you again.Eyla
What about the case where you want an array of the positions of all the bits set?Complexioned
M
5

You should check R..'s answer and his resource again. The question that he responded to was how to find the log2 for powers of two.

The bit twiddling website says that the simple multiplication + shift only works "If you know that v is a power of 2". Otherwise you need to round up to the next power of two first:

static readonly int[] bitPatternToLog2 = new int[64] { 
    0, // change to 1 if you want bitSize(0) = 1
    1,  2, 53,  3,  7, 54, 27, 4, 38, 41,  8, 34, 55, 48, 28,
    62,  5, 39, 46, 44, 42, 22,  9, 24, 35, 59, 56, 49, 18, 29, 11,
    63, 52,  6, 26, 37, 40, 33, 47, 61, 45, 43, 21, 23, 58, 17, 10,
    51, 25, 36, 32, 60, 20, 57, 16, 50, 31, 19, 15, 30, 14, 13, 12
}; // table taken from http://chessprogramming.wikispaces.com/De+Bruijn+Sequence+Generator
static readonly ulong multiplicator = 0x022fdd63cc95386dUL;

public static int bitSize(ulong v) {
    v |= v >> 1;
    v |= v >> 2;
    v |= v >> 4;
    v |= v >> 8;
    v |= v >> 16;
    v |= v >> 32;
    // at this point you could also use popcount to find the number of set bits.
    // That might well be faster than a lookup table because you prevent a 
    // potential cache miss
    if (v == (ulong)-1) return 64;
    v++;
    return MultiplyDeBruijnBitPosition2[(ulong)(v * multiplicator) >> 58];
}

Here is a version with a larger lookup table that avoids the branch and one addition. I found the magic number using random search.

static readonly int[] bitPatternToLog2 = new int[128] {
    0, // change to 1 if you want bitSize(0) = 1
    48, -1, -1, 31, -1, 15, 51, -1, 63, 5, -1, -1, -1, 19, -1, 
    23, 28, -1, -1, -1, 40, 36, 46, -1, 13, -1, -1, -1, 34, -1, 58,
    -1, 60, 2, 43, 55, -1, -1, -1, 50, 62, 4, -1, 18, 27, -1, 39, 
    45, -1, -1, 33, 57, -1, 1, 54, -1, 49, -1, 17, -1, -1, 32, -1,
    53, -1, 16, -1, -1, 52, -1, -1, -1, 64, 6, 7, 8, -1, 9, -1, 
    -1, -1, 20, 10, -1, -1, 24, -1, 29, -1, -1, 21, -1, 11, -1, -1,
    41, -1, 25, 37, -1, 47, -1, 30, 14, -1, -1, -1, -1, 22, -1, -1,
    35, 12, -1, -1, -1, 59, 42, -1, -1, 61, 3, 26, 38, 44, -1, 56
};
static readonly ulong multiplicator = 0x6c04f118e9966f6bUL;

public static int bitSize(ulong v) {
    v |= v >> 1;
    v |= v >> 2;
    v |= v >> 4;
    v |= v >> 8;
    v |= v >> 16;
    v |= v >> 32;
    return bitPatternToLog2[(ulong)(v * multiplicator) >> 57];
}

You should definitely check other tricks to compute the log2 and consider using the MSR assembly instruction if you are on x86(_64). It gives you the index of the most significant set bit, which is exactly what you need.

Multifid answered 19/2, 2014 at 17:56 Comment(11)
it still give me wrong results, i didnt use v-- and v++, now for example, the number 17012389719861204799 say that is 32, and the rest are other wierd results. any advise?Eyla
@IgnacioBustos: Please read my answer carefuly: "The function won't work for v = 0 or v > 2^63, because in those cases you get overflows. You might want to precheck those cases or restrict your input range." So you have to branch if you want to support these inputs with the suggested technique: if (v == 0) return 0; and before the v++;: if (v == (ulong)-1) return 64;Multifid
@IgnacioBustos: Also, you need to use v-- and v++, because that's how the algorithm works. You can get around the v++ by using a different magic number than 0x022fdd63cc95386d, but again, I've written all of this in my answer already.Multifid
i know what did you say,but appart of that, i tried differents regular numbers, with and without v++ or v--, numbers under 8 included works correctly, after this it start to jump, 1024 gives 39, and 1M gives 3 for example. i think the problem might be in the table or 0x022fdd63cc95386d for 64bits calculations. I want to fix this, because is the fastest one, just around 44ms..Eyla
@IgnacioBustos: Your table seems to be wrong. I included a version that works.Multifid
I only tested it in Python since I'm on Linux so bear with me if there is a problem with the C# code (I know almost no C#)Multifid
That was the problem, and also, i desactivated v-- because if not, 1023 give me 10, and it has to be 9. With this change the result is the log2 of the number plus one to get the number of digits. Added also if(v>=2^63) and if(v==0) to get apropiated answers. Time spent around 46ms. :D ill explain all the results at the top in a while.Eyla
@Ignacio: You can leave away the if (v==0) branch if you don't do v--. I edited the answer. By the way, your table was just the reverse of the correct one (dunno where you got that, since it's not actually useful in the other direction)Multifid
i took the table as i read it... very bad for me. About v--, i didnt see u changed that, but anyway, without it the result is the log2+1 that i am looking for, except on 2^63 or more, that gives 0 (overflow) and 0 that gives 0 (has to give 1, or that is what i want) so 1 or 2 conditions fix this and in those cases it takes just 17ms. This is what i was looking for, but ill make a sumary of all of the options together in the last edit of the question.Eyla
@IgnacioBustos: As an additional bonus, I added a version with a larger lookup table that avoids the branching and one addition (could be significantly faster).Multifid
@IgnacioBustos: Oh and you can just change the lookup table for the special case v == 0. No need for an extra branch.Multifid
P
3

After perusing various bit-twiddling info, this is how I'd do it... don't know how this stacks up next to DeBruijn, but should be considerably faster than using powers.

ulong NumBits64(ulong x)
{
    return (Ones64(Msb64(x) - 1ul) + 1ul);
}

ulong Msb64(ulong x)
{  
    //http://aggregate.org/MAGIC/
    x |= (x >> 1);
    x |= (x >> 2);
    x |= (x >> 4);
    x |= (x >> 8);
    x |= (x >> 16);
    x |= (x >> 32);
    return(x & ~(x >> 1));
}

ulong Ones64(ulong x)
{
    //https://chessprogramming.wikispaces.com/SIMD+and+SWAR+Techniques
    const ulong k1 = 0x5555555555555555ul;
    const ulong k2 = 0x3333333333333333ul;
    const ulong k4 = 0x0f0f0f0f0f0f0f0ful;
    x = x - ((x >> 1) & k1);
    x = (x & k2) + ((x >> 2) & k2);
    x = (x + (x >> 4)) & k4;
    x = (x * 0x0101010101010101ul) >> 56;
    return x;
}
Pincince answered 19/2, 2014 at 18:10 Comment(2)
I checked this one, 88ms with 1M calculations and correct answers. cool, let me try the rest methods to beat this one. :D, i will edit with my final results on the question soon.Eyla
I edited the question with the solution, including your option tested.Eyla
B
2

When I looked into this a while back for 32 bits, the DeBruijn sequence method was by far the fastest. See https://mcmap.net/q/429819/-find-bit-position-without-using-log

What you could do for 64 bits is split the number into two 32-bit values. If the high 32 bits is non-zero, then run the DeBruijn calculation on it, and then add 32. If the high 32 bits is zero, then run the DeBruijn calculation on the low 32 bits.

Something like this:

int NumBits64(ulong val)
{
    if (val > 0x00000000FFFFFFFFul)
    {
        // Value is greater than largest 32 bit number,
        // so calculate the number of bits in the top half
        // and add 32.
        return 32 + GetLog2_DeBruijn((int)(val >> 32));
    }
    // Number is no more than 32 bits,
    // so calculate number of bits in the bottom half.
    return GetLog2_DeBruijn((int)(val & 0xFFFFFFFF));
}

int GetLog2_DeBruijn(int val)
{
    uint32 v = (uint32)val;
    int r;      // result goes here

    static const int MultiplyDeBruijnBitPosition[32] = 
    {
        0, 9, 1, 10, 13, 21, 2, 29, 11, 14, 16, 18, 22, 25, 3, 30,
        8, 12, 20, 28, 15, 17, 24, 7, 19, 27, 23, 6, 26, 5, 4, 31
    };

    v |= v >> 1; // first round down to one less than a power of 2 
    v |= v >> 2;
    v |= v >> 4;
    v |= v >> 8;
    v |= v >> 16;

    r = MultiplyDeBruijnBitPosition[(uint32_t)(v * 0x07C4ACDDU) >> 27];
    return r;
}
Bilek answered 19/2, 2014 at 18:34 Comment(8)
One small suggestion would be to use (uint)-1 or (1ul<<32)-1 to make the magic number a bit more readableMultifid
I also don't think this works for vals that are not powers of two, or does it? I don't think the method is so much superior to the others if you first need to isolate the MSB using a lg(lg(n)) algorithmMultifid
@NiklasB. I updated my answer to show the version that doesn't require val to be a power of 2. If you know that val is a power of 2, then you just need the lookup and multiply.Bilek
Yes, but then you need a different constant (0x077CB531U according to graphics.stanford.edu/~seander/bithacks.html#IntegerLogDeBruijn)Multifid
Do you know whether it's still faster than other popcount-based approaches even with the v |= ... preamble? That would be interesting to seeMultifid
@NiklasB. The constant I have in the example above is the correct one. I don't recall how it compares to the other techniques. The pure table based approach might be faster.Bilek
Yes it's correct. But if you have powers of two and just do multiply + lookup, you need a different constant (that is in addition to your comment about powers of two)Multifid
i checked this one with 10 differents numbers, including 0 and the uint64.maxvalue and works perfectly. The result needs to add +1 to get the number of digits. Time spent for 1M calculations: ~53ms. Amazing, ill put all of this info in a final edit after all the tests.Eyla
P
1

Edit: This solution is not recommanded as it requires branching for Zero.

After reading Niklas B's answer I spent a few hours on researching this, and realize all magic multiplicator has to be in the last nth in order to suit for 64-elements lookup table (I don't have the necessary knowledge to explain why).

So I used exactly the same generator mentioned by that answer to find the last sequence, here is the C# code:

// used generator from http://chessprogramming.wikispaces.com/De+Bruijn+Sequence+Generator
static readonly byte[] DeBruijnMSB64table = new byte[]
{
    0 , 47, 1 , 56, 48, 27, 2 , 60,
    57, 49, 41, 37, 28, 16, 3 , 61,
    54, 58, 35, 52, 50, 42, 21, 44,
    38, 32, 29, 23, 17, 11, 4 , 62,
    46, 55, 26, 59, 40, 36, 15, 53,
    34, 51, 20, 43, 31, 22, 10, 45,
    25, 39, 14, 33, 19, 30, 9 , 24,
    13, 18, 8 , 12, 7 , 6 , 5 , 63,
};
// the cyclc number has to be in the last 16th of all possible values
// any beyond the 62914560th(0x03C0_0000) should work for this purpose
const ulong DeBruijnMSB64multi = 0x03F79D71B4CB0A89uL; // the last one
public static byte GetMostSignificantBit(this ulong value)
{
    value |= value >> 1;
    value |= value >> 2;
    value |= value >> 4;
    value |= value >> 8;
    value |= value >> 16;
    value |= value >> 32;

    return DeBruijnMSB64table[value * DeBruijnMSB64multi >> 58];
}
Pitchstone answered 16/3, 2016 at 3:19 Comment(4)
Immediately after I find this number out, I Google'd it and find the author of the generator got this number 10 years ago but didn't realize it can be used to get the MSB back then; and it's already listed herePitchstone
this solution seems to be discovered or at least not late than year of 2012Pitchstone
After the OR+shift sequence, your value is 1 less than the next power of 2. That is, if the MSB is bit 3, then the value will be 2^4-1, which is 0xF. But note that multiplying by 2^(N+1) is the same as shifting left by N+1. Taking the top 6 bits allows you to get any of the 64 possible permutations in the DeBruijn string. (Note that the 5 MSBs in the string are 0s!) This function is similar to the one used for the LSB, which uses an exact power of two, but it was clever to note subtracting the top 6 bits of the multiplier (which happen to be zero) doesn't cause a collision in the function.Pecoraro
The reason for extracting the top 6 MSBs is two-fold. First, the left shift performed by multiplying by the power-of-two rotates the DeBruijn string. Left shift shifts in zeros, so if you took any other 6 consecutive bits, more than one input would produce zero, and you it would not be possible to use a LUT. Because the 5 MSBs of the string are zero, left shift and left rotate produce the same result in the top 6 bits, which is what allows multiply to be used to index the DeBruijn string. This would coincidentally also work for bits 57-62. However, you'd additionally need to mask off bit 63.Pecoraro

© 2022 - 2024 — McMap. All rights reserved.