Hamming weight based indexing

Asked 24/11, 2012 at 15:53 Answered 28/11, 2012 at 20:52

Solved c sorting cuda indexing hammingweight

Assume we have a integer of bitsize n=4;
The problem I am describing is how you would go about indexing a number to an array position based on the Hamming weight and its value knowing the bitsize. E.g. An array with 16 elements for bitsize 4 would/could look like this:

|0|1|2|4|8|3|5|6|9|10|12|7|11|13|14|15|

Where elements are grouped by their Hamming weight(necessary) and sorted based on size(not necessary). Sorting is not necessary as long as you can take e.g. 3(0011) do some operations and get back index 5, 5(0101) -> 6 etc.

All combinations of n bits will be present and there will be no duplication. E.g. bitsize of 3 would have the array:

|0|1|2|4|3|5|6|7|

I would preferably have a solution without loops. Or any papers that discuss simillar solutions. Or finally just throw out any ides on how you could go about doing that.

Kliman answered 24/11, 2012 at 15:53 Comment(6)

I am not sure I understand what you are trying to accomplish. In CUDA you can determine the Hamming weight of an integer with the __popc() intrinsic, which is implemented on newer GPUs as a fast hardware instruction and maps to an efficient software emulation otherwise. You can then use the output of __popc() to index into the array representing your lookup table. – Hamann 24/11, 2012 at 18:4

The thing is right now, I am pulling memory all over the place, aka random access. I want to group memory by their popcount, as I request memory only for a certain popcount (_popc(x) <= z) for each kernel and following kernels. So by having a function f that translates from the popcount and value gives me an index in the array I will not be penalised by random access. The lookup table you describe also need a function that takes the value and gives back a index, which is the same problem. I am working with big arrays think 2^n where n > 20, so space is constrained. – Kliman 24/11, 2012 at 18:28

Does the conversion n <-> index need to be fast? – Dryfoos 24/11, 2012 at 19:27

The faster the better, but "slow" works fine too, it should avoid any memory operations and do it arithmetically. I think I have a solution, but need eat first. – Kliman 24/11, 2012 at 20:33

Hmm, my idea was 1. number of integers of given width with fewer bits set. 2. count how many smaller numbers with the same number of bits exist, 3. add. It's O((log width)^a) for a either 1 or 2 don't remember OTTOMH, probably too slow for your purposes. – Dryfoos 24/11, 2012 at 22:45

Thought of that to, just the step one that I can not think of how to calculate. it is zigma(i=1,b-1)pick(i,n) where b == current bitsize and n == maximum bitsize how do you do that with an O(~1) operation? – Kliman 25/11, 2012 at 12:4

Note that you can enumerate numbers (in counting order) with the same hamming weight using the following functions:

int next(int n) { // get the next one with same # of bits set
  int lo = n & -n;       // lowest one bit
  int lz = (n + lo) & ~n;      // lowest zero bit above lo
  n |= lz;                     // add lz to the set
  n &= ~(lz - 1);              // reset bits below lz
  n |= (lz / lo / 2) - 1;      // put back right number of bits at end
  return n;
}

int prev(int n) { // get the prev one with same # of bits set
   int y = ~n;
   y &= -y; // lowest zero bit
   n &= ~(y-1); // reset all bits below y
   int z = n & -n; // lowest set bit
   n &= ~z;        // clear z bit
   n |= (z - z / (2*y)); // add requried number of bits below z
   return n;
 }

As an example, repititive application of prev() on x = 5678:

0: 00000001011000101110 (5678)
1: 00000001011000101101 (5677)
2: 00000001011000101011 (5675)
3: 00000001011000100111 (5671)
4: 00000001011000011110 (5662)
5: 00000001011000011101 (5661)
6: 00000001011000011011 (5659)
.....

Hence theoretically you can compute the index of a number by repititive application of this. However this can take very long. The better approach would be to "jump" over some combinations.

There are 2 rules:

 1. if the number starts with: ..XXX10..01..1 we can replace it by ..XXX0..01..1
adding corresponding number of combinations
 2. if the number starts with: ..XXX1..10..0 again replace it by XXX0..01..1 with corresponding number of combinations

The following algorithm computes the index of a number among the numbers with the same Hamming weight (i did not bother about fast implementation of binomial):

#define LOG2(x) (__builtin_ffs(x)-1)

int C(int n, int k) { // simple implementation of binomial
 int c = n - k; 
 if(k < c) 
   std::swap(k,c);
 if(c == 0)
  return 1;
 if(k == n-1) 
  return n;
 int b = k+1;
 for(int i = k+2; i <= n; i++) 
    b = b*i;
 for(int i = 2; i <= c; i++)
   b = b / i;
 return b;
}
int position_jumping(unsigned x) {
   int index = 0;
  while(1) {

    if(x & 1) { // rule 1: x is of the form: ..XXX10..01..1
        unsigned y = ~x;
        unsigned lo = y & -y; // lowest zero bit
        unsigned xz = x & ~(lo-1); // reset all bits below lo
        unsigned lz = xz & -xz; // lowest one bit after lo
        if(lz == 0) // we are in the first position!
           return index;

        int nn = LOG2(lz), kk = LOG2(lo)+1;       
        index += C(nn, kk); //   C(n-1,k) where n = log lz and k = log lo + 1

        x &= ~lz; //! clear lz bit
        x |= lo; //! add lo

    } else { // rule 2: x is of the form: ..XXX1..10..0
        int lo = x & -x; // lowest set bit
        int lz = (x + lo) & ~x;  // lowest zero bit above lo  
        x &= ~(lz-1); // clear all bits below lz
        int sh = lz / lo;

        if(lz == 0) // special case meaning that lo is in the last position
            sh=((1<<31) / lo)*2;
        x |= sh-1;

        int nn = LOG2(lz), kk = LOG2(sh);
        if(nn == 0)
           nn = 32;
        index += C(nn, kk);
    }
    std::cout << "x: " << std::bitset<20>(x).to_string() << "; pos: " << index << "\n";
  }
 }

For example, given the number x=5678 the algorithm will compute its index in just 4 iterations:

  x: 00000001011000100111; pos: 4
  x: 00000001011000001111; pos: 9
  x: 00000001010000011111; pos: 135
  x: 00000001000000111111; pos: 345
  x: 00000000000001111111; pos: 1137

Note that 1137 is the position of 5678 within the group of numbers with the same Hamming weight. Hence you would have to shift this index accordingly to account for all the numbers with smaller Hamming weights

Madaras answered 28/11, 2012 at 20:52 Comment(6)

Interesting. Will have to look at it later though, but looks promising. – Kliman 28/11, 2012 at 23:37

@ks6g10: for computation of binomials I would suggest you using Pascal triangle. You can compute it pretty fast on the gpu in shared mem. E.g. for 24-bit numbers you would need 24*24 words shared space which is affordable. Alternatively, you can just prepare the table on the host and load it into constant memory. Note that, the solution of @ cctan is good but it uses a lot of transcendental math which might not be very fast on the gpu – Madaras 29/11, 2012 at 10:20

@asm It is really strange why your answers are good, but not much people upvote them. I wish I can +2 to this. – Shortie 4/12, 2012 at 9:22

thanks @cctan, apprecitate that. People usually do not vote for something they either do not understand or which is not popular, i.e. not related to iphone or C# or similar stuff ;) – Madaras 4/12, 2012 at 11:50

@asm Have been really busy lately. But at the moment my application is bound by bandwidth so I can afford to use instructions quite heavily, as long as I do not exceed the amount of register used for 100% utilization. The table mentioned, do you mean like the answer here #3959711. Or more a "pick 2 out of 12 (P[2][12])" table? – Kliman 5/12, 2012 at 14:48

@ks6g10: i mean the table to store a pascal triangle in shared memory (not in registers!) since you need threads to communicate in order to evaluate it. Basically, one warp is sufficient to precoompute a triangle for 32-bit integers – Madaras 5/12, 2012 at 16:5

Here is a concept work, just to get the discussion started.
The step one is hardest - solved using approximation to calculate factorials.
Anymore bright ideas?

Ideone link

#include <stdio.h>
#include <math.h>

//gamma function using Lanczos approximation formula
//output result in log base e
//use exp() to convert back
//has a nice side effect: can store large values in small [power of e] form
double logGamma(double x)
{
    double tmp = (x-0.5) * log(x+4.5) - (x+4.5);
    double ser = 1.0 + 76.18009173     / (x+0) - 86.50532033    / (x+1)
                     + 24.01409822     / (x+2) -  1.231739516   / (x+3)
                     +  0.00120858003  / (x+4) -  0.00000536382 / (x+5);
    return tmp + log(ser * sqrt(2*M_PI) );  
}

//result from logGamma() are actually (n-1)!
double combination(int n, int r)
{
    return exp(logGamma(n+1)-( logGamma(r+1) + logGamma(n-r+1) ));
}

//primitive hamming weight counter
int hWeight(int x)
{
    int count, y;
    for (count=0, y=x; y; count++)
        y &= y-1; 
    return count;
}

//-------------------------------------------------------------------------------------
//recursively find the previous group's "hamming weight member count" and sum them
int rCummGroupCount(int bitsize, int hw)
{
    if (hw <= 0 || hw == bitsize) 
        return 1;
    else
        return round(combination(bitsize, hw)) + rCummGroupCount(bitsize,hw-1);
}
//-------------------------------------------------------------------------------------

int main(int argc, char* argv[])
{
    int bitsize = 4, integer = 14;
    int hw = hWeight(integer);
    int groupStartIndex = rCummGroupCount(bitsize,hw-1);
    printf("bitsize: %d\n", bitsize);
    printf("integer: %d  hamming weight: %d\n", integer, hw);
    printf("group start index: %d\n", groupStartIndex);
}

output:

bitsize: 4
integer: 14 hamming weight: 3
group start index: 11

Shortie answered 27/11, 2012 at 15:35 Comment(2)

Looks good, but is there a way to do the rCummGroupCount iterative? And how precise is the Lanczos approximation formula? – Kliman 28/11, 2012 at 15:27

@ks6g10 It will take some time to look into the precision issue and recursive rCummGroupCount. Using a lookup table for factorial is much simpler though, you may look into that. – Shortie 4/12, 2012 at 9:24

Recommended topics

Hot tags