Calculating Hamming weight efficiently in matlab

Asked 21/6, 2009 at 22:40 Answered 27/11, 2013 at 11:34

Solved matlab bit-manipulation bitstring hammingweight

Given a MATLAB uint32 to be interpreted as a bit string, what is an efficient and concise way of counting how many nonzero bits are in the string?

I have a working, naive approach which loops over the bits, but that's too slow for my needs. (A C++ implementation using std::bitset count() runs almost instantly).

I've found a pretty nice page listing various bit counting techniques, but I'm hoping there is an easy MATLAB-esque way.

http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetNaive

Update #1

Just implemented the Brian Kernighan algorithm as follows:

w = 0;
while ( bits > 0 )
    bits = bitand( bits, bits-1 );
    w = w + 1;
end

Performance is still crappy, over 10 seconds to compute just 4096^2 weight calculations. My C++ code using count() from std::bitset does this in subsecond time.

Update #2

Here is a table of run times for the techniques I've tried so far. I will update it as I get additional ideas/suggestions.

Vectorized Scheiner algorithm                =>    2.243511 sec
Vectorized Naive bitget loop                 =>    7.553345 sec
Kernighan algorithm                          =>   17.154692 sec
length( find( bitget( val, 1:32 ) ) )        =>   67.368278 sec
nnz( bitget( val, 1:32 ) )                   =>  349.620259 sec
Justin Scheiner's algorithm, unrolled loops  =>  370.846031 sec
Justin Scheiner's algorithm                  =>  398.786320 sec
Naive bitget loop                            =>  456.016731 sec
sum(dec2bin(val) == '1')                     => 1069.851993 sec

Comment: The dec2bin() function in MATLAB seems to be very poorly implemented. It runs extremely slow.

Comment: The "Naive bitget loop" algorithm is implemented as follows:

w=0;
for i=1:32
   if bitget( val, i ) == 1
       w = w + 1;
   end
end

Comment: The loop unrolled version of Scheiner's algorithm looks as follows:

function w=computeWeight( val )
w = val;
w = bitand(bitshift(w, -1), uint32(1431655765)) + ...
    bitand(w, uint32(1431655765));

w = bitand(bitshift(w, -2), uint32(858993459)) + ...
    bitand(w, uint32(858993459));

w = bitand(bitshift(w, -4), uint32(252645135)) + ...
    bitand(w, uint32(252645135));

w = bitand(bitshift(w, -8), uint32(16711935)) + ...
    bitand(w, uint32(16711935));

w = bitand(bitshift(w, -16), uint32(65535)) + ...
    bitand(w, uint32(65535));

Bost answered 21/6, 2009 at 22:40 Comment(3)

Is it possible to make some sort of cleanup on this question? Small question and move the other things to an summary answer for example? Related question here, far easier to understand as a small. – Treacherous 7/11, 2013 at 22:44

-1 too unclear question and no improvement done despite the notice. – Treacherous 12/11, 2013 at 14:1

@kay Can you please give the code for the Vectorized version of the "Naive bitget loop" ? – Tsarevna 5/10, 2017 at 14:11

I'd be interested to see how fast this solution is:

function r = count_bits(n)

shifts = [-1, -2, -4, -8, -16];
masks = [1431655765, 858993459, 252645135, 16711935, 65535];

r = n;
for i=1:5
   r = bitand(bitshift(r, shifts(i)), masks(i)) + ...
      bitand(r, masks(i));
end

Going back, I see that this is the 'parallel' solution given on the bithacks page.

Margetts answered 22/6, 2009 at 1:23 Comment(4)

I just posted performance using your pre-edit algorithm. This was with hex2dec pre-computed. I'm going to double check whether I did everything correctly and also try your cleaned up code. – Bost 22/6, 2009 at 1:46

I think that this would be the fastest method by far for 64 bit integers. All the other methods are O(n) but this is O(logn). It would probably be significantly faster with the loop unrolled. – Margetts 22/6, 2009 at 2:8

I'm actually running a loop unrolled version right now. I am surprised by this methods poor performance in the looped version; I also thought it would be fastest. – Bost 22/6, 2009 at 2:19

Vectorizing this code for my inner loop leads to 2.25 second performance. (Same outline as with gnovice's solution). – Bost 22/6, 2009 at 2:41

EDIT: NEW SOLUTION

It appears that you want to repeat the calculation for every element in a 4096-by-4096 array of UINT32 values. If this is what you are doing, I think the fastest way to do it in MATLAB is to use the fact that BITGET is designed to operate on matrices of values. The code would look like this:

numArray = ...your 4096-by-4096 matrix of uint32 values...
w = zeros(4096,4096,'uint32');
for iBit = 1:32,
  w = w+bitget(numArray,iBit);
end

If you want to make vectorized versions of some of the other algorithms, I believe BITAND is also designed to operate on matrices.

The old solution...

The easiest way I can think of is to use the DEC2BIN function, which gives you the binary representation (as a string) of a non-negative integer:

w = sum(dec2bin(num) == '1');  % Sums up the ones in the string

It's slow, but it's easy. =)

Palingenesis answered 21/6, 2009 at 23:49 Comment(4)

The cast to double is not needed. You're technique works. Unfortunately, dec2bin() is dirt slow. I'm compiling a table of runtimes for all my approaches, and dec2bin is still running. (Well past the other techniques in terms of time). – Bost 22/6, 2009 at 0:29

No wonder... I just now realized you're repeating the calculation 4096^2 times!!! I'll have to give it more thought to see if there are faster ways to handle so many calculations in native MATLAB. – Palingenesis 22/6, 2009 at 1:47

Very nice! I actually have a pair of loops each going from 1 to 4096. I vectorized the inner loop using your technique and overall runtime is at ~7.55 sec. I did have to pass in 'uint32' as my type to zeroes(4096,1,'uint32') for MATLAB to be happy. Trying now with outer loop vectorized too. – Bost 22/6, 2009 at 2:36

It's crappy that I can only mark one "accepted" answer. You gave the vectorization idea; Scheiner gave the algorithm. Thank you very much gnovice. – Bost 22/6, 2009 at 2:55

Unless this is a MATLAB implementation exercise, you might want to just take your fast C++ implementation and compile it as a mex function, once per target platform.

Dailey answered 22/6, 2009 at 0:24 Comment(2)

Calling an external routine is pretty unappealing for my application. I'm still hoping to drop the MATLAB code run time to a few seconds. – Bost 22/6, 2009 at 0:34

I'll take your word for it as it's your application. However, in my experience the only reason not to mex-ify MATLAB code is that for complex operations it's a bit of a hassle. But once you've got it coded up, mex files work just like normal MATLAB functions and they have platform-specific file extensions, so you can just provide them all in your package and MATLAB will figure it out automatically. You can even provide a fallback MATLAB implementation for platforms you don't have compile access to. – Dailey 22/6, 2009 at 1:40

Implemented the "Best 32 bit Algorithm" from the Stanford link at the top. The improved algorithm reduced processing time by 6%. Also optimized the segment size and found that 32K is stable and improves time by 15% over 4K. Expect 4Kx4K time to be 40% of Vectorized Scheiner Algorithm.

function w = Ham(w)
% Input uint32
% Output vector of Ham wts
 for i=1:32768:length(w)
  w(i:i+32767)=Ham_seg(w(i:i+32767));
 end
end

% Segmentation gave reduced time by 50%

function w=Ham_seg(w)
 %speed
 b1=uint32(1431655765); 
 b2=uint32(858993459);
 b3=uint32(252645135);
 b7=uint32(63); % working orig binary mask

 w = bitand(bitshift(w, -1), b1) + bitand(w, b1);
 w = bitand(bitshift(w, -2), b2) + bitand(w, b2);
 w =bitand(w+bitshift(w, -4),b3);
 w =bitand(bitshift(w,-24)+bitshift(w,-16)+bitshift(w,-8)+w,b7);

end

Spinster answered 1/7, 2012 at 16:24 Comment(0)

Did some timing comparisons on Matlab Cody. Determined a Segmented Modified Vectorized Scheiner gives optimimum performance.

Have >50% time reduction based on Cody 1.30 sec to 0.60 sec change for an L=4096*4096 vector.

function w = Ham(w)
% Input uint32
% Output vector of Ham wts

 b1=uint32(1431655765); % evaluating saves 15% of time 1.30 to 1.1 sec
 b2=uint32(858993459);
 b3=uint32(252645135);
 b4=uint32(16711935);
 b5=uint32(65535);

 for i=1:4096:length(w)
  w(i:i+4095)=Ham_seg(w(i:i+4095),b1,b2,b3,b4,b5);
 end
end

% Segmentation reduced time by 50%

function w=Ham_seg(w,b1,b2,b3,b4,b5)
 % Passing variables or could evaluate b1:b5 here


 w = bitand(bitshift(w, -1), b1) + bitand(w, b1);
 w = bitand(bitshift(w, -2), b2) + bitand(w, b2);
 w = bitand(bitshift(w, -4), b3) + bitand(w, b3);
 w = bitand(bitshift(w, -8), b4) + bitand(w, b4);
 w = bitand(bitshift(w, -16), b5) + bitand(w, b5);

end





vt=randi(2^32,[4096*4096,1])-1;
% for vt being uint32 the floor function gives unexpected values
tic
v=num_ones(mod(vt,65536)+1)+num_ones(floor(vt/65536)+1); % 0.85 sec
toc
% a corrected method is
v=num_ones(mod(vt,65536)+1)+num_ones(floor(double(vt)/65536)+1);
toc

Spinster answered 1/7, 2012 at 6:12 Comment(0)

A fast approach is counting the bits in each byte using a lookup table, then summing these values; indeed, it's one of the approaches suggested on the web page given in the question. The nice thing about this approach is that both lookup and sum are vectorizable operations in MATLAB, so you can vectorize this approach and compute the hamming weight / number of set bits of a large number of bit strings simultaneously, very quickly. This approach is implemented in the bitcount submission on the MATLAB File Exchange.

Bevus answered 27/11, 2013 at 11:34 Comment(0)

Try splitting the job into smaller parts. My guess is that if you want to process all data at once, matlab is trying to do each operation on all integers before taking successive steps and the processor's cache is invalidated with each step.

for i=1:4096,
    «process bits(i,:)»
end

Kumler answered 22/6, 2009 at 1:35 Comment(0)

I'm reviving an old thread here, but I ran across this problem and I wrote this little bit of code for it:

distance = sum(bitget(bits, 1:32));

Looks pretty concise, but I'm scared that bitget is implemented in O(n) bitshift operations. The code works for what I'm going, but my problem set doesn't rely on hamming weight.

Replenish answered 29/2, 2012 at 6:7 Comment(0)

num_ones=uint8(zeros(intmax('uint32')/2^6,1));
% one time load of array not implemented here
tic
for i=1:4096*4096
 %v=num_ones(rem(i,64)+1)+num_ones(floor(i/64)+1); % 1.24 sec
 v=num_ones(mod(i,64)+1)+num_ones(floor(i/64)+1); % 1.20 sec
end
toc
tic
num_ones=uint8(zeros(65536,1));
for i=0:65535
 num_ones(i+1)=length( find( bitget( i, 1:32 ) ) ) ;
end
toc
% 0.43 sec to load
% smaller array to initialize
% one time load of array
tic
for i=1:4096*4096
 v=num_ones(mod(i,65536)+1)+num_ones(floor(i/65536)+1); %  0.95 sec
 %v=num_ones(mod(i,65536)+1)+num_ones(bitshift(i,-16)+1); % 16 sec for 4K*1K
end
toc
%vectorized
tic
num_ones=uint8(zeros(65536,1));
for i=0:65535
 num_ones(i+1)=length( find( bitget( i, 1:32 ) ) ) ;
end % 0.43 sec
toc
vt=randi(2^32,[4096*4096,1])-1;
tic
v=num_ones(mod(vt,65536)+1)+num_ones(floor(vt/65536)+1); % 0.85 sec
toc

Spinster answered 29/6, 2012 at 14:9 Comment(0)

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Recommended topics

Hot tags