I am using numba 0.57.1 and I would like to exploit the native CPU popcount in my code. My existing code is too slow as I need to run it hundreds of millions of times. Here is a MWE:
import numba as nb
@nb.njit(nb.uint64(nb.uint64))
def popcount(x):
b=0
while(x > 0):
x &= x - nb.uint64(1)
b+=1
return b
print(popcount(43))
The current speed is:
%timeit popcount(255)
148 ns ± 0.369 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
I believe that it should be possible to use the native CPU popcount (_mm_popcnt_u64 as a C intrinsic) instruction using numba intrinsics but this area is new to me. The llvm intrinsic I need to use is, I think, ctpop. Assuming this is right, how can I do this?
popcount()
function using the intrinsic, but YMMV). – Doc