How to use native popcount with numba
Asked Answered
C

1

3

I am using numba 0.57.1 and I would like to exploit the native CPU popcount in my code. My existing code is too slow as I need to run it hundreds of millions of times. Here is a MWE:

import numba as nb

@nb.njit(nb.uint64(nb.uint64))
def popcount(x): 
      b=0
      while(x > 0):
          x &= x - nb.uint64(1)   
          b+=1
      return b
    
    
print(popcount(43))

The current speed is:

%timeit popcount(255)
148 ns ± 0.369 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

I believe that it should be possible to use the native CPU popcount (_mm_popcnt_u64 as a C intrinsic) instruction using numba intrinsics but this area is new to me. The llvm intrinsic I need to use is, I think, ctpop. Assuming this is right, how can I do this?

Crescendo answered 14/9, 2023 at 8:0 Comment(0)
D
4

Try:

import numba as nb
from numba import types
from numba.cpython import mathimpl
from numba.extending import intrinsic


@intrinsic
def popcnt(typingctx, src):
    sig = types.uint64(types.uint64)

    def codegen(context, builder, signature, args):
        return mathimpl.call_fp_intrinsic(builder, "llvm.ctpop.i64", args)

    return sig, codegen


@nb.njit(nb.uint64(nb.uint64))
def popcount(x):
    return popcnt(x)


print(popcount(43))

Prints:

4
Doc answered 14/9, 2023 at 8:52 Comment(5)
@Crescendo But, running a benchmark between normal njit function and function using intrinsic, the results are the same (I'm suspecting llvm already optimizes the popcount() function using the intrinsic, but YMMV).Doc
Hmm... that is mysterious as the intrinsics version should just be using one CPU instruction, not a loop as in my posted code. You can see the assembly with popcount.inspect_asm() but I can't interpret it.Crescendo
@Crescendo Yes, but LLVM compiler is so advanced, that it analyzes your function and figures out it's just popcount, so it replaces it with intrinsic. You can compile the functions with AOT and debug them (to see what actually happens): numba.pydata.org/numba-doc/dev/user/…Doc
This is the assembly for the loop version bpa.st/66DQ . It does include popcntq %rdx, %rax in it! Specifically the code the runs seems to actually be " popcntq %rdx, %rax movq %rax, (%rdi) xorl %eax, %eax retq"Crescendo
The two pieces of code do genuinely seem to be compiled to the same assembly by numba which is borderline miraculous and also frustrating that I can't speed it up.Crescendo

© 2022 - 2025 — McMap. All rights reserved.