The built-in int is pretty nicely optimized, and it already supports &, |, and <<.
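All the usual C-style bit-fiddling idioms work directly on an int; here's a quick sketch, with the bit positions made up just for illustration:

bits = 0
bits |= 1 << 5                  # set bit 5
bits |= 1 << 12                 # set bit 12
print(bool(bits & (1 << 5)))    # test bit 5 -> True
bits &= ~(1 << 5)               # clear bit 5
print(bin(bits << 2))           # shift everything left by 2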
There's at least one alternative implementation of arbitrary-length integers, based on GMP, known as gmpy2. (There's also the original gmpy, PyGMP, Sophie, and a few other wrappers around the same library, but I doubt they'd have any real performance differences.)
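A gmpy2.mpz mostly drops in anywhere an int would go; here's a rough sketch with arbitrary values:

import gmpy2

a = gmpy2.mpz(0b101100)
b = gmpy2.mpz(0b010110)
print(bin(a & b), bin(a | b), bin(a << 3))   # same operators as int
print(gmpy2.bit_test(a, 2))                  # GMP's own bit helpers are exposed too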
And there are two major implementations of the "bit array" idea, bitarray (the one you linked) and bitstring, as well as a few libraries like intbitset that give you a set-like interface instead (which should also work for your uses).
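To give a feel for the bit-array style of interface, here's a rough bitarray sketch (the sizes and positions are arbitrary):

import bitarray

ba = bitarray.bitarray(64)     # 64 uninitialized bits
ba.setall(False)
ba[5] = True                   # set bit 5
print(ba[5], ba.count())       # test a bit, count set bits
other = bitarray.bitarray(64)
other.setall(True)
print((ba & other).count(), (ba | other).count())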
So, let's toss them all together and compare:
import random
import struct
import timeit

import bitarray
import bitstring
import gmpy2

# A random value just above 2**31, stored four different ways.
n = random.randrange((1<<31)+1, 1<<32)
bs = bitstring.pack('<q', n)            # bitstring
ba = bitarray.bitarray(64)              # bitarray
ba.frombytes(struct.pack('<q', n))
gm = gmpy2.mpz(n)                       # gmpy2
py = n                                  # plain int

for x in 'bs', 'ba', 'gm', 'py':
    def f(x=locals()[x]): x | x; x & x
    t = timeit.timeit(f, number=10000)
    print(x, t)
On my Mac, running Python.org 64-bit CPython 3.3.2, here's what I get:
bs 0.7623525890521705
ba 0.006623028079047799
gm 0.0016346259508281946
py 0.002280334010720253
That seems to be enough to summarily dismiss bitstring.
Also, I didn't show intbitset here, because neither it nor any similar library I found works with Python 3; but from a variety of comparisons (converting with intbitset.intbitset([i for i, bit in enumerate(bin(n)[2:]) if bit != '0'])), it's anywhere from 14 to 70 times slower than int in every test I threw at it, so I've summarily dismissed it as well.
So let's try the other three with more reps:
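The only change is timeit's number argument; something like this, where the exact repetition count is my guess from the ratios:

for x in 'ba', 'gm', 'py':
    def f(x=locals()[x]): x | x; x & x
    t = timeit.timeit(f, number=10000000)   # 1000x more reps than before (a guess)
    print(x, t)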
ba 6.385123810963705
gm 1.5937359740491956
py 2.129726824001409
And the numbers hold up. bitarray is nowhere near as fast as built-in int. It's also more clumsy to use (note that what should be an "alternate constructor" classmethod is an instance method, there's no fast and easy way to convert from or to an int, and the reason I was only testing x | x and x & x is that there are limitations on the operators). If you need an integer as an array of bits, it's great; if you need to do C-style bit-munging on an integer, that's not what it's best at.
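For instance, round-tripping between an int and a bitarray has to go through bytes; here's a minimal sketch for 64-bit values (the helper names are just for illustration):

import struct
import bitarray

def int_to_bitarray(n):
    ba = bitarray.bitarray()
    ba.frombytes(struct.pack('<Q', n))
    return ba

def bitarray_to_int(ba):
    return struct.unpack('<Q', ba.tobytes())[0]

ba = int_to_bitarray(0xDEADBEEF)
print(hex(bitarray_to_int(ba)))   # 0xdeadbeef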
As for gmpy2, it seems to be about a third faster than the native int. What if we make the numbers a whole lot bigger, like 1.5kbits?
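Something along these lines (the exact setup here is a guess):

# A random value that actually uses all 1536 bits.
n = random.getrandbits(1536) | (1 << 1535)
gm = gmpy2.mpz(n)
py = n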
gm 0.19562570203561336
py 0.29293217696249485
And here are the numbers with Apple Python 2.7.5:
('gm', 0.2890629768371582)
('py', 0.36592698097229004)
So, is it worth it? It's less friendly to use, it's slower rather than faster on some other operations that you didn't ask about, it requires a third-party C library which is LGPL-licensed, it has much worse error-handling behavior in memory overflow cases (as in, your app may segfault instead of raising), and there's at least one person on StackOverflow who is going to show up and tell you that GMP sucks whenever it gets mentioned (I believe because of a bug in an older version).
But if you need that extra speed, maybe it's worth it.
On the other hand, here's PyPy, 3.2.3/2.1.0b1 and PyPy 2.7.3/2.1.0, using the native int type; compare with the 0.19562570203561336 and 0.2890629768371582 results from gmpy2 above:
py 0.2135779857635498
('py', 0.20878291130065918)
So, switching from CPython to PyPy gives you almost as much benefit as switching from int to gmpy2.mpz in Python 3, and significantly more benefit in Python 2. And it will almost certainly accelerate the rest of your code as well.
Comments:

The set type is already implemented using a bloom filter. Might it be sufficient for your needs? – Spectroradiometer

An np.ndarray(dtype=np.bool_) should be a lot faster than bitarray, but page or cache misses could easily cancel out the benefits. That one probably isn't even worth testing without knowing what sizes you're actually interested in. – Tobytobye
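For what it's worth, the NumPy approach from that last comment would look roughly like this (untested against the benchmarks above; the sizes are placeholders):

import struct
import numpy as np

a = np.zeros(64, dtype=np.bool_)
a[5] = True                           # set one bit
b = np.ones(64, dtype=np.bool_)
print((a & b).sum(), (a | b).sum())   # & and | are elementwise on bool arrays

# Getting an int's bits into an array means unpacking its bytes:
n = 0xDEADBEEF
bits = np.unpackbits(np.frombuffer(struct.pack('<Q', n), dtype=np.uint8)).astype(np.bool_)
print(bits.sum())                     # number of set bits in n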