I have two numpy arrays of the same length that contain binary values
import numpy as np
a=np.array([1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0])
b=np.array([1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1])
I want to compute the hamming distance between them as fast as possible since I have millions of such distance computations to make.
A simple but slow option is this (taken from wikipedia):
%timeit sum(ch1 != ch2 for ch1, ch2 in zip(a, b))
10000 loops, best of 3: 79 us per loop
I have come up with faster options, inspired by some answers here on stack overflow.
%timeit np.sum(np.bitwise_xor(a,b))
100000 loops, best of 3: 6.94 us per loop
%timeit len(np.bitwise_xor(a,b).nonzero()[0])
100000 loops, best of 3: 2.43 us per loop
I'm wondering if there are even faster ways to compute this, possibly using cython?
a
andb
the same as the lengths of your real data? – Karrahscipy.spatial.distance.cdist
orscipy.spatial.distance.pdist
– Swellfishreturn (u != v).mean()
. See also bitarray. – Sneak