Fast Hamming distance computation between binary numpy arrays

I have two numpy arrays of the same length that contain binary values:

import numpy as np
a=np.array([1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0])
b=np.array([1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1])

I want to compute the Hamming distance between them as fast as possible, since I have millions of such distance computations to make.

A simple but slow option is this (taken from Wikipedia):

%timeit sum(ch1 != ch2 for ch1, ch2 in zip(a, b))
10000 loops, best of 3: 79 us per loop

I have come up with faster options, inspired by some answers here on Stack Overflow.

%timeit np.sum(np.bitwise_xor(a,b))
100000 loops, best of 3: 6.94 us per loop

%timeit len(np.bitwise_xor(a,b).nonzero()[0])
100000 loops, best of 3: 2.43 us per loop

I'm wondering if there are even faster ways to compute this, possibly using Cython?

Selfappointed asked 23/9, 2015 at 2:51
Are the lengths of the example arrays a and b the same as the lengths of your real data?Karrah
Are you calculating all pairwise distances within an array of arrays, or between two arrays of arrays? You might be able to use scipy.spatial.distance.cdist or scipy.spatial.distance.pdist; see the sketch after these comments.Swellfish
@WarrenWeckesser they are of the same order, yes. They will be between 20 and 100 in length, depending on some parameter settings.Selfappointed
scipy/spatial/distance.py hamming(u, v): ... return (u != v).mean(). See also bitarray.Sneak
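
For the all-pairs case raised in the comments, a minimal sketch of the pdist route (the array shape here is made up for illustration; scipy's hamming metric returns fractions of mismatched positions, so multiply by the vector length to get counts):

import numpy as np
from scipy.spatial.distance import pdist

X = np.random.randint(0, 2, (1000, 21))  # 1000 binary vectors of length 21

# condensed array of all pairwise Hamming distances within X, as raw counts
d = pdist(X, metric='hamming') * X.shape[1]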

There is a ready-made numpy function that beats len((a != b).nonzero()[0]) ;)

np.count_nonzero(a!=b)
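
For instance, on the arrays from the question (a quick check; timings will vary by platform):

import numpy as np

a = np.array([1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0])
b = np.array([1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1])

print(np.count_nonzero(a != b))  # prints 7
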
Birkle answered 23/9, 2015 at 4:56

Compared to 1.07µs for np.count_nonzero(a!=b) on my platform, gmpy2.hamdist gets it down to about 143ns after conversion of each array to an mpz (multiple-precision integer):

import numpy as np
from gmpy2 import mpz, hamdist, pack

a = np.array([1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0])
b = np.array([1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1])

Based on a tip from @casevh, conversion from a 1D array of ones and zeros to a gmpy2 mpz object can be done reasonably efficiently with gmpy2.pack(list(reversed(list(array))),1).

# gmpy2.pack puts a[0] in the least-significant bit, so the bit order
# is reversed relative to the array; that does not affect hamdist
# since both arguments are reversed the same way
ampz = pack(list(a),1)  # takes about 4.29µs
bmpz = pack(list(b),1)

hamdist(ampz,bmpz)
Out[8]: 7

%timeit hamdist(ampz,bmpz)
10000000 loops, best of 3: 143 ns per loop

For relative comparison, on my platform:

%timeit np.count_nonzero(a!=b)
1000000 loops, best of 3: 1.07 µs per loop

%timeit len((a != b).nonzero()[0])
1000000 loops, best of 3: 1.55 µs per loop

%timeit len(np.bitwise_xor(a,b).nonzero()[0])
1000000 loops, best of 3: 1.7 µs per loop

%timeit np.sum(np.bitwise_xor(a,b))
100000 loops, best of 3: 5.8 µs per loop   
Roundabout answered 23/9, 2015 at 5:1
To be fair, you should probably include the time it takes to convert the input arrays to mpz format.Karrah
You can use gmpy2.pack(list(a),1) to convert the numpy array to an mpz. It is faster than convert2mpz(). If you include the conversion time, it will still be slower than the numpy solutions.Micmac
@WarrenWeckesser: I thought about that and sort of agree. What bothers me is that numpy data is obviously in the optimal format for a numpy solution, while most Hamming distance algorithms in C, which take numeric input of some kind, operate at the bit level. It seems to me that anyone serious about fast Hamming distance computation wouldn't use an array to represent a sequence of bits, since a bit sequence is really just one number. The purpose of my answer is to provide a data point for the Hamming distance computation alone, using a Python module with reasonably well-written C code.Roundabout
@casevh: Thanks for the tip. Found it necessary to use gmpy2.pack(list(reversed(list(a))),1) which takes about 5.47µs on my platform.Roundabout
You do need to use reversed() if you want to construct the same mpz as your original code. However, the Hamming distance doesn't depend on the order of the bits (i.e. high-to-low versus low-to high). As long as the two arrays have the same length so that the same bit positions are compared against each other, the Hamming distance will be the same.Micmac
gmpy2 is THE way to go. The time dropped from 1.5 µs per hash to 0.21 µs per hash (211 ns), a 7x speedup: only 5.44 s on a 27M-item list. Measured on an array of 16x16 boolean arrays vs. a Python list of gmpy2 mpz's built from 256-bit binary values.Denominate
Anyone know if something's changed since this was posted? When I try to use pack after copy-pasting the imports and definitions of a & b in this post, I get the error: TypeError: pack() requires list elements be positive integers < 2^n bitsCondolent
I also have the same problem as @DanielCrane.Lazarolazaruk
For me it worked after changing list(a) to a.tolist().Torque

Using Pythran can bring extra benefit here:

$ cat hamm.py
#pythran export hamm(int[], int[])
from numpy import nonzero
def hamm(a,b):
    return len(nonzero(a != b)[0])

As a reference (without Pythran):

$ python -m timeit -s 'import numpy as np; a = np.random.randint(0,2, 100); b = np.random.randint(0,2, 100); from hamm import hamm' 'hamm(a,b)'
100000 loops, best of 3: 4.66 usec per loop

And after Pythran compilation:

$ python -m pythran.run hamm.py
$ python -m timeit -s 'import numpy as np; a = np.random.randint(0,2, 100); b = np.random.randint(0,2, 100); from hamm import hamm' 'hamm(a,b)'
1000000 loops, best of 3: 0.745 usec per loop

That's roughly a 6x speedup over the numpy implementation, as Pythran skips the creation of an intermediate array when evaluating the element-wise comparison.

I also measured:

#pythran export hamm(int[], int[])
from numpy import count_nonzero
def hamm(a,b):
    return count_nonzero(a != b)

And I get 3.11 usec per loop for the Python version and 0.427 usec per loop with the Pythran one.

Disclaimer: I'm one of the Pythran devs.

Scouring answered 23/9, 2015 at 5:41 Comment(0)

For strings, a plain loop like this can be faster:

def Hamm(a, b):
    # len() works for both strings and 1-D arrays
    c = 0
    for i in range(len(a)):
        if a[i] != b[i]:
            c += 1
    return c
Vallo answered 29/10, 2020 at 11:53

I suggest you convert the numpy bit array into a numpy uint8 array using np.packbits.
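
For example, a minimal sketch with the array from the question:

import numpy as np

a = np.array([1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0])

# pack 8 bits per byte; 21 bits become 3 uint8 values
# (np.packbits zero-pads the last byte)
a_packed = np.packbits(a)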

Have a look at scipy's spatial.distance.hamming: https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.hamming.html
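
Note that scipy's hamming returns the fraction of mismatched positions rather than a count, so scale by the length to get the raw distance; a small sketch:

import numpy as np
from scipy.spatial.distance import hamming

a = np.array([1, 0, 1, 1])
b = np.array([1, 1, 0, 1])

count = hamming(a, b) * len(a)  # 0.5 * 4 = 2.0 differing positions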

Otherwise, here is a small snippet which needs only numpy, inspired by Fast way of counting non-zero bits in positive integer:

# lookup table: number of set bits in each possible byte value
bit_counts = np.array([bin(x).count("1") for x in range(256)]).astype(np.uint8)

def hamming_dist(a, b, axis=None):
    # XOR the packed bytes, then sum the per-byte popcounts
    return np.sum(bit_counts[np.bitwise_xor(a, b)], axis=axis)

With axis=-1, this allows taking the Hamming distance between one entry and a large array; e.g.:

inp = np.uint8(np.random.random((512,8))*255)  # 512 entries of 8 bytes each
hd = hamming_dist(inp, inp[123], axis=-1)  # 512 hamming distances to entry 123
idx_best = np.argmin(hd)  # should point to entry 123 itself
hd[123] = 255  # mask out the identity
idx_nearest = np.argmin(hd)  # entry with the shortest distance to entry 123
dist_hist = np.bincount(np.uint8(hd))  # distribution of hamming distances; for me this ranged from 18 to 44 bits with a maximum at 31
Oenomel answered 26/11, 2020 at 14:5
