Fast Hamming distance computation between binary numpy arrays

I have two numpy arrays of the same length that contain binary values:

import numpy as np
a=np.array([1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0])
b=np.array([1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1])

I want to compute the Hamming distance between them as fast as possible, since I have millions of such distance computations to make.

A simple but slow option is this (taken from Wikipedia):

%timeit sum(ch1 != ch2 for ch1, ch2 in zip(a, b))
10000 loops, best of 3: 79 us per loop

I have come up with faster options, inspired by some answers here on Stack Overflow.

%timeit np.sum(np.bitwise_xor(a,b))
100000 loops, best of 3: 6.94 us per loop

%timeit len(np.bitwise_xor(a,b).nonzero()[0])
100000 loops, best of 3: 2.43 us per loop

I'm wondering if there are even faster ways to compute this, possibly using Cython?

Selfappointed asked 23/9, 2015 at 2:51
Are the lengths of the example arrays a and b the same as the lengths of your real data?Karrah
Are you calculating all pairwise distances within an array of arrays, or between two arrays of arrays? You might be able to use scipy.spatial.distance.cdist or scipy.spatial.distance.pdist; see the sketch after these comments.Swellfish
@WarrenWeckesser they are of the same order, yes. They will be between 20 and 100 in length, depending on some parameter settings.Selfappointed
scipy/spatial/distance.py hamming(u, v): ... return (u != v).mean(). See also bitarray.Sneak
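
For the all-pairs case raised in the comments, a minimal sketch of the pdist route (the array shape here is made up for illustration; scipy's hamming metric returns fractions of mismatched positions, so multiply by the vector length to get counts):

import numpy as np
from scipy.spatial.distance import pdist

X = np.random.randint(0, 2, (1000, 21))  # 1000 binary vectors of length 21

# condensed array of all pairwise Hamming distances within X, as raw counts
d = pdist(X, metric='hamming') * X.shape[1]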

There is a ready-made numpy function that beats len((a != b).nonzero()[0]) ;)

np.count_nonzero(a!=b)
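
For instance, on the arrays from the question (a quick check; timings will vary by platform):

import numpy as np

a = np.array([1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0])
b = np.array([1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1])

print(np.count_nonzero(a != b))  # prints 7
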
Birkle answered 23/9, 2015 at 4:56

Compared to 1.07µs for np.count_nonzero(a!=b) on my platform, gmpy2.hamdist gets it down to about 143ns after conversion of each array to an mpz (multiple-precision integer):

import numpy as np
from gmpy2 import mpz, hamdist, pack

a = np.array([1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0])
b = np.array([1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1])

Based on a tip from @casevh, conversion from a 1D array of ones and zeros to a gmpy2 mpz object can be done reasonably efficiently with gmpy2.pack(list(reversed(list(array))),1).

# gmpy2.pack puts a[0] in the least-significant bit, so the bit order
# is reversed relative to the array; that does not affect hamdist
# since both arguments are reversed the same way
ampz = pack(list(a),1)  # takes about 4.29µs
bmpz = pack(list(b),1)

hamdist(ampz,bmpz)
Out[8]: 7

%timeit hamdist(ampz,bmpz)
10000000 loops, best of 3: 143 ns per loop

For relative comparison, on my platform:

%timeit np.count_nonzero(a!=b)
1000000 loops, best of 3: 1.07 µs per loop

%timeit len((a != b).nonzero()[0])
1000000 loops, best of 3: 1.55 µs per loop

%timeit len(np.bitwise_xor(a,b).nonzero()[0])
1000000 loops, best of 3: 1.7 µs per loop

%timeit np.sum(np.bitwise_xor(a,b))
100000 loops, best of 3: 5.8 µs per loop   
Roundabout answered 23/9, 2015 at 5:1
To be fair, you should probably include the time it takes to convert the input arrays to mpz format.Karrah
You can use gmpy2.pack(list(a),1) to convert the numpy array to an mpz. It is faster than convert2mpz(). If you include the conversion time, it will still be slower than the numpy solutions.Micmac
@WarrenWeckesser: I thought about that and sort of agree. What bothers me is that numpy data is obviously in the optimal format for a numpy solution, while most Hamming distance algorithms in C, which take numeric input of some kind, operate at the bit level. It seems to me that anyone serious about fast Hamming distance computation wouldn't use an array to represent a sequence of bits, since a bit sequence is really just one number. The purpose of my answer is to provide a data point for the Hamming distance computation alone, using a Python module with reasonably well-written C code.Roundabout
@casevh: Thanks for the tip. Found it necessary to use gmpy2.pack(list(reversed(list(a))),1) which takes about 5.47µs on my platform.Roundabout
You do need to use reversed() if you want to construct the same mpz as your original code. However, the Hamming distance doesn't depend on the order of the bits (i.e. high-to-low versus low-to high). As long as the two arrays have the same length so that the same bit positions are compared against each other, the Hamming distance will be the same.Micmac
gmpy2 is THE way to go. The time dropped from 1.5 µs per hash to 0.21 µs per hash (211 ns), a 7x speedup: only 5.44 s on a 27M-item list. Measured on an array of 16x16 boolean arrays vs. a Python list of gmpy2 mpz's built from 256-bit binary values.Denominate
Anyone know if something's changed since this was posted? When I try to use pack after copy-pasting the imports and definitions of a & b in this post, I get the error: TypeError: pack() requires list elements be positive integers < 2^n bitsCondolent
I also have the same problem as @DanielCrane.Lazarolazaruk
For me it worked after changing list(a) to a.tolist().Torque

Using Pythran can bring extra benefit here:

$ cat hamm.py
#pythran export hamm(int[], int[])
from numpy import nonzero
def hamm(a,b):
    return len(nonzero(a != b)[0])

As a reference (without Pythran):

$ python -m timeit -s 'import numpy as np; a = np.random.randint(0,2, 100); b = np.random.randint(0,2, 100); from hamm import hamm' 'hamm(a,b)'
100000 loops, best of 3: 4.66 usec per loop

And after Pythran compilation:

$ python -m pythran.run hamm.py
$ python -m timeit -s 'import numpy as np; a = np.random.randint(0,2, 100); b = np.random.randint(0,2, 100); from hamm import hamm' 'hamm(a,b)'
1000000 loops, best of 3: 0.745 usec per loop

That's roughly a 6x speedup over the numpy implementation, as Pythran skips the creation of an intermediate array when evaluating the element-wise comparison.

I also measured:

#pythran export hamm(int[], int[])
from numpy import count_nonzero
def hamm(a,b):
    return count_nonzero(a != b)

And I get 3.11 usec per loop for the Python version and 0.427 usec per loop with the Pythran one.

Disclaimer: I'm one of the Pythran devs.

Scouring answered 23/9, 2015 at 5:41 Comment(0)

For strings, a plain loop like this can be faster:

def Hamm(a, b):
    # len() works for both strings and 1-D arrays
    c = 0
    for i in range(len(a)):
        if a[i] != b[i]:
            c += 1
    return c
Vallo answered 29/10, 2020 at 11:53

I suggest you convert the numpy bit array into a numpy uint8 array using np.packbits.
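
For example, a minimal sketch with the array from the question:

import numpy as np

a = np.array([1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0])

# pack 8 bits per byte; 21 bits become 3 uint8 values
# (np.packbits zero-pads the last byte)
a_packed = np.packbits(a)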

Have a look at scipy's spatial.distance.hamming: https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.hamming.html
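
Note that scipy's hamming returns the fraction of mismatched positions rather than a count, so scale by the length to get the raw distance; a small sketch:

import numpy as np
from scipy.spatial.distance import hamming

a = np.array([1, 0, 1, 1])
b = np.array([1, 1, 0, 1])

count = hamming(a, b) * len(a)  # 0.5 * 4 = 2.0 differing positions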

Otherwise, here is a small snippet which needs only numpy, inspired by Fast way of counting non-zero bits in positive integer:

# lookup table: number of set bits in each possible byte value
bit_counts = np.array([bin(x).count("1") for x in range(256)]).astype(np.uint8)

def hamming_dist(a, b, axis=None):
    # XOR the packed bytes, then sum the per-byte popcounts
    return np.sum(bit_counts[np.bitwise_xor(a, b)], axis=axis)

With axis=-1, this allows taking the Hamming distance between one entry and a large array; e.g.:

inp = np.uint8(np.random.random((512,8))*255)  # 512 entries of 8 bytes each
hd = hamming_dist(inp, inp[123], axis=-1)  # 512 hamming distances to entry 123
idx_best = np.argmin(hd)  # should point to entry 123 itself
hd[123] = 255  # mask out the identity
idx_nearest = np.argmin(hd)  # entry with the shortest distance to entry 123
dist_hist = np.bincount(np.uint8(hd))  # distribution of hamming distances; for me this ranged from 18 to 44 bits with a maximum at 31
Oenomel answered 26/11, 2020 at 14:5
