Python - How to generate the Pairwise Hamming Distance Matrix

Asked 12/3, 2017 at 20:18 Answered 12/3, 2017 at 20:53

Solved python numpy vectorization hamming-distance

Beginner with Python here. I'm having trouble calculating the binary pairwise Hamming distance matrix between the rows of an input matrix using only the numpy library. I'm supposed to avoid loops and use vectorization. For instance, if I have something like:

   [ 1,  0,  0,  1,  1,  0]
   [ 1,  0,  0,  0,  0,  0]
   [ 1,  1,  1,  1,  0,  0]

The matrix should be something like:

   [ 0,  2,  3]
   [ 2,  0,  3]
   [ 3,  3,  0]

That is, if the original matrix is A and the Hamming distance matrix is B, then B[0,1] = hammingdistance(A[0], A[1]). In this case the answer is 2, as the two rows differ in exactly two positions.

So far my code is something like this:

def compute_HammingDistance(X):

     hammingDistanceMatrix = np.zeros(shape = (len(X), len(X)))
     hammingDistanceMatrix = np.count_nonzero ((X[:,:,None] != X[:,:,None].T))
     return hammingDistanceMatrix

However, it seems to return a single scalar value instead of the intended matrix. I know I'm probably doing something wrong with the array/vector broadcasting, but I can't figure out how to fix it. I've tried using np.sum instead of np.count_nonzero, but both gave me pretty much the same thing.

Satiable answered 12/3, 2017 at 20:18 Comment(0)

Try this approach: insert a new axis at position 1, let broadcasting compare every row with every other row, and then count the Trues (the non-zero entries) with sum:

import numpy as np
arr = np.array([[1, 0, 0, 1, 1, 0], [1, 0, 0, 0, 0, 0], [1, 1, 1, 1, 0, 0]])
(arr[:, None, :] != arr).sum(2)

# array([[0, 2, 3],
#        [2, 0, 3],
#        [3, 3, 0]])

def compute_HammingDistance(X):
    return (X[:, None, :] != X).sum(2)

Explanation:

1) Create a 3d array with shape (3, 1, 6):

arr[:, None, :]
#array([[[1, 0, 0, 1, 1, 0]],
#       [[1, 0, 0, 0, 0, 0]],
#       [[1, 1, 1, 1, 0, 0]]])

2) This is the original 2d array, with shape (3, 6):

arr   
#array([[1, 0, 0, 1, 1, 0],
#       [1, 0, 0, 0, 0, 0],
#       [1, 1, 1, 1, 0, 0]])

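3) Comparing the two triggers broadcasting, since their shapes don't match. The 2d array arr, of shape (3, 6), is treated as shape (1, 3, 6) and repeated along axis 0 of the 3d array arr[:, None, :], while the length-1 axis of arr[:, None, :] is repeated along axis 1, so each (1, 6) row is compared against the whole (3, 6) array. Together the two broadcasting steps make a Cartesian (all-pairs) comparison of the rows of the original array, with result shape (3, 3, 6):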

arr[:, None, :] != arr 
#array([[[False, False, False, False, False, False],
#        [False, False, False,  True,  True, False],
#        [False,  True,  True, False,  True, False]],
#       [[False, False, False,  True,  True, False],
#        [False, False, False, False, False, False],
#        [False,  True,  True,  True, False, False]],
#       [[False,  True,  True, False,  True, False],
#        [False,  True,  True,  True, False, False],
#        [False, False, False, False, False, False]]], dtype=bool)

4) The sum along the third axis (axis 2) counts how many elements are not equal, i.e. the Trues, which gives the Hamming distance for each pair of rows.
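
To make the shapes in the steps above concrete, here is a small sketch that traces them and checks each pairwise comparison. The names `lhs`, `diff`, `by_sum`, `by_count` and `total` are mine, and the loops are there only to verify the result, not as part of the solution. It also shows that np.count_nonzero gives the same matrix when an axis is passed (this needs NumPy 1.12 or later, as far as I know) and collapses to one number when it is not, which would explain the scalar from the question.

```python
import numpy as np

arr = np.array([[1, 0, 0, 1, 1, 0],
                [1, 0, 0, 0, 0, 0],
                [1, 1, 1, 1, 0, 0]])

lhs = arr[:, None, :]                        # shape (3, 1, 6)
out_shape = np.broadcast(lhs, arr).shape     # broadcast result: (3, 3, 6)

diff = lhs != arr
# diff[i, j] is the elementwise comparison of row i with row j
for i in range(arr.shape[0]):
    for j in range(arr.shape[0]):
        assert np.array_equal(diff[i, j], arr[i] != arr[j])

by_sum = diff.sum(axis=2)                    # reduce the last axis: (3, 3)
by_count = np.count_nonzero(diff, axis=2)    # same counts, axis given
total = np.count_nonzero(diff)               # no axis: a single number
```

Here `by_sum` and `by_count` hold the same 3x3 distance matrix, while `total` is just the sum of all its entries.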

Adolfo answered 12/3, 2017 at 20:29 Comment(3)

I replaced the original i had in my code with hammingDistanceMatrix = np.count_nonzero((X[:, None, :] != X).sum(2)) and it still seems to give me the same result, a single scalar value. – Satiable 12/3, 2017 at 20:48

You don't need the np.count_nonzero here; the sum does it for you. Just returning hammingDistanceMatrix = (arr[:, None, :] != arr).sum(2) should be fine. – Adolfo 12/3, 2017 at 20:49

Oh, that worked! Thanks a lot. Could you explain why the broadcast is done as arr[:,None,:] like that, and why .sum(2)? – Satiable 12/3, 2017 at 21:0

For reasons I do not understand this

(2 * np.inner(a-0.5, 0.5-a) + a.shape[1] / 2)

appears to be much faster than @Psidom's for larger arrays:

import numpy as np
from timeit import timeit

a = np.random.randint(0,2,(100,1000))
timeit(lambda: (a[:, None, :] != a).sum(2), number=100)
# 2.297890231013298
timeit(lambda: (2 * np.inner(a-0.5, 0.5-a) + a.shape[1] / 2), number=100)
# 0.10616962902713567

Psidom's is a bit faster for the very small example:

a
# array([[1, 0, 0, 1, 1, 0],
#        [1, 0, 0, 0, 0, 0],
#        [1, 1, 1, 1, 0, 0]])

timeit(lambda: (a[:, None, :] != a).sum(2), number=100)
# 0.0004370050155557692
timeit(lambda: (2 * np.inner(a-0.5, 0.5-a) + a.shape[1] / 2), number=100)
# 0.00068191799800843
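
As for why the inner-product expression gives Hamming distances at all (for 0/1 data only): each term (a[i,k] - 0.5) * (0.5 - a[j,k]) is +0.25 when the two entries differ and -0.25 when they match. With d differing positions out of n columns, the inner product is 0.25*d - 0.25*(n - d) = 0.5*d - 0.25*n, so doubling it and adding n/2 leaves d. A quick check against the broadcast version, using seeded random data that I picked purely for illustration:

```python
import numpy as np

rng = np.random.RandomState(0)
a = rng.randint(0, 2, (20, 50))          # binary matrix, 20 rows of length 50

via_broadcast = (a[:, None, :] != a).sum(2)
# Each product is +0.25 (entries differ) or -0.25 (entries match)
via_inner = 2 * np.inner(a - 0.5, 0.5 - a) + a.shape[1] / 2

same = np.allclose(via_inner, via_broadcast)
```

Note that `via_inner` comes back as a float array, so it may need a cast if integer counts are wanted.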

Update

Part of the reason appears to be floats being faster than other dtypes:

timeit(lambda: (0.5 * np.inner(2*a-1, 1-2*a) + a.shape[1] / 2), number=100)
# 0.7315902590053156
timeit(lambda: (0.5 * np.inner(2.0*a-1, 1-2.0*a) + a.shape[1] / 2), number=100)
# 0.12021801102673635
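
If the float route is the one you want, the cast can be made explicit so that integer or boolean input is converted once up front. This is only a sketch: `hamming_matrix_float` is a name I made up, it assumes the input contains nothing but 0s and 1s, and it uses the -1/+1 form of the expression above, rearranged as (n - s.s)/2.

```python
import numpy as np

def hamming_matrix_float(X):
    """Pairwise Hamming distances between the rows of a 0/1 matrix."""
    s = 2.0 * np.asarray(X, dtype=np.float64) - 1.0   # map 0/1 to -1/+1
    n = s.shape[1]
    # inner(s_i, s_j) = matches - mismatches = n - 2*d for each row pair
    d = (n - np.inner(s, s)) / 2
    return np.rint(d).astype(np.int64)                # back to integer counts

example = hamming_matrix_float([[1, 0, 0, 1, 1, 0],
                                [1, 0, 0, 0, 0, 0],
                                [1, 1, 1, 1, 0, 0]])
```

For the example matrix from the question this reproduces the same 3x3 result as the broadcast version.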

Nimiety answered 12/3, 2017 at 20:53 Comment(2)

That's a clever way of comparing elements and counting them in one go. I think the main reason floating point dtypes are faster in this case is that BLAS routines are only available for those dtypes. All other dtypes are handled by less optimized numpy C code. – Roath 12/3, 2017 at 22:23

@Roath Wow, I wasn't aware BLAS made that much of a difference! – Nimiety 13/3, 2017 at 0:17
