If you are going to call the distance function many times during one execution of your program, you can gain some speed by using a precomputed table of bit counts. Here's (yet another) version of the Hamming distance function:
import numpy as np

# _nbits[k] is the number of 1s in the binary representation of k for 0 <= k < 256.
_nbits = np.array(
[0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4, 1, 2, 2, 3, 2, 3, 3,
4, 2, 3, 3, 4, 3, 4, 4, 5, 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4,
4, 5, 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 1, 2, 2, 3, 2,
3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5, 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5,
4, 5, 5, 6, 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 3, 4, 4,
5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7, 1, 2, 2, 3, 2, 3, 3, 4, 2, 3,
3, 4, 3, 4, 4, 5, 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 2,
3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 3, 4, 4, 5, 4, 5, 5, 6,
4, 5, 5, 6, 5, 6, 6, 7, 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5,
6, 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7, 3, 4, 4, 5, 4, 5,
5, 6, 4, 5, 5, 6, 5, 6, 6, 7, 4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6,
7, 7, 8], dtype=np.uint8)
def hamming_distance1(a, b):
    # XOR gives a 1 bit wherever a and b differ; look up the bit count
    # of each byte in the table and sum.
    c = np.bitwise_xor(a, b)
    n = _nbits[c].sum()
    return n
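By the way, you don't have to type out the table literally; it can be computed when the module is loaded. Here's one sketch that builds an equivalent table:

# Equivalent construction of the lookup table: count the 1 bits in
# the binary representation of each byte value.
_nbits = np.array([bin(k).count("1") for k in range(256)], dtype=np.uint8)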
In the following, a and b are the Python lists of length 32 given in a comment to the question. divakar_hamming_distance() and divakar_hamming_distance_v2() are from @Divakar's answer.
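(If you want to reproduce the timings without digging the exact lists out of the comments, stand-in inputs with the same length and value range work just as well; the data below is a made-up substitute, not the original lists:)

rng = np.random.default_rng(0)
a = rng.integers(0, 256, size=32).tolist()  # stand-in for the list from the question
b = rng.integers(0, 256, size=32).tolist()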
Here are the timings of @Divakar's functions:
In [116]: %timeit divakar_hamming_distance(a, b)
The slowest run took 5.57 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 11.3 µs per loop
In [117]: %timeit divakar_hamming_distance_v2(a, b)
The slowest run took 5.35 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 10.3 µs per loop
hamming_distance1(a, b) is a bit faster:
In [118]: %timeit hamming_distance1(a, b)
The slowest run took 6.04 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 7.42 µs per loop
On my computer, initializing _nbits takes about 11 µs, so there is no advantage to using hamming_distance1 if you only call the function once. The per-call saving over @Divakar's functions is roughly 3 to 4 µs, so if you call it three or more times, there is a net gain in performance.
If the inputs are already numpy arrays, all the functions are significantly faster:
In [119]: aa = np.array(a)
In [120]: bb = np.array(b)
In [121]: %timeit divakar_hamming_distance_v2(aa, bb)
The slowest run took 8.22 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 5.72 µs per loop
In [122]: %timeit hamming_distance1(aa, bb)
The slowest run took 12.67 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 2.77 µs per loop
Of course, if you always convert the lists immediately before you compute the Hamming distance, the time for the conversion must be included in the overall timing. However, if you write the code that generates a and b to take advantage of numpy earlier, you might already have them as numpy arrays by the time you compute the Hamming distance.
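A minimal sketch of that idea: convert once, up front, and reuse the arrays (the uint8 dtype is my assumption; it guarantees the XOR result is a valid index into _nbits):

aa = np.array(a, dtype=np.uint8)   # one-time conversion
bb = np.array(b, dtype=np.uint8)
# Every subsequent call skips the list-to-array overhead:
d = hamming_distance1(aa, bb)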
(I also experimented a bit with a 2-d array of precomputed Hamming distances between 8-bit values, an array with shape (256, 256), but the initialization cost is higher and the performance gains are small.)
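For reference, that variant looks something like the following sketch, reconstructed here from the description above rather than the exact code I timed:

# _hdist2[i, j] is the Hamming distance between the byte values i and j.
_hdist2 = _nbits[np.arange(256)[:, None] ^ np.arange(256)]

def hamming_distance2(a, b):
    # Assumes a and b are uint8 arrays (values in range(256)).
    return _hdist2[a, b].sum()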