Finding Minimum hamming distance of a set of strings in python

Asked 8/7, 2014 at 5:38 Answered 8/7, 2014 at 7:34

Solved python algorithm bigdata hamming-distance

I have a set of n (~1000000) strings (DNA sequences) stored in a list trans. I have to find the minimum hamming distance of all sequences in the list. I implemented a naive brute force algorithm, which has been running for more than a day and has not yet given a solution. My code is

dmin=len(trans[0])
for i in xrange(len(trans)):
    for j in xrange(i+1,len(trans)):
            dist=hamdist(trans[i][:-1], trans[j][:-1])
            if dist < dmin:
                    dmin = dist

Is there a more efficient method to do this? Here hamdist is a function I wrote to find hamming distances. It is

def hamdist(str1, str2):
    diffs = 0
    if len(str1) != len(str2):
        return max(len(str1),len(str2))
    for ch1, ch2 in zip(str1, str2):
        if ch1 != ch2:
          diffs += 1
    return diffs

Monocotyledon answered 8/7, 2014 at 5:38 Comment(4)

There isn't much to do about the number of comparisons besides optimizing the hamming function. However, if you tell us what you're trying to achieve, perhaps there's a heuristic solution that doesn't go through all the comparisons – Causation 8/7, 2014 at 6:10

Thanks. I have to find if the minimum distance of a set of DNA sequences is above a threshold. (If it is, then I know that an estimation algorithm has returned reliable values.) – Monocotyledon 8/7, 2014 at 6:29

You could shorten your code using some itertools goodness; your nested loops can be just for s1, s2 in combinations(trans, 2). The hamdist function could use return sum(islice(1 for ch1, ch2 in izip(str1, str2) if ch1 != ch2), prevMin)) – Garretgarreth 8/7, 2014 at 8:3

@FrerichRaabe Thanks a lot. Itertools helped me speed up my implementation significantly. – Monocotyledon 8/7, 2014 at 18:21

You could optimize your hamdist function by adding an optional parameter containing the minimum distance you have got so far, this way if diffs reaches that value you stop calculating the distance because this comparison will give you a greater distance than the minimum:

def hamdist(str1, str2,prevMin=None):
    diffs = 0
    if len(str1) != len(str2):
        return max(len(str1),len(str2))
    for ch1, ch2 in zip(str1, str2):
        if ch1 != ch2:
            diffs += 1
            if prevMin is not None and diffs>prevMin:
                return None
    return diffs

You will need to adapt your main loop to work with None return value from hamdist:

dmin=len(trans[0])
for i in xrange(len(trans)):
    for j in xrange(i+1,len(trans)):
            dist=hamdist(trans[i][:-1], trans[j][:-1])
            if dist is not None and dist < dmin:
                    dmin = dist

Octonary answered 8/7, 2014 at 6:12 Comment(1)

Thanks. I had not thought of optimising hamdist(). It looks like the only thing I can do here. – Monocotyledon 8/7, 2014 at 6:29

Some ideas:

1) sklearn.metrics.hamming_loss is probably much more efficient than your implementation, even if you have to convert your strings to arrays.

2) Are all your strings unique? If so remove the duplicates.

You can also try sklearn.metrics.pairwise.pairwise_distances, for example:

In [1]: from sklearn.metrics.pairwise import pairwise_distances

In [2]: from sklearn.metrics import hamming_loss

In [3]: a = np.array([[3,4,5], [3,4,4],[3,1,1]])

In [4]: import numpy as np

In [5]: a = np.array([[3,4,5], [3,4,4],[3,1,1]])

In [6]: pairwise_distances(metric=hamming_loss)

In [7]: pairwise_distances(a, metric=hamming_loss)
Out[7]: 
array([[ 0.        ,  0.33333333,  0.66666667],
       [ 0.33333333,  0.        ,  0.66666667],
       [ 0.66666667,  0.66666667,  0.        ]])

I am not seeing a flag that would only calculate upper-triangle, but this still should be faster than looping.

Roosevelt answered 8/7, 2014 at 6:3 Comment(2)

All the strings in the array are given to be distinct. I only compare string i with string j when i < j. So I do not understand 3). Thanks. I'll look at 1). – Monocotyledon 8/7, 2014 at 6:7

Isn't the OP already looking at the upper left triangle? his second loop is for j in xrange(**i+1**,len(trans)): – Causation 8/7, 2014 at 6:8

As mentioned in this answer, there is no general way to get better than the quadratic running time. You need to exploit the structure of the data. For example, if the threshold t for maximum allowed Hamming distance is small compared to the length of the strings n (e.g. t=100, n=1000000), you can do the following: randomly select k columns (e.g. k=1000), restrict the strings to these columns, and hash them into buckets. You then need to do the pairwise comparison only within each bucket, under the assumption that the two strings with minimum Hamming distance mismatch only in nonselected columns. For the example, this is true with about 90% probability, and you can get the error probability arbitrarily low by repeating the process.

Misusage answered 8/7, 2014 at 7:34 Comment(0)

-1

find the hamming distances of all strings and store it in an array. some thing like

    distance=[]
    for i in trans:
      distance.append(hamdist(i))

then caluclate the min of them like

    minimum =min(distance)

Esoterica answered 8/7, 2014 at 5:50 Comment(1)

hamming distance is between two strings, not a property of a string – Causation 8/7, 2014 at 6:5

Recommended topics

Hot tags