I have a set of n (~1000000) strings (DNA sequences) stored in a list trans. I have to find the minimum hamming distance of all sequences in the list. I implemented a naive brute force algorithm, which has been running for more than a day and has not yet given a solution. My code is
dmin=len(trans[0])
for i in xrange(len(trans)):
for j in xrange(i+1,len(trans)):
dist=hamdist(trans[i][:-1], trans[j][:-1])
if dist < dmin:
dmin = dist
Is there a more efficient method to do this? Here hamdist is a function I wrote to find hamming distances. It is
def hamdist(str1, str2):
diffs = 0
if len(str1) != len(str2):
return max(len(str1),len(str2))
for ch1, ch2 in zip(str1, str2):
if ch1 != ch2:
diffs += 1
return diffs
itertools
goodness; your nested loops can be justfor s1, s2 in combinations(trans, 2)
. Thehamdist
function could usereturn sum(islice(1 for ch1, ch2 in izip(str1, str2) if ch1 != ch2), prevMin))
– Garretgarreth