Finding the minimum Hamming distance in less than O(n^2m) time
Asked Answered
S

2

9

If you have n binary strings, each of length m, is there a faster way to determine the minimum Hamming distance between any pair than to compare all O(n^2) pairs and for each to compute their Hamming distance?

That is can it be done in less than O(n^2m) time?

Apart from anything else and as commented below, the Hamming distance is a proper distance function and so satisfies the triangle inequality, which makes me feel there should be a faster solution.

Selfknowledge answered 14/4, 2018 at 9:16 Comment(6)
There are plenty of papers and other sources on the www. Try google! What have you tried so far?Diverge
@Diverge I have tried a web search for "minimum hamming distance" but with no luck so far.Selfknowledge
It's a distance so it verifies d(a,c) ≤ d(a,b)+d(b,c), and that could be certainly used not to test every pair.Dysarthria
This may also helps you (you have to find a good ordering): #38900504Dysarthria
@MrSmith42: n binary strings have (n-1)*n/2 Hamming distances, as the Hamming distance is defined between two strings. The efficient nearest neighbor search in such a Hamming space is not trivial. My Google search at least has not been fruitful so far.Rattlehead
You could construct a kd-tree and (for modest values of m) use a lookup-table to determine the distance between two strings. This leads to complexity O(n log n).Rattlehead
A
5

Consider using Locality Sensitive Hashing, which is a general technique that can be applied to certain distance metrics including Hamming distance. Excerpt from Wikipedia:

LSH hashes input items so that similar items map to the same “buckets” with high probability (the number of buckets being much smaller than the universe of possible input items).

In short, you can use LSH to obtain buckets, brute-force Hamming distances within each bucket, and output the smallest distance found. To obtain the right answer with higher probability, you can tweak parameters of the LSH algorithm and/or run LSH multiple times (to get different allocations of items to buckets). I believe you can get arbitrarily close to the correct (optimal) answer with a failure rate exponentially-decreasing in runtime. (You might have to binary search over the LSH parameters, if your Hamming distances are all very close, but you'll still avoid computing n^2 Hamming distances.)

The algorithm and analysis are pretty involved, so I don't think I can write a complete summary here at the moment (it's about 2-3 hours worth of lecture material). I recommend taking a look at the lecture notes/slides here, here, and here; they all cover LSH (in varying degrees of detail) with some mention of Hamming distance.

Ancestress answered 20/4, 2018 at 20:18 Comment(0)
N
-3

It's not possible to determine the true minimum without performing a full search with O(n^2m). All faster variants will yield only a maybe best minimum.

Kind of proof:

1. Assume there would be a faster solution.
2. Then for one or more combinations the hamming distance is not computed.
3. Omitting a combination means, that there is a criteria to decide
   the combination can't be better than the current best minimum.
4. There is no know criteria.

The triangle inequality is unfortunately helping only to shorten the computation of the true maximum:

  1. compute the distances Di0, sort them and pick a starting maximum.
  2. now omit all Dij where Di0 + D0j <= current maximum.
Now answered 23/4, 2018 at 7:47 Comment(9)
Do you have justification for your first sentence?Ancestress
I also wonder that, and it seems easy to find examples where you don't have to compare all strings. If the "first" two strings have a hamming distance ==1 you only need to verify that there are no duplicate strings, which should be O(n*m). I don't know if that can be generalized.Airspeed
What if the last comparison yields the hamming distance == 1? An if-I-am-lucky example is not a general rule. Depending on further constraints faster solutions are possible. But without any constrain there is no way to deduce an lower limit for Dij. It's an upper limit and this only helps finding the maximum.Now
Your "kind of proof" seems incredibly shaky -- I don't see how you come to the conclusion that there is no criteria that lets you safely skip a distance computation.Ancestress
No, that's not how it works. You are claiming that no such criteria exists. The burden of proof is on you to disprove the existence of such criteria.Ancestress
First provide the proof that God either exists or not. Your choice. How do you prof that's impossible to compute the division x=a/b by only using the plus operator?Now
I don't see how being pedantic over something the poster already qualified with "kind of" makes this answer incorrect. If someone has an actual O(n^2m) answer that isn't a "maybe best" then please, post it.Feeder
@cz I have no problem with a "kind of" proof -- my issue with this answer is that the first sentence ("it's not possible...") is stated as a fact, which I think is likely false.Ancestress
@pkpnd it is known to compute a division if you have sub or add+neg. But with only add....Now

© 2022 - 2024 — McMap. All rights reserved.