Sorting strings so that Hamming distance is low between adjacent strings
Problem:

I have N (~100k-1m) strings, each D (e.g. 2000) characters long, over a small alphabet (e.g. 3 possible characters). I would like to sort these strings such that there are as few changes as possible between adjacent strings (i.e. the Hamming distance is low). The solution doesn't have to be the best possible, but the closer the better.

Example

N=4
D=5
//initial strings
1. aaacb
2. bacba
3. acacb
4. cbcba

//sorted so that hamming distance between adjacent strings is low
1. aaacb
3. acacb (Hamming distance 1->3 = 1)
4. cbcba (Hamming distance 3->4 = 5)
2. bacba (Hamming distance 4->2 = 2)

Thoughts about the problem

I have a bad feeling this is a non-trivial problem. If we think of each string as a node and the distances to other strings as edges, then we are looking at a travelling salesman problem. The large number of strings means that calculating all of the pairwise distances beforehand is potentially infeasible, which I think turns the problem into something more like the Canadian Traveller Problem.

At the moment my solution has been to use a VP tree to find a greedy nearest-neighbour-type solution to the problem:

curr_string = a randomly chosen string from the full set
tree.remove(curr_string)
sorted_list.add(curr_string)
while(tree not empty)
    found_string = find string nearest to curr_string in tree
    tree.remove(found_string)
    sorted_list.add(found_string)
    curr_string = found_string

but initial results appear to be poor. Hashing the strings so that similar ones land close together (i.e. locality-sensitive hashing) might be another option, but I know little about how good a solution this would provide or how well it would scale to data of this size.
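For what it's worth, a minimal brute-force version of this greedy ordering (no VP tree, so O(N^2 * D) and only practical for small N or a subsample) might look like the following Python sketch:

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def greedy_order(strings):
    remaining = list(strings)
    order = [remaining.pop(0)]  # arbitrary start; a random choice works too
    while remaining:
        # scan for the unused string closest to the end of the chain;
        # this linear scan is what the VP tree query replaces
        i = min(range(len(remaining)),
                key=lambda j: hamming(order[-1], remaining[j]))
        order.append(remaining.pop(i))
    return order

print(greedy_order(["aaacb", "bacba", "acacb", "cbcba"]))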

Philosophy answered 28/12, 2011 at 13:22
Even if you consider this problem as similar to the travelling salesman problem (TSP), Hamming distance is a metric and satisfies the triangle inequality (Hamming(A,C) ≤ Hamming(A,B) + Hamming(B,C)), so you're only really dealing with ∆TSP (the metric travelling salesman problem), for which there are a number of algorithms that give good approximations to the optimal result. In particular, the Christofides algorithm will always give you a tour of at most 1.5x the minimum possible length.
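If N is small enough to build the full distance matrix (so not the 1M case, but fine for a subsample), networkx ships a Christofides implementation. A sketch, assuming networkx >= 2.6; since Christofides returns a closed tour, it cuts the heaviest edge to turn the cycle into a path:

import itertools
import networkx as nx
from networkx.algorithms.approximation import christofides

strings = ["aaacb", "bacba", "acacb", "cbcba"]

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

# complete graph weighted by pairwise Hamming distance
G = nx.Graph()
for i, j in itertools.combinations(range(len(strings)), 2):
    G.add_edge(i, j, weight=hamming(strings[i], strings[j]))

tour = christofides(G)  # closed tour of weight <= 1.5x the optimal tour
if tour[0] != tour[-1]:
    tour.append(tour[0])

# open the cycle into a path by cutting its heaviest edge
edges = list(zip(tour, tour[1:]))
k = max(range(len(edges)),
        key=lambda e: G[edges[e][0]][edges[e][1]]["weight"])
path = [v for _, v in edges[k:]] + [v for _, v in edges[:k]]
print([strings[i] for i in path])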

Pictish answered 5/1, 2012 at 7:19
Yes, this is a travelling salesman problem, but I don't know if any of the dozen programs under TSP source code library can do 1M points straight up, with a plug-in metric.

A possible 2-stage approach:

1) split the 1M points into 50 clusters with a nearest-neighbour search. Do TSP on the 50 cluster centres.

2) put each of the remaining 1M - 50 points between its 2 nearest centres; do TSP on each resulting chain of ~ 1M / 50 points. Here "50" could be 100 or 1000. If 1000 is too big, recurse: split 1000 into 30 clusters of ~ 30 each.

K-means can cluster 1M points, but again I don't know of a fast implementation with a plug-in metric. See, however, scikit-learn clustering.
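One way to get a Hamming-like metric out of a stock k-means: one-hot encode each character (a -> 100, b -> 010, c -> 001), since the squared Euclidean distance between two one-hot-encoded strings is exactly 2x their Hamming distance. A sketch of the clustering stage with scikit-learn's MiniBatchKMeans on made-up data (the sizes here are illustrative, scaled down to fit in memory):

import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)
N, D = 10_000, 2000                      # scale N up as memory allows
codes = rng.integers(0, 3, size=(N, D))  # strings over {a, b, c} as 0/1/2

# one-hot encode: squared Euclidean distance between rows = 2 * Hamming distance,
# so k-means on this representation effectively clusters by Hamming distance
onehot = np.eye(3, dtype=np.float32)[codes].reshape(N, 3 * D)

km = MiniBatchKMeans(n_clusters=50, n_init=3, random_state=0).fit(onehot)
labels = km.labels_  # cluster id per string, feeding stage 1's TSP on the centres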

To find a centroid of N points, one which minimizes the sum of distances |centre - all others|, you can afaik beat O(N^2) only by taking the best of a random sample of, say, sqrt(N) points -- that should be good enough. (Or google / ask a separate question on fast approximate centroids.)
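A sketch of that sampled-medoid idea: score ~sqrt(N) random candidates against the whole cluster and keep the best, O(N^1.5 * D) overall instead of the exact O(N^2 * D):

import math
import random

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def approx_medoid(cluster):
    # best of ~sqrt(N) random candidates, each scored against all N points
    k = max(1, math.isqrt(len(cluster)))
    candidates = random.sample(cluster, k)
    return min(candidates, key=lambda c: sum(hamming(c, s) for s in cluster))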

First pack the data tightly, to save memory accesses in the whole flow. In this case, encode a b c as 00 01 10: 2000 x 2 bits = 500 bytes. (Note the bit-level Hamming distance between these 2-bit codes is 1 or 2, not uniformly 1, but the exact character-level distance can still be recovered from the packed form; see the sketch below.) Fwiw, finding min Hammingdist( 4k bits, 10k x 4k ) takes ~ 40 msec on my mac ppc.
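A sketch of the packed representation in numpy: 4 characters per byte, then XOR plus a small bit trick (OR each 2-bit lane down to one bit, then popcount) to recover the exact character-level Hamming distance. The 2000-character / 500-byte figures match the answer; everything else is illustrative:

import numpy as np

CHAR2BITS = {"a": 0b00, "b": 0b01, "c": 0b10}
POPCOUNT = np.array([bin(i).count("1") for i in range(256)], dtype=np.uint8)

def pack(s):
    # 2 bits per character, 4 characters per byte: D=2000 chars -> 500 bytes
    # (assumes len(s) is a multiple of 4; pad otherwise)
    codes = np.array([CHAR2BITS[ch] for ch in s], dtype=np.uint8).reshape(-1, 4)
    return codes[:, 0] | (codes[:, 1] << 2) | (codes[:, 2] << 4) | (codes[:, 3] << 6)

def hamming_packed(a, b):
    x = a ^ b
    # collapse each 2-bit lane to one bit so every differing character counts once
    diff = (x | (x >> 1)) & 0x55
    return int(POPCOUNT[diff].sum())

print(hamming_packed(pack("aaacb" * 400), pack("acacb" * 400)))  # -> 400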

Cottrell answered 5/1, 2012 at 9:53
