Fast computation of pairs with least hamming distance
Asked Answered
A

1

4

Problem

Suppose you have N (~100k-1m) integers/bitstrings each K (e.g. 256) bits long. The algorithm should return the k pairs with the lowest pairwise Hamming distance.

Example

N = 4
K = 8
i1 = 00010011
i2 = 01010101
i3 = 11000000
i4 = 11000011


HammingDistance(i1,i2) = 3
HammingDistance(i1,i3) = 5
HammingDistance(i1,i4) = 3
HammingDistance(i2,i3) = 4
HammingDistance(i2,i4) = 4
HammingDistance(i3,i4) = 2

For k=1 it should return the pairlist {(i3,i4)}. For k=3 it should return {(i1,i2), (i1,i4), (i3,i4)}. And so on.

Algorithm

The naive implementation computes all pairwise distances, sorts the pairs and returns the k with the lowest distance: O(N^2). Are there any better data structures or algorithms? It looks like the ideas from Efficiently find binary strings with low Hamming distance in large set can not be used since there is no single query integer.

Aerobiology answered 16/8, 2011 at 22:58 Comment(3)
Do you know anything about how close to expect the closest pairs to be?Oler
Usually there are pairs which have a distance of zero or one bit.Aerobiology
Can you post a representative data set? Are you okay with getting no matches back if there is a distance of more than 2 (or 5, or ..) between the closest pairs?Oler
G
7

The recent paper "The Closest Pair Problem under the Hamming Metric" has only algorithms involving an n^2 factor (unless K is very large). That is even for finding only a single pair. So it seems that it is hard to improve this unless you make further assumptions about the structure of your instances. For example, if you assume the Hamming distance is not very large, you could sample a few columns, hash the strings into buckets according to these under the assumptions that these columns match exactly, and then do pairwise comparison in each bucket separately. Repeat this for another set of random columns to minimize the probability you miss some pairs.

Gallardo answered 17/8, 2011 at 7:7 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.