Algorithm to test minimum Hamming distance against a set?

I have a relatively straightforward thing I want to do:

  • Given a query number Q, a query distance d, and a set of numbers S, determine whether or not S contains any numbers with Hamming distance less than or equal to d.

The simplest solution is to make S a list and iterate over it, computing distances. If a distance less than or equal to d is found, bail out and return TRUE.
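In C++ terms, the baseline looks something like this (a sketch, assuming the numbers fit in 64 bits; std::popcount from C++20 typically compiles down to the hardware POPCNT instruction, and the function name any_within is just illustrative):

    #include <bit>
    #include <cstdint>
    #include <vector>

    // Naive baseline: scan S, computing each Hamming distance as popcount(x XOR q),
    // and bail out as soon as one distance is <= d.
    bool any_within(const std::vector<uint64_t>& S, uint64_t q, int d) {
        for (uint64_t x : S)
            if (std::popcount(x ^ q) <= d)
                return true;
        return false;
    }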

But considering that all I want to do is check for existence, something faster than a linear-time solution should be possible.

One thing I tried is an M-tree. Referencing some other questions on Stack Overflow, the Wikipedia article (https://en.wikipedia.org/wiki/M-tree), and two pre-existing implementations, I spent several hours yesterday implementing a custom solution. One of the nice things about this problem is that it's actually cheaper to compute the popcount of the XOR of two numbers (using an SSE instruction) than to store values that would let you avoid computing the metric, so there are several aspects of the solution that can be simplified and optimized for speed.

The results were very disappointing. It turns out that the metric radius I'm dealing with is small compared to the minimum Hamming distance. For instance, in the space of 12-bit numbers, the maximum Hamming distance is 12. If the minimum I'm looking for is 4, that doesn't leave much opportunity for good non-overlapping partitioning. In fact, I tried just that: creating by brute force a set of 12-bit numbers with a minimum Hamming distance of 4, and then (also by brute force) finding the optimal binary tree partitioning so that a search algorithm could visit a minimum number of nodes. If I want to count the number of set elements within d of the query, I can't reduce the number of node visits below about 30% of the total, and stopping at the first match still visits about 4%. This means I've more or less made a linear-time solution, where the overhead of the elaborate tree search is about the same as the savings from not having to check as many set members.

But what I want to do is very limited. I don't even want to count the number of set members within distance d of the query, much less enumerate them; I just want to check for existence. This makes me think of things like Bloom filters and hashes.

I've also thought about building a graph structure where set members are connected by weighted edges. Since Hamming distance respects the triangle inequality, it seems to me there must be some way to search this graph such that edge traversals lead in a direction of likely smaller distance to the query, but I don't really know where to start here.

Does anyone have any other suggestions for a solution here that might handily beat the performance of simply iterating an array?

EDIT and MOTIVATION:

Ultimately this comes from a coding theory question. For a given even number d and word size N, how many codewords with minimum Hamming distance d can I fit into an N-bit word? This allows the creation of a code that can detect errors of d/2 bits and correct errors of up to d/2-1 bits. We know about Shannon-limit codes like LDPC, but those are long codes with nebulous minimum Hamming distance, and they take forever to decode. There are also multi-bit-error codes like OLSC that are fast to decode, but they're far from space-efficient. On the other hand, for d = 4, extended Hamming (SECDED) codes are optimally compact. I've seen BCH-based methods to make a DECTED code, but I don't know whether they're optimal. To explore optimal encodings, what I wanted to do was generate alternative sets of N-bit codewords with some arbitrary minimum distance d, generate circuits to encode and decode them, and select the most compact. I was also hoping to find some patterns that we might exploit for longer codes.

If this is (a) not already done, (b) feasible, and (c) someone would like to co-author a paper, please let me know. :)

Subterfuge answered 11/8, 2016 at 15:4 Comment(20)
Does S have any potentially useful properties? For example, is it likely that some bits of its members are constant (the same in every element)? Does it have a small representation as BDD or acyclic DFA?Bukovina
How large are the numbers in S? (It's best to view them each as a bitset -- the question then is how many elements are there in the universe.) How many bits (a fixed number? a fixed proportion?) are 1? How large is d, and does it vary per query or can you afford to preprocess to solve queries with fixed d quickly? How many numbers (bitsets) are there in S?Quietism
@harold I don't think the numbers have any regular structure. They're essentially random. I could provide more detail on the larger problem and maybe we could see if there is any higher-level organization, but I just don't know.Subterfuge
Well it can't hurt to have more informationBukovina
@Quietism For a given problem, the numbers in S are of fixed size. ONE of the things I want to know is how many numbers of min hamming distance d can fit in some number of bits (i.e. 20 in 8 bits). d will also be constant for a given problem.Subterfuge
@harold An extended Hamming (SECDED) code is optimally compact for codes with min Hamming dist 4. I've read lots of papers that assume the existence of a DECTED code (min H dist 6) without actually showing they know it can be implemented. Of course, I've also seen BCH-based implementations, but I'm not sure they're optimal. I have some ideas for searching circuit spaces for encoders/decoders for ARBITRARY min H distance, but first I need a way to generate set of code sets to evaluate. In the long run, I'm hoping to see a pattern I can exploit.Subterfuge
@TimothyMiller: Thanks, but please edit your question to add this important info. Regarding how many distinct numbers of minimum Hamming distance d can fit in some given number of bits: this is addressed in coding theory. You're essentially asking for the largest number of code symbols that can be used if you want to be able to decode a symbol that could contain up to d-1 erroneous bits.Quietism
Also: "the numbers in S are of fixed size" doesn't help -- how big is that size?Quietism
@j_random_hacker: I want to try different numbers of bits. Right now, I can get up to 16 with brute force and a lot of time. But to see any patterns, I have to try lots of different numbers of bits. Note that there are some truly optimal numbers of bits. For instance, for d=4, where p is the number of parity bits, and c is the word size I want to encode, the best configurations are where c = 2^(p-1)-p. I.e. for 7 parity bits, you can make an optimal SECDED code for at most 57 data bits.Subterfuge
Couldn't you just throw a SAT solver at this though?Bukovina
@harold: If I could throw a SAT solver at this, I wouldn't know how. For one thing, for d=6, I don't even know what I'd be solving for. Also, I'm a chip designer, so some of this data structures, complexity theory, and algorithms stuff gets a little over my head sometimes.Subterfuge
Well for the original problem, you could make a horizontal summation circuit, xor with Q, allow only inputs from S, then solve for "hsum <= d". But I suspect a wider problem can be attacked here, perhaps straight up solving for the required set S. Takes a bit of thinking that I haven't done yetBukovina
At the moment, I'm just working on the coding theory part. The circuits will come later. Something I'd need is the ability to enumerate alternate sets of codes. However, just counting 0 to 2^n-1 and accumulating a list in that order (keeping those compatible with the list so far) seems to work best, but that only yields one possible code set.Subterfuge
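For concreteness, that accumulation procedure looks roughly like this (a sketch, assuming codeword length n ≤ 31 bits; the function name greedy_code is arbitrary):

    #include <bit>
    #include <cstdint>
    #include <vector>

    // Greedy accumulation: scan 0 .. 2^n - 1 in order and keep every value whose
    // Hamming distance to all previously kept values is at least d.  This yields
    // one particular code set, not necessarily a maximum-size one.
    std::vector<uint32_t> greedy_code(int n, int d) {
        std::vector<uint32_t> code;
        for (uint32_t x = 0; x < (uint32_t(1) << n); ++x) {
            bool ok = true;
            for (uint32_t c : code)
                if (std::popcount(x ^ c) < d) { ok = false; break; }
            if (ok)
                code.push_back(x);
        }
        return code;
    }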
OK, so I see now that your set sizes will be pretty small (e.g. 57) and so is your universe size (e.g. 20). For so few bitsets, I think the naive approach (XOR with each and popcount) will probably be faster than anything fancier involving special indexes or filtering (which TTBOMK can't change the asymptotic complexity anyway). The real potential to speed up your high-level goal is to find dominance rules that allow you to avoid generating useless "symmetrical" partial solutions. So are you using branch and bound to try to build maximum-size sets of bitsets? If so I have some ideas.Quietism
And, BTW, I suspect something like this has already been done -- simply because it seems like a good idea! So while I would be interested in co-authoring a paper, I don't think the chances are too good... Have you checked the existing literature?Quietism
@j_random_hacker: I don't usually think of a space of 2^57 numbers as being small. And the 20 I mentioned was for a space of 2^8. So I really think I need a better method to search a set of codes accumulated so far as we're generating a code set. Of course, a space that large isn't something I can enumerate in the RAM space I have, but if there are patterns with smaller numbers of bits, perhaps we can make predictions.Subterfuge
@j_random_hacker: Oh, and I have checked existing literature. I haven't found much in line with what I'm after. There are some BCH-based codes that correct multiple errors, as well as OLSC. But the shannon-limit codes are always done probabilistically on very long code words.Subterfuge
A space of 2^57 is huge, but the test you originally asked for is very quickly solved by the naive XOR-and-popcount-with-every-bitset-so-far algorithm if you have at most 57 bitsets, each 20 bits long.Quietism
Also it seems, based on your comment beginning "At the moment, I'm just", that you are greedily (i.e., not exhaustively) searching the space of all sets of bitsets. That won't guarantee you a maximum-size set of bitsets (though this would probably be impossible for a universe size of 20 -- the raw configuration space here is 2^(2^20) objects, and even with the dominance relations I have in mind for branch and bound, it might still take millennia).Quietism
Let us continue this discussion in chat.Subterfuge

I think the problem can be solved by splitting each number from S into substrings, such that any query result must have at least one partition whose Hamming distance from the corresponding partition of the query is no more than 1.

This algorithm is described in the article: Alex X. Liu, Ke Shen, Eric Torng. Large scale Hamming distance query processing, 2011. The authors call the algorithm HEngine. I will try to explain some of the intuition.

Let N be the bit count of the numbers (their dimensionality),

k be the query Hamming distance, and

r-cut(α) be the function that splits the number α into r substrings {α1, α2, ..., αr}, where the first r − (N mod r) substrings have length ⌊N/r⌋ and the last N mod r substrings have length ⌈N/r⌉.

The algorithm is based on the following theorem:

For any two binary strings β and γ such that HD(β, γ) ≤ k, consider r-cut(β) and r-cut(γ) where r ≥ ⌊k/2⌋ + 1. It must be the case that HD(βi, γi) ≤ 1 for at least q = r − ⌊k/2⌋ different values of i.

For example, suppose we have binary strings of length N = 8 bits and a query Hamming distance of k = 2.

α = 10001110
β = 10100110
HD(α, β) = 2

Then the minimum value of r is ⌊2/2⌋ + 1 = 2. In this case, r-cut splits each of α and β into 2 substrings of 4 bits:

    α1 = 1000    α2 = 1110
    β1 = 1010    β2 = 0110
HD(α1, β1) = 1,  HD(α2, β2) = 1

Here q = 2 − ⌊2/2⌋ = 1, so at least one pair of corresponding substrings must be within Hamming distance 1, which is indeed the case.

The authors also introduce the following theorem:

Consider any string β in the stored set T (here, S) such that HD(α, β) ≤ k. Given any r ≥ ⌊k/2⌋ + 1, it follows that at least one β-signature matches its compatible α-signature.

The basic idea of the algorithm is to preprocess S to facilitate finding all strings β in S that satisfy the signature match property and then verify which of these strings actually are within Hamming distance k of α.

I suggest you preprocess S into sub-tables as in the HEngine algorithm, split Q into partitions the same way, and then search by corresponding partitions, keeping in mind that a match only needs one partition within Hamming distance 1 of the query's corresponding partition.
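As a rough sketch of this idea (not the paper's actual HEngine implementation; the name PartitionIndex and the choice of the minimum r = ⌊k/2⌋ + 1 are just for illustration), you can build one hash table per partition and, at query time, probe each partition of Q together with all of its 1-bit variants, verifying each candidate with a full XOR/popcount:

    #include <bit>
    #include <cstdint>
    #include <unordered_map>
    #include <vector>

    // Sketch of the partition-and-filter idea, specialised to an existence query.
    struct PartitionIndex {
        int nbits, k, r;
        std::vector<int> offset, width;  // bit range of each partition
        std::vector<std::unordered_multimap<uint64_t, uint64_t>> table;  // substring value -> member

        PartitionIndex(int nbits, int k) : nbits(nbits), k(k), r(k / 2 + 1), table(k / 2 + 1) {
            int base = nbits / r, extra = nbits % r, off = 0;
            for (int i = 0; i < r; ++i) {
                // the first r - extra partitions get floor(N/r) bits, the rest get ceil(N/r)
                int w = base + (i >= r - extra ? 1 : 0);
                offset.push_back(off);
                width.push_back(w);
                off += w;
            }
        }

        uint64_t part(uint64_t x, int i) const {
            return (x >> offset[i]) & ((uint64_t(1) << width[i]) - 1);
        }

        void insert(uint64_t x) {  // index x under each of its r substrings
            for (int i = 0; i < r; ++i)
                table[i].emplace(part(x, i), x);
        }

        // Does the indexed set contain any member within Hamming distance k of q?
        bool contains_within(uint64_t q) const {
            for (int i = 0; i < r; ++i) {
                uint64_t p = part(q, i);
                // probe the exact substring and every 1-bit variant of it
                for (int b = -1; b < width[i]; ++b) {
                    uint64_t probe = (b < 0) ? p : (p ^ (uint64_t(1) << b));
                    auto range = table[i].equal_range(probe);
                    for (auto it = range.first; it != range.second; ++it)
                        if (std::popcount(it->second ^ q) <= k)  // verify the candidate
                            return true;
                }
            }
            return false;
        }
    };

With r = ⌊k/2⌋ + 1, the first theorem guarantees that any member within Hamming distance k of Q agrees with Q to within 1 bit on at least one partition, so this filter cannot miss a true match.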

I advise you to see the article for more details.

Jospeh answered 13/8, 2016 at 9:44 Comment(0)
