I have a relative straightforward thing I want to do:
- Given a query number
Q
, a query distanced
, and a set of numbersS
, determine whether or notS
contains any numbers with Hamming distance less than or equal tod
.
The simplest solution is to just make S
a list and iterate over it, computing distances. If a distance less than or equal d is computed, bail out an return TRUE
.
But considering that all I want to do is check for an existence, something faster than a linear time solution should be possible.
One thing I tried is an M-tree
. Referencing some other questions on stackoverflow, the wikipedia article (https://en.wikipedia.org/wiki/M-tree) and two pre-existing implementations, I spent several hours yesterday implementing a custom solution. One of the nice things about this problem is that it's actually cheaper to compute popcount over the XOR of two numbers (using an SSE instruction) than to store numbers that would allow avoidance of computing the metric, so there are several aspects of the solution that could be simplified and optimized for speed.
The results were very disappointing. It turns out that the metric radius I'm dealing with is small compared to the minimum Hamming distance. For instance, in the space of 12 bit numbers, the maximum Hamming distance is 12. If the minimum I'm looking for is 4, that doesn't leave much opportunity for good non-overlapping partitioning. In fact, I tried just that, creating by brute force a set of 12-bit numbers with min Hamming distance of 4 and then (by brute force) finding optimal binary tree partitioning so that a search algorithm could visit a minimum number of nodes. If I want to count the number of set elements within d of the query, I can't reduce the number of node visits below about 30% of the total, and stopping when I find the first has it visit about 4%. This means that I've more or less made a linear-time solution, where the overhead of the elaborate tree search algorithm is about the same as the savings from not having to check as many set members.
But what I want to do is very limited. I don't want to even count the number of set members with query distance <= d
, much less enumerate them. I just want to check for existence. This makes me think about things like bloom filters and hashes.
I've also thought about trying to build a graph structure where set members are connected by edges with weights. Using the fact that Hamming distance respects triangle inequality, it seems to me there must be some way to search this graph such that edge traversals lead in a direction of likely smaller distance to the query, but I don't even really know where to start here.
Does anyone have any other suggestions for a solution here that might handily beat the performance of simply iterating an array?
EDIT and MOTIVATION:
Ultimately this comes from a coding theory question. For a given even number d
and word size N
, how many codes with min hamming distance d can I fit into an N-bit number? This allows the creation of a code that can detect errors of d/2
bits correct errors up to d/2-1
bits. We know about Shannon-limit codes like LDPC, but this is for long codes with nebulous min Hamming distance, and they take forever to decode. There are also multi-bit-error codes like OLSC that are fast to decode, but they're far from space-efficient. On the other hand, for d = 4
, extended Hamming (SECDED) codes are optimally compact. I've seen BCH-based methods to make a DECTED code, but I don't know if they're optimal. To explore optimal encodings, what I wanted to do was generate alternative sets of codes of N
bits with some arbitrary d and generate circuits to encode and decode them, selecting the most compact. I was also hoping to find some patterns that we might exploit for longer codes.
If this is (a) not already done, (b) feasible, and (c) someone would like to co-author a paper, please let me know. :)