I have a relatively straightforward thing I want to do: given a query number Q, a query distance d, and a set of numbers S, determine whether or not S contains any number with Hamming distance to Q less than or equal to d.

The simplest solution is to just make S a list and iterate over it, computing distances. If a distance less than or equal to d is computed, bail out and return TRUE.
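For reference, a minimal sketch of that baseline in C++20 (my own illustration; the function and parameter names are made up). std::popcount compiles down to the single hardware popcount instruction on x86:

```cpp
#include <bit>      // std::popcount (C++20)
#include <cstdint>
#include <vector>

// Baseline: linear scan over S, bailing out at the first member
// within Hamming distance d of the query Q.
bool contains_within(const std::vector<uint64_t>& S, uint64_t Q, int d) {
    for (uint64_t s : S) {
        // Hamming distance = popcount of the XOR of the two words.
        if (std::popcount(s ^ Q) <= d)
            return true;
    }
    return false;
}
```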
But considering that all I want to do is check for existence, something faster than a linear-time solution should be possible.
One thing I tried is an M-tree. Referencing some other questions on Stack Overflow, the Wikipedia article (https://en.wikipedia.org/wiki/M-tree), and two pre-existing implementations, I spent several hours yesterday implementing a custom solution. One of the nice things about this problem is that it's actually cheaper to compute the popcount of the XOR of two numbers (using an SSE instruction) than to store values that would let the search avoid computing the metric, so there are several aspects of the solution that could be simplified and optimized for speed.
The results were very disappointing. It turns out that the metric radius I'm dealing with is small compared to the minimum Hamming distance. For instance, in the space of 12-bit numbers, the maximum Hamming distance is 12. If the minimum I'm looking for is 4, that doesn't leave much opportunity for good non-overlapping partitioning. In fact, I tried just that: I created by brute force a set of 12-bit numbers with a minimum Hamming distance of 4, and then (again by brute force) found the optimal binary tree partitioning, so that a search algorithm could visit a minimum number of nodes. If I want to count the number of set elements within d of the query, I can't reduce the number of node visits below about 30% of the total, and stopping at the first match still visits about 4%. This means that I've more or less made a linear-time solution, where the overhead of the elaborate tree-search algorithm is about the same as the savings from not having to check as many set members.
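As an aside, the brute-force construction mentioned above can be sketched as a greedy loop (my own illustration, not the exact code used; scanning words in increasing order like this is the classic lexicode construction and is not guaranteed to give the largest possible set):

```cpp
#include <bit>
#include <cstdint>
#include <vector>

// Greedily build a set of `bits`-bit words whose pairwise Hamming
// distance is at least min_dist, e.g. greedy_code(12, 4) for the
// experiment described above.
std::vector<uint16_t> greedy_code(int bits, int min_dist) {
    std::vector<uint16_t> code;
    for (uint32_t w = 0; w < (1u << bits); ++w) {
        bool far_enough = true;
        for (uint16_t c : code) {
            if (std::popcount(c ^ w) < min_dist) { far_enough = false; break; }
        }
        if (far_enough) code.push_back(static_cast<uint16_t>(w));
    }
    return code;
}
```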
But what I want to do is very limited. I don't even want to count the number of set members with distance <= d from the query, much less enumerate them. I just want to check for existence. This makes me think about things like Bloom filters and hashes.
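One concrete hash-based approach along those lines (my suggestion, with made-up names, not something from the question): when N and d are small, enumerate the Hamming ball of radius d around Q and probe an exact-membership hash set built from S. For N = 12 and d = 4 that is C(12,0) + C(12,1) + C(12,2) + C(12,3) + C(12,4) = 794 probes, independent of |S|:

```cpp
#include <cstdint>
#include <unordered_set>

// Probe an exact-membership hash set with every word within
// Hamming distance d of q. Bits are flipped in increasing position
// order ('from') so each word in the ball is visited exactly once.
bool ball_probe(const std::unordered_set<uint32_t>& S, uint32_t q,
                int d, int bits, int from = 0) {
    if (S.count(q)) return true;
    if (d == 0) return false;
    for (int b = from; b < bits; ++b)
        if (ball_probe(S, q ^ (1u << b), d - 1, bits, b + 1))
            return true;
    return false;
}
```

This wins when |S| is much larger than the ball volume; for larger N or d the ball grows combinatorially and the linear scan wins again.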
I've also thought about trying to build a graph structure where set members are connected by weighted edges. Using the fact that Hamming distance respects the triangle inequality, it seems to me there must be some way to search this graph such that edge traversals lead in the direction of likely smaller distance to the query, but I don't even really know where to start here.
Does anyone have any other suggestions for a solution here that might handily beat the performance of simply iterating an array?
EDIT and MOTIVATION:
Ultimately this comes from a coding theory question. For a given even number d and word size N, how many codewords with minimum Hamming distance d can I fit into an N-bit word? This allows the creation of a code that can detect errors of d/2 bits and correct errors of up to d/2 - 1 bits. We know about Shannon-limit codes like LDPC, but this is for long codes with nebulous minimum Hamming distance, and they take forever to decode. There are also multi-bit-error codes like OLSC that are fast to decode, but they're far from space-efficient. On the other hand, for d = 4, extended Hamming (SECDED) codes are optimally compact. I've seen BCH-based methods to make a DECTED code, but I don't know if they're optimal. To explore optimal encodings, what I wanted to do was generate alternative sets of N-bit codes with some arbitrary d and generate circuits to encode and decode them, selecting the most compact. I was also hoping to find some patterns that we might exploit for longer codes.
If this is (a) not already done, (b) feasible, and (c) someone would like to co-author a paper, please let me know. :)
I think this problem can be solved by splitting each number in S into substrings, such that any match must have at least one substring whose Hamming distance to the corresponding substring of the query is no more than 1.
This algorithm is described in the article: Alex X. Liu, Ke Shen, Eric Torng, "Large scale Hamming distance query processing", 2011. The authors call the algorithm HEngine. I'll try to explain some of the intuition.
Let:
N - the bit count of the numbers (their dimensionality)
k - the query Hamming distance
r-cut(α) - a function that splits the number α into r substrings {α1, α2, ..., αr}, where the first r − (N mod r) substrings have length ⌊N/r⌋ and the last N mod r substrings have length ⌈N/r⌉ (sketched in code below)
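A possible implementation of r-cut on plain integers (my own translation of the definition; substrings are taken from the least-significant end, and per-substring widths are assumed to stay under 32 bits):

```cpp
#include <cstdint>
#include <vector>

// r-cut: split the low N bits of alpha into r substrings. Following
// the definition above, the first r - (N mod r) substrings get
// floor(N/r) bits and the remaining N mod r get ceil(N/r) bits.
std::vector<uint32_t> r_cut(uint64_t alpha, int N, int r) {
    std::vector<uint32_t> parts;
    const int n_small = r - (N % r);  // substrings of floor(N/r) bits
    for (int i = 0; i < r; ++i) {
        const int len = (i < n_small) ? N / r : N / r + 1;
        parts.push_back(static_cast<uint32_t>(alpha & ((1u << len) - 1)));
        alpha >>= len;
    }
    return parts;
}
```

For example, r_cut(0b10001110, 8, 2) yields {0b1110, 0b1000}: the same 2-cut as in the example below, just listed least-significant substring first.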
The algorithm is based on the following theorem:
For any two binary strings β and γ such that HD(β, γ) ≤ k, consider r-cut(β) and r-cut(γ) where r ≥ ⌊k/2⌋ + 1. It must be the case that HD(βi, γi) ≤ 1 for at least q = r − ⌊k/2⌋ different values of i. (Intuition: each substring pair with HD(βi, γi) ≥ 2 contributes at least 2 to the total distance, so at most ⌊k/2⌋ of the r substrings can exceed distance 1.)
For example, suppose we have binary strings of length N = 8 bits and a query distance k = 2:
α = 10001110
β = 10100110
HD(α, β) = 2
Then the minimum value of r is ⌊2/2⌋ + 1 = 2. In this case, r-cut splits α and β into 2 substrings of 4 bits each:
α1 = 1000 α2 = 1110
β1 = 1010 β2 = 0110
HD(α1, β1) = 1, HD(α2, β2) = 1
And q = 2 − ⌊2/2⌋ = 1, so at least one pair of corresponding substrings is guaranteed to be within Hamming distance 1 (here both pairs are).
The authors also introduce the following theorem:
Consider any string β ∈ T (the stored set) such that HD(α, β) ≤ k. Given any r ≥ ⌊k/2⌋ + 1, it follows that at least one β-signature matches its compatible α-signature.
The basic idea of the algorithm is to preprocess S to facilitate finding all strings β in S that satisfy the signature match property and then verify which of these strings actually are within Hamming distance k of α.
So I suppose you should preprocess S into subtables the way the HEngine algorithm does, and split Q into substrings the same way. Then search each subtable for substrings whose Hamming distance to the corresponding substring of Q is no more than 1, and verify the candidates you find against the full distance k.
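A rough sketch of that candidate-then-verify flow (my own simplification, assuming the r_cut() sketch above; the names are made up, and the real HEngine uses sorted signature tables rather than hash maps). One hash table per substring position maps each substring value to the full words carrying it, and queries probe with the query substring and all of its 1-bit flips:

```cpp
#include <bit>
#include <cstdint>
#include <unordered_map>
#include <vector>
// assumes r_cut() from the sketch above

struct HammingIndex {
    int N, k, r;
    std::vector<std::unordered_map<uint32_t, std::vector<uint64_t>>> tables;

    HammingIndex(const std::vector<uint64_t>& S, int n_bits, int k_dist)
        : N(n_bits), k(k_dist), r(k_dist / 2 + 1), tables(r) {
        for (uint64_t s : S) {
            std::vector<uint32_t> parts = r_cut(s, N, r);
            for (int i = 0; i < r; ++i)
                tables[i][parts[i]].push_back(s);
        }
    }

    // With r = floor(k/2) + 1, the first theorem guarantees that any
    // member within distance k of Q agrees with Q to within distance 1
    // on at least one substring, so probing each position with the
    // query substring and all of its 1-bit flips cannot miss a match.
    bool contains_within(uint64_t Q) const {
        std::vector<uint32_t> qparts = r_cut(Q, N, r);
        const int n_small = r - (N % r);
        for (int i = 0; i < r; ++i) {
            const int len = (i < n_small) ? N / r : N / r + 1;
            for (int b = -1; b < len; ++b) {  // b == -1: the exact substring
                uint32_t probe = (b < 0) ? qparts[i] : qparts[i] ^ (1u << b);
                auto it = tables[i].find(probe);
                if (it == tables[i].end()) continue;
                for (uint64_t s : it->second)  // verify each candidate fully
                    if (std::popcount(s ^ Q) <= k) return true;
            }
        }
        return false;
    }
};
```

A candidate can appear in several tables, so a real implementation would deduplicate before verifying; for a pure existence check the early return makes that harmless.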
I advise you to look at the article for more details.