simhash captures similarity between (small) strings, but it does not really solve the problem of querying for the most similar strings in a bigger-than-RAM dataset. If I remember correctly, the original paper recommends indexing several permutations of the hash, which requires a lot of memory and does not take advantage of the ordered nature of an OKVS.
I think I found a hash that captures similarity in its prefix:
In [1]: import fuzz
In [2]: hello = fuzz.bbkh("hello")
In [3]: helo = fuzz.bbkh("helo")
In [4]: hellooo = fuzz.bbkh("hellooo")
In [5]: salut = fuzz.bbkh("salut")
In [6]: len(fuzz.lcp(hello.hex(), helo.hex())) # Longest Common Prefix
Out[6]: 213
In [7]: len(fuzz.lcp(hello.hex(), hellooo.hex()))
Out[7]: 12
In [8]: len(fuzz.lcp(hello.hex(), salut.hex()))
Out[8]: 0
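Because the similarity lives in the prefix, sorting the hashes puts similar strings next to each other, which is exactly what an ordered key-value store can exploit with a range scan. Here is a minimal sketch of that idea; the labels are made up and a plain Python sort stands in for the OKVS:

import fuzz

labels = ["hello", "helo", "hellooo", "salut", "salute", "world"]
# Sorting by hash groups similar labels together, so a range scan around
# bbkh(query) in an ordered key-value store returns good candidates.
for key, label in sorted((fuzz.bbkh(label), label) for label in labels):
    print(key.hex()[:16], label)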
After a small test over Wikidata labels, it seems to give good results:
$ time python fuzz.py query 10 france
* most similar according to bbk fuzzbuzz
** france 0
** farrance -2
** freande -2
** defrance -2
real 0m0.054s
$ time python fuzz.py query 10 frnace
* most similar according to bbk fuzzbuzz
** farnace -1
** france -2
** fernacre -2
real 0m0.060s
$ time python fuzz.py query 10 beglium
* most similar according to bbk fuzzbuzz
** belgium -2
real 0m0.047s
$ time python fuzz.py query 10 belgium
* most similar according to bbk fuzzbuzz
** belgium 0
** ajbelgium -2
real 0m0.059s
$ time python fuzz.py query 10 begium
* most similar according to bbk fuzzbuzz
** belgium -1
** beijum -2
real 0m0.047s
Here is an implementation:
from itertools import product
from string import ascii_lowercase

HASH_SIZE = 2**10  # 27 * 27 = 729 possible bigrams fit in 1024 one-hot bits
BBKH_LENGTH = int(HASH_SIZE * 2 / 8)  # the 2 * HASH_SIZE - 1 tree bits, in bytes

chars = ascii_lowercase + "$"  # "$" marks the beginning and the end of the string
ONE_HOT_ENCODER = sorted([''.join(x) for x in product(chars, chars)])

def ngram(string, n):
    """Return the list of n-grams of string."""
    return [string[i:i+n] for i in range(len(string)-n+1)]

def integer2booleans(integer):
    """Turn integer into a fixed-width list of HASH_SIZE booleans."""
    return [x == '1' for x in bin(integer)[2:].zfill(HASH_SIZE)]
def chunks(l, n):
    """Yield successive n-sized chunks from l."""
    for i in range(0, len(l), n):
        yield l[i:i + n]
def merkletree(booleans):
    """OR-reduce the HASH_SIZE leaves level by level; the coarse levels end up
    at the beginning of the output and the leaves at the end."""
    assert len(booleans) == HASH_SIZE
    length = (2 * len(booleans) - 1)
    out = [False] * length
    index = length - 1
    booleans = list(reversed(booleans))
    while len(booleans) > 1:
        # write the current level at the end of the remaining slots
        for boolean in booleans:
            out[index] = boolean
            index -= 1
        # then OR pairs together to build the next, coarser level
        new = []
        for (right, left) in chunks(booleans, 2):
            value = right or left
            new.append(value)
        booleans = new
    # the root slot out[0] is never written and stays False for every hash
    return out
def bbkh(string):
    """Compute a hash whose prefix captures the similarity of string."""
    integer = 0
    string = "$" + string + "$"
    for gram in ngram(string, 2):
        # one-hot encode the bigrams of the string into a big integer
        hotbit = ONE_HOT_ENCODER.index(gram)
        hotinteger = 1 << hotbit
        integer = integer | hotinteger
    booleans = integer2booleans(integer)
    tree = merkletree(booleans)
    fuzz = ''.join('1' if x else '0' for x in tree)
    buzz = int(fuzz, 2)
    hash = buzz.to_bytes(BBKH_LENGTH, 'big')
    return hash
def lcp(a, b):
    """Longest Common Prefix between a and b"""
    out = []
    for x, y in zip(a, b):
        if x == y:
            out.append(x)
        else:
            break
    return ''.join(out)
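To make it clearer why the prefix captures similarity, here is a toy version of the tree trick with 8 leaves instead of HASH_SIZE. The values are made up and the bit layout is simplified compared to merkletree above: the first bits only record which halves and quarters of the one-hot space are populated, so two strings sharing most of their bigrams agree on a long prefix.

# Toy illustration, not the module's exact bit layout: coarse OR-reduced
# levels come first, leaves last, so the prefix encodes coarse structure.
def toy_tree(leaves):
    out = list(leaves)
    level = leaves
    while len(level) > 1:
        level = [left or right for left, right in zip(level[::2], level[1::2])]
        out = level + out  # prepend the coarser level so the root comes first
    return out

leaves_a = [True, False, True, False, False, False, False, False]
leaves_b = [True, True,  True, False, False, False, False, False]  # one extra bigram

prefix = 0
for x, y in zip(toy_tree(leaves_a), toy_tree(leaves_b)):
    if x != y:
        break
    prefix += 1
print(prefix, "of", 2 * len(leaves_a) - 1, "tree bits in the common prefix")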
Evaluation: using Wikipedia's list of commonly misspelled words (around 8k words), looking at the top 10 nearest words yields a 77% success rate with each query taking around 20ms; looking at the top 100 yields 94% with each query taking less than 200ms. Most mistakes come from joined words (e.g. "abouta" instead of "about a") or words separated by a dash.
Check out the code at https://github.com/amirouche/fuzzbuzz/blob/master/typofix.py
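For the curious, here is a minimal sketch of what a top-N query can look like on top of bbkh and lcp. This is not the linked typofix.py: the index structure, window size, and scoring below are assumptions, with Python's bisect standing in for the OKVS cursor.

import bisect
import fuzz

def make_index(labels):
    """Sorted list of (hash, label) pairs, standing in for keys in an OKVS."""
    return sorted((fuzz.bbkh(label), label) for label in labels)

def query(index, string, limit=10, window=50):
    """Seek to bbkh(string), scan a window of neighboring keys and rank
    them by the length of the common prefix with the query hash."""
    key = fuzz.bbkh(string)
    position = bisect.bisect_left(index, (key,))
    candidates = index[max(0, position - window):position + window]
    ranked = sorted(candidates,
                    key=lambda item: len(fuzz.lcp(key.hex(), item[0].hex())),
                    reverse=True)
    return [label for _, label in ranked[:limit]]

index = make_index(["france", "belgium", "germany", "farrance", "defrance"])
print(query(index, "frnace"))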
Note: computing the hash over a simhash instead of over the input string only works with a bag of lemmas or stems; in that case it finds similar documents.
Using a bytes encoding is only an optimization; the hash is really the string of tree bits, so it is still possible to figure out what a binary representation such as 0b001 means.