simhash captures similarity between (small) strings, but it does not really solve the problem of querying for the most similar strings in a bigger-than-RAM dataset. If I remember correctly, the original paper recommends indexing several permutations of the hash, which requires a lot of memory and does not take advantage of the ordered nature of an OKVS.
I think I found a hash that captures similarity in its prefix:
In [1]: import fuzz
In [2]: hello = fuzz.bbkh("hello")
In [3]: helo = fuzz.bbkh("helo")
In [4]: hellooo = fuzz.bbkh("hellooo")
In [5]: salut = fuzz.bbkh("salut")
In [6]: len(fuzz.lcp(hello.hex(), helo.hex())) # Longest Common Prefix
Out[6]: 213
In [7]: len(fuzz.lcp(hello.hex(), hellooo.hex()))
Out[7]: 12
In [8]: len(fuzz.lcp(hello.hex(), salut.hex()))
Out[8]: 0
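Because the similarity lives in the prefix, sorting the hashes puts similar strings next to each other, which is exactly what an ordered key-value store can exploit with a range scan. Here is a minimal sketch of that idea; the labels are made up and a plain Python sort stands in for the OKVS:

import fuzz

labels = ["hello", "helo", "hellooo", "salut", "salute", "world"]
# Sorting by hash groups similar labels together, so a range scan around
# bbkh(query) in an ordered key-value store returns good candidates.
for key, label in sorted((fuzz.bbkh(label), label) for label in labels):
    print(key.hex()[:16], label)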
After a small test over Wikidata labels, it seems to give good results:
$ time python fuzz.py query 10 france
* most similar according to bbk fuzzbuzz
** france 0
** farrance -2
** freande -2
** defrance -2
real 0m0.054s
$ time python fuzz.py query 10 frnace
* most similar according to bbk fuzzbuzz
** farnace -1
** france -2
** fernacre -2
real 0m0.060s
$ time python fuzz.py query 10 beglium
* most similar according to bbk fuzzbuzz
** belgium -2
real 0m0.047s
$ time python fuzz.py query 10 belgium
* most similar according to bbk fuzzbuzz
** belgium 0
** ajbelgium -2
real 0m0.059s
$ time python fuzz.py query 10 begium
* most similar according to bbk fuzzbuzz
** belgium -1
** beijum -2
real 0m0.047s
Here is an implementation:
from itertools import product
from string import ascii_lowercase

HASH_SIZE = 2**10  # 27 * 27 = 729 possible bigrams fit in 1024 one-hot bits
BBKH_LENGTH = int(HASH_SIZE * 2 / 8)  # the 2 * HASH_SIZE - 1 tree bits, in bytes

chars = ascii_lowercase + "$"  # "$" marks the beginning and the end of the string
ONE_HOT_ENCODER = sorted([''.join(x) for x in product(chars, chars)])

def ngram(string, n):
    """Return the list of n-grams of string."""
    return [string[i:i+n] for i in range(len(string)-n+1)]

def integer2booleans(integer):
    """Turn integer into a fixed-width list of HASH_SIZE booleans."""
    return [x == '1' for x in bin(integer)[2:].zfill(HASH_SIZE)]
def chunks(l, n):
    """Yield successive n-sized chunks from l."""
    for i in range(0, len(l), n):
        yield l[i:i + n]
def merkletree(booleans):
    """OR-reduce the HASH_SIZE leaves level by level; the coarse levels end up
    at the beginning of the output and the leaves at the end."""
    assert len(booleans) == HASH_SIZE
    length = (2 * len(booleans) - 1)
    out = [False] * length
    index = length - 1
    booleans = list(reversed(booleans))
    while len(booleans) > 1:
        # write the current level at the end of the remaining slots
        for boolean in booleans:
            out[index] = boolean
            index -= 1
        # then OR pairs together to build the next, coarser level
        new = []
        for (right, left) in chunks(booleans, 2):
            value = right or left
            new.append(value)
        booleans = new
    # the root slot out[0] is never written and stays False for every hash
    return out
def bbkh(string):
    """Compute a hash whose prefix captures the similarity of string."""
    integer = 0
    string = "$" + string + "$"
    for gram in ngram(string, 2):
        # one-hot encode the bigrams of the string into a big integer
        hotbit = ONE_HOT_ENCODER.index(gram)
        hotinteger = 1 << hotbit
        integer = integer | hotinteger
    booleans = integer2booleans(integer)
    tree = merkletree(booleans)
    fuzz = ''.join('1' if x else '0' for x in tree)
    buzz = int(fuzz, 2)
    hash = buzz.to_bytes(BBKH_LENGTH, 'big')
    return hash
def lcp(a, b):
    """Longest Common Prefix between a and b"""
    out = []
    for x, y in zip(a, b):
        if x == y:
            out.append(x)
        else:
            break
    return ''.join(out)
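To make it clearer why the prefix captures similarity, here is a toy version of the tree trick with 8 leaves instead of HASH_SIZE. The values are made up and the bit layout is simplified compared to merkletree above: the first bits only record which halves and quarters of the one-hot space are populated, so two strings sharing most of their bigrams agree on a long prefix.

# Toy illustration, not the module's exact bit layout: coarse OR-reduced
# levels come first, leaves last, so the prefix encodes coarse structure.
def toy_tree(leaves):
    out = list(leaves)
    level = leaves
    while len(level) > 1:
        level = [left or right for left, right in zip(level[::2], level[1::2])]
        out = level + out  # prepend the coarser level so the root comes first
    return out

leaves_a = [True, False, True, False, False, False, False, False]
leaves_b = [True, True,  True, False, False, False, False, False]  # one extra bigram

prefix = 0
for x, y in zip(toy_tree(leaves_a), toy_tree(leaves_b)):
    if x != y:
        break
    prefix += 1
print(prefix, "of", 2 * len(leaves_a) - 1, "tree bits in the common prefix")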
Evaluation: using Wikipedia's list of commonly misspelled words (around 8k words), looking at the top 10 nearest words yields a 77% success rate with each query taking around 20ms; looking at the top 100 yields 94% with each query taking less than 200ms. Most mistakes come from joined words (e.g. "abouta" instead of "about a") or words separated by a dash.
Check out the code at https://github.com/amirouche/fuzzbuzz/blob/master/typofix.py
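For the curious, here is a minimal sketch of what a top-N query can look like on top of bbkh and lcp. This is not the linked typofix.py: the index structure, window size, and scoring below are assumptions, with Python's bisect standing in for the OKVS cursor.

import bisect
import fuzz

def make_index(labels):
    """Sorted list of (hash, label) pairs, standing in for keys in an OKVS."""
    return sorted((fuzz.bbkh(label), label) for label in labels)

def query(index, string, limit=10, window=50):
    """Seek to bbkh(string), scan a window of neighboring keys and rank
    them by the length of the common prefix with the query hash."""
    key = fuzz.bbkh(string)
    position = bisect.bisect_left(index, (key,))
    candidates = index[max(0, position - window):position + window]
    ranked = sorted(candidates,
                    key=lambda item: len(fuzz.lcp(key.hex(), item[0].hex())),
                    reverse=True)
    return [label for _, label in ranked[:limit]]

index = make_index(["france", "belgium", "germany", "farrance", "defrance"])
print(query(index, "frnace"))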
Note: computing the hash over a simhash instead of over the input string only works with a bag of lemmas or stems; in that case it finds similar documents.
Using a bytes encoding is only an optimization; the hash is really the string of tree bits, so it is still possible to figure out what a binary representation such as 0b001 means.