How to find the best fuzzy match for a string in a large string database

I have a database of strings (of arbitrary length) that holds more than one million items (potentially many more).

I need to compare a user-provided string against the whole database and retrieve an identical string if it exists or otherwise return the closest fuzzy match(es) (60% similarity or better). The search time should ideally be under one second.

My idea is to use edit distance to compare each database string to the search string, after first narrowing down the candidates based on their length.

However, since I will need to perform this operation very often, I'm thinking of building an in-memory index of the database strings and querying that index rather than the database directly.

Any ideas on how to approach this problem differently or how to build the in-memory index?
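
Roughly what I have in mind, sketched in Python (difflib's ratio is only a stand-in similarity score here; the real version would use edit distance, and the class and names are just illustrative):

```python
import difflib
from collections import defaultdict

class FuzzyIndex:
    """In-memory index of strings, bucketed by length (hypothetical sketch)."""

    def __init__(self, strings):
        self.by_length = defaultdict(list)
        for s in strings:
            self.by_length[len(s)].append(s)

    def search(self, query, min_similarity=0.6, max_results=5):
        # For >=60% similarity under edit distance, a candidate's length must
        # lie between 0.6*n and n/0.6, so only those buckets need scanning.
        n = len(query)
        lo, hi = int(n * min_similarity), int(n / min_similarity) + 1
        candidates = []
        for length in range(lo, hi + 1):
            candidates.extend(self.by_length.get(length, ()))
        # difflib's ratio is a stand-in here; a real edit-distance scorer
        # (e.g. python-Levenshtein) would slot in the same way.
        return difflib.get_close_matches(query, candidates,
                                         n=max_results, cutoff=min_similarity)

index = FuzzyIndex(["apple", "apply", "banana", "cherry"])
print(index.search("appel"))   # close matches such as 'apple' and 'apply'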

Pouliot answered 21/11, 2008 at 17:2 Comment(1)

This paper seems to describe exactly what you want.

Lucene (http://lucene.apache.org/) also implements Levenshtein edit distance.
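
For reference, the textbook dynamic-programming version of Levenshtein distance looks something like this (a plain sketch with an optional early-exit bound, not Lucene's automaton-based implementation):

```python
def levenshtein(a, b, max_dist=None):
    """Textbook dynamic-programming edit distance, computed row by row."""
    if len(a) < len(b):
        a, b = b, a                           # keep the inner row the shorter one
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        # Optional cutoff: once every cell in a row exceeds the bound,
        # the final distance can't come back under it.
        if max_dist is not None and min(curr) > max_dist:
            return max_dist + 1
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))          # 3
print(levenshtein("flaw", "lawn", max_dist=1))   # 2, i.e. more than max_dist
```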

Cadmann answered 21/11, 2008 at 18:21 Comment(4)
The first link appears to have gone. :-/ – Lefty
I emailed a contact to see if we can track down zarawesome and fix this link. Unfortunately, no direct email was provided, so.. – Burweed
Sorry, yeah, I don't remember what the paper was about. I suggest you search for "Levenshtein edit distance" and see what comes up. – Cadmann
The paper is "Fast String Correction with Levenshtein-Automata" by Klaus U. Schulz and Stoyan Mihov. CiteSeerX link: citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.26.2940 – Unipod

You didn't mention your database system, but for PostgreSQL you could use the following contrib module: pg_trgm - Trigram matching for PostgreSQL

The pg_trgm contrib module provides functions and index classes for determining the similarity of text based on trigram matching.
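
A rough sketch of how it might be used, assuming the strings live in a hypothetical items(name) table and the psycopg2 driver is available (connection details are made up):

```python
import psycopg2  # hypothetical connection to a local PostgreSQL instance

conn = psycopg2.connect("dbname=mydb")
cur = conn.cursor()

# One-time setup: enable the extension and add a trigram index on the column.
cur.execute("CREATE EXTENSION IF NOT EXISTS pg_trgm")
cur.execute("CREATE INDEX IF NOT EXISTS items_name_trgm_idx "
            "ON items USING gin (name gin_trgm_ops)")

# similarity() returns a 0..1 score; 0.6 corresponds to the 60% requirement.
cur.execute("""
    SELECT name, similarity(name, %s) AS score
    FROM items
    WHERE similarity(name, %s) >= 0.6
    ORDER BY score DESC
    LIMIT 5
""", ("user input", "user input"))
print(cur.fetchall())
```

In practice the % operator (together with set_limit()) is what lets the GIN index do the candidate filtering; similarity() is used in the WHERE clause above only because it reads more directly.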

Jennijennica answered 21/11, 2008 at 18:59 Comment(0)

If your database supports it, you should use full-text search. Otherwise, you can use an indexer like Lucene or one of its various ports.

Petty answered 14/12, 2008 at 11:23 Comment(0)

Since the amount of data is large, I would compute the value of a phonetic algorithm when inserting each record, store it in an indexed column, and then constrain my SELECT queries (via the WHERE clause) to a range on that column.
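
Something along these lines, sketched for PostgreSQL with hypothetical table and column names; fuzzystrmatch's soundex() stands in for whatever phonetic function you choose, and equality on the key is used (a range would work the same way):

```python
import psycopg2  # hypothetical connection details

conn = psycopg2.connect("dbname=mydb")
cur = conn.cursor()

# One-time: a precomputed phonetic-key column with a plain b-tree index.
# New rows should get the key at insert time (in the app or via a trigger);
# the UPDATE below just backfills existing rows.
cur.execute("CREATE EXTENSION IF NOT EXISTS fuzzystrmatch")
cur.execute("ALTER TABLE items ADD COLUMN IF NOT EXISTS phonetic_key text")
cur.execute("UPDATE items SET phonetic_key = soundex(name)")
cur.execute("CREATE INDEX IF NOT EXISTS items_phonetic_idx ON items (phonetic_key)")

# At query time, only rows sharing the key are pulled out for closer comparison.
cur.execute("SELECT name FROM items WHERE phonetic_key = soundex(%s)",
            ("user input",))
print(cur.fetchall())
```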

Magnet answered 21/11, 2008 at 17:13 Comment(0)

Compute the SOUNDEX hash (which is built into many SQL database engines) and index by it.

SOUNDEX is a hash based on the sound of a word, so misspellings of the same word are likely to produce the same SOUNDEX hash.

Then find the SOUNDEX hash of the search string, and match on it.
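
A simplified, self-contained sketch of the idea in Python (a hand-rolled Soundex, not any particular database's built-in), bucketing the strings by their hash so a lookup only touches one bucket:

```python
from collections import defaultdict

_CODES = {c: d for d, letters in enumerate(
    ("bfpv", "cgjkqsxz", "dt", "l", "mn", "r"), start=1) for c in letters}

def soundex(word, length=4):
    """Simplified American Soundex: first letter plus up to three digit codes."""
    word = "".join(c for c in word.lower() if c.isalpha())
    if not word:
        return ""
    codes = [str(_CODES.get(c, 0)) for c in word]
    out = [word[0].upper()]
    prev = codes[0]
    for c, code in zip(word[1:], codes[1:]):
        if code != "0" and code != prev:
            out.append(code)
        if c not in "hw":          # h and w do not break runs of equal codes
            prev = code
    return ("".join(out) + "000")[:length]

# Bucketing the database by hash turns a full scan into a small candidate set.
index = defaultdict(list)
for s in ["Robert", "Rupert", "Ashcraft", "Tymczak"]:
    index[soundex(s)].append(s)

print(soundex("Robert"), soundex("Rupert"))   # R163 R163
print(index[soundex("Robrt")])                # ['Robert', 'Rupert'] (same bucket)
```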

Tented answered 21/11, 2008 at 17:54 Comment(2)
Soundex cannot see through many misspellings or other variants. It works well on names but not on arbitrary strings. – Sinistral
Interesting. I didn't know it was focused on names. I knew NYSIIS was. (en.wikipedia.org/wiki/…) – Tented

A very extensive explanation of relevant algorithms is in the book Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology by Dan Gusfield.

Sinistral answered 13/2, 2010 at 14:11 Comment(0)

https://en.wikipedia.org/wiki/Levenshtein_distance

The Levenshtein algorithm is implemented in some DBMSs, e.g. PostgreSQL's fuzzystrmatch module: http://www.postgresql.org/docs/9.1/static/fuzzystrmatch.html
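
For example, with fuzzystrmatch loaded, a query might look like this (hypothetical items(name) table, psycopg2 driver):

```python
import psycopg2  # hypothetical connection details

conn = psycopg2.connect("dbname=mydb")
cur = conn.cursor()
cur.execute("CREATE EXTENSION IF NOT EXISTS fuzzystrmatch")

# levenshtein_less_equal() stops counting once the bound is exceeded,
# which is cheaper than levenshtein() when only near matches are wanted.
cur.execute("""
    SELECT name, levenshtein(name, %s) AS dist
    FROM items
    WHERE levenshtein_less_equal(name, %s, 3) <= 3
    ORDER BY dist
    LIMIT 5
""", ("user input", "user input"))
print(cur.fetchall())
```

Note that this still scans every row, since a plain levenshtein() call can't use an index; combining it with a length or trigram pre-filter keeps it fast.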

Bison answered 10/11, 2015 at 13:29 Comment(0)
