I wish to create a fuzzy search algorithm. However, upon hours of research I am really struggling.
I want to create an algorithm that performs a fuzzy search on a list of names of schools.
This is what I have looked at so far:
Most of my research keep pointing to "string metrics" on Google and Stackoverflow such as:
- Levenshtein distance
- Damerau-Levenshtein distance
- Needleman–Wunsch algorithm
However this just gives a score of how similar 2 strings are. The only way I can think of implementing it as a search algorithm is to perform a linear search and executing the string metric algorithm for each string and returning the strings with scores above a certain threshold. (Originally I had my strings stored in a trie tree, but this obviously won't help me here!)
Although this is not such a bad idea for small lists, it would be problematic for lists with lets say a 100,000 names, and the user performed many queries.
Another algorithm I looked at is the Spell-checker method, where you just do a search for all potential misspellings. However this also is highly inefficient as it requires more than 75,000 words for a word of length 7 and error count of just 2.
What I need?
Can someone please suggest me a good efficient fuzzy search algorithm. with:
- Name of the algorithm
- How it works or a link to how it works
- Pro's and cons and when it's best used (optional)
I understand that all algorithms will have their pros and cons and there is no best algorithm.