Fuzzy search algorithm (approximate string matching algorithm)

I wish to create a fuzzy search algorithm. However, after hours of research I am really struggling.

I want to create an algorithm that performs a fuzzy search on a list of names of schools.

This is what I have looked at so far:

Most of my research keeps pointing to "string metrics" on Google and Stack Overflow, such as:

  • Levenshtein distance
  • Damerau-Levenshtein distance
  • Needleman–Wunsch algorithm

However, these just give a score of how similar two strings are. The only way I can think of implementing one as a search algorithm is to perform a linear search, executing the string metric algorithm for each string and returning the strings with scores above a certain threshold. (Originally I had my strings stored in a trie, but that obviously won't help me here!)
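Something like this is what I mean (difflib's ratio used as a stand-in for a proper Levenshtein score; the threshold is arbitrary):

    import difflib

    def naive_fuzzy_search(query, names, threshold=0.8):
        """O(n) scan: score every name, keep those above the threshold."""
        return [name for name in names
                if difflib.SequenceMatcher(None, query.lower(), name.lower()).ratio() >= threshold]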

Although this is not such a bad idea for small lists, it would be problematic for lists with, let's say, 100,000 names, where the user performs many queries.

Another algorithm I looked at is the spell-checker method, where you just do a search for all potential misspellings. However, this is also highly inefficient, as it requires generating more than 75,000 candidate strings for a word of length 7 with an error count of just 2.
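For example, generating candidates the way Norvig's well-known spell-checker essay does quickly blows up:

    import string

    def edits1(word):
        """All strings one edit away: deletes, transposes, replaces, inserts."""
        splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
        deletes = [L + R[1:] for L, R in splits if R]
        transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
        replaces = [L + c + R[1:] for L, R in splits if R for c in string.ascii_lowercase]
        inserts = [L + c + R for L, R in splits for c in string.ascii_lowercase]
        return set(deletes + transposes + replaces + inserts)

    one = edits1("college")
    two = {e2 for e1 in one for e2 in edits1(e1)}
    print(len(one), len(two))  # hundreds, then tens of thousands of candidates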

What I need:

Can someone please suggest a good, efficient fuzzy search algorithm, with:

  • Name of the algorithm
  • How it works or a link to how it works
  • Pros and cons, and when it's best used (optional)

I understand that all algorithms will have their pros and cons and there is no best algorithm.

Garate answered 1/9, 2015 at 16:58 Comment(5)
Check this out, see if it helps: #491648 – Numerary
The good ones are complicated, so you might want to consider an off-the-shelf implementation like Lucene. – Viscountess
It is possible to create a ternary tree with a search method that calculates edit distance on the fly as it descends the tree. It's not easy, but I have made one before and it works. – Tsana
Look into this answer: https://mcmap.net/q/81292/-how-to-do-fuzzy-string-matching-of-bigger-than-memory-dictionary-in-an-ordered-key-value-store – Paid
I found several implementations of fuzzy string search algorithms in Python that solve this problem. – Recess

Considering that you're trying to do a fuzzy search on a list of school names, I don't think you want to go for traditional string similarity like Levenshtein distance. My assumption is that you're taking a user's input (either keyboard input or spoken over the phone), and you want to quickly find the matching school.

Distance metrics tell you how similar two strings are based on substitutions, deletions, and insertions. But those algorithms don't really tell you anything about how similar the strings are as words in a human language.

Consider, for example, the words "smith," "smythe," and "smote". I can go from "smythe" to "smith" in two steps:

smythe -> smithe -> smith

And from "smote" to "smith" in two steps:

smote -> smite -> smith

So the two have the same distance as strings, but as words, they're significantly different. If somebody told you (spoken language) that he was looking for "Symthe College," you'd almost certainly say, "Oh, I think you mean Smith." But if somebody said "Smote College," you wouldn't have any idea what he was talking about.

What you need is a phonetic algorithm like Soundex or Metaphone. Basically, those algorithms break a word down into phonemes and create a representation of how the word is pronounced in spoken language. You can then compare the result against a known list of words to find a match.

Such a system would be much faster than using a distance metric. Consider that with a distance metric, you need to compare the user's input with every word in your list to obtain the distance. That is computationally expensive, and the results, as I demonstrated with "smith" and "smote", can be laughably bad.

Using a phonetic algorithm, you create the phoneme representation of each of your known words and place it in a dictionary (a hash map or possibly a trie). That's a one-time startup cost. Then, whenever the user inputs a search term, you create the phoneme representation of his input and look it up in your dictionary. That is a lot faster and produces much better results.
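By way of illustration, here's a simplified Soundex (unlike full Soundex, it treats "h" and "w" like vowels) together with the dictionary lookup described above; the school names are made up:

    from collections import defaultdict

    def soundex(word):
        """Simplified Soundex: first letter plus up to three digits."""
        codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
                 **dict.fromkeys("dt", "3"), "l": "4",
                 **dict.fromkeys("mn", "5"), "r": "6"}
        word = word.lower()
        encoded = [codes.get(ch, "") for ch in word]    # vowels, h, w, y -> ""
        digits, prev = [], encoded[0]
        for code in encoded[1:]:
            if code and code != prev:                   # collapse adjacent repeats
                digits.append(code)
            prev = code
        return (word[0].upper() + "".join(digits) + "000")[:4]

    schools = ["Smith College", "Smythe Academy", "Northwood High"]

    # One-time startup cost: index every name by the code of each of its words.
    index = defaultdict(set)
    for name in schools:
        for w in name.split():
            index[soundex(w)].add(name)

    def phonetic_lookup(query):
        """Union of all names sharing a code with any word of the query."""
        hits = set()
        for w in query.split():
            hits |= index.get(soundex(w), set())
        return hits

    print(soundex("smith"), soundex("smythe"))       # S530 S530
    print(sorted(phonetic_lookup("Symthe Colege")))  # ['Smith College', 'Smythe Academy']

Note that plain Soundex is coarse: "smote" also collapses to S530. Metaphone, which gives "th" its own symbol, keeps "smith" and "smote" apart, which is one reason the Metaphone family tends to produce better results.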

Consider also that when people misspell proper names, they almost always get the first letter right, and more often than not the misspelling, when pronounced, sounds like the actual word they were trying to spell. If that's the case, then the phonetic algorithms are definitely the way to go.

Eduction answered 1/9, 2015 at 17:38 Comment(5)
This looks good. Great suggestion! An issue I have, though, is that there is no attempt to check for accidental mistypes from pressing an adjacent key, e.g. typing "s" instead of "a". Is there a middle way? (Also, Metaphone 3 being commercial is a bit of a put-off.) But good suggestion nonetheless. – Garate
@YahyaUddin: I wouldn't worry too much about Metaphone 3 being commercial. Soundex works quite well. I've not used Metaphone, so I can't say how well it works. – Eduction
@YahyaUddin: Expecting the algorithm to be resilient enough to notice a mistyped letter ('l' for 's', or 's' for 'a', etc.) is more difficult. I suspect that if the algorithm returned no results, or returned completely inappropriate results, the user would notice his mistake and correct it. However, you could do a hybrid approach: create a trie of all the names and search it if the phonetic algorithm returns nothing or very little. Look into how Scrabble and other word games are able to fill in missing letters or tell you what words can be made from a bag of letters. – Eduction
Good advice; now I'm thinking "how to implement this"! Where did you learn this technique of phonetic interpretation? – Spectra
@Spectra: Soundex has been around for over 100 years. I learned about it back in the '80s from a computer magazine article. You can find working implementations in many computer languages, and a quick Internet search should give you plenty of references from which to implement it. – Eduction

I wrote an article about how I implemented a fuzzy search:

https://medium.com/@Srekel/implementing-a-fuzzy-search-algorithm-for-the-debuginator-cacc349e6c55

The implementation is on GitHub and is in the public domain, so feel free to have a look.

https://github.com/Srekel/the-debuginator/blob/master/the_debuginator.h#L1856

The basics of it: split all the strings you'll be searching into parts. So if you have paths, "C:\documents\lol.txt" might become "C", "documents", "lol", "txt".

Lowercase these strings so the search is case-insensitive (maybe only do this when the search string is all lowercase).
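Something like this, say (the separator set here is just a guess; use whatever fits your data):

    import re

    def split_parts(path):
        """Split on path separators, the drive colon, and the extension dot."""
        return [p.lower() for p in re.split(r"[\\/:.]+", path) if p]

    print(split_parts(r"C:\documents\lol.txt"))  # ['c', 'documents', 'lol', 'txt']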

Then match your search string against this. In my case I want to match it regardless of order, so "loldoc" would still match the above path even though "lol" comes after "doc".

The matching needs some scoring to be good. The most important part, I think, is consecutive matching: the more characters that match directly after one another, the better. So "doc" is better than "dcm".

Then you'll likely want to give extra score for a match that's at the start of a part. So you get more points for "doc" than "ocu".

In my case I also give more points for matching the end of a part.

And finally, you may want to consider giving extra points for matching the last part(s). This makes a match on the file name/extension score higher than one on the folders leading up to it.
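Here is a much-simplified sketch of that scoring. It matches query characters greedily and in order only (unlike the real implementation, which also handles out-of-order parts), and the weights are placeholders to tune:

    def fuzzy_score(query, parts):
        """Score query characters against pre-split, lowercased parts.

        +2 for a character directly following the previous match,
        +3 for a match on the first character of a part,
        +1 for any other matched character.
        Returns None when the query cannot be fully matched.
        """
        chars = [(ch, pi, ci)
                 for pi, part in enumerate(parts)
                 for ci, ch in enumerate(part)]
        score, pos, prev = 0, 0, None
        for qch in query.lower():
            # Greedy left-to-right scan for the next occurrence of qch.
            while pos < len(chars) and chars[pos][0] != qch:
                pos += 1
            if pos == len(chars):
                return None                  # query character not found
            _, pi, ci = chars[pos]
            if prev == (pi, ci - 1):
                score += 2                   # consecutive-match bonus
            elif ci == 0:
                score += 3                   # start-of-part bonus
            else:
                score += 1
            prev, pos = (pi, ci), pos + 1
        return score

    parts = ["c", "documents", "lol", "txt"]
    print(fuzzy_score("doc", parts))  # 7: starts a part, then two consecutive hits
    print(fuzzy_score("dcm", parts))  # 5: scattered matches score lower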

Interim answered 9/2, 2018 at 8:46 Comment(0)

A simple algorithm for "a kind of fuzzy search"

To be honest, in some cases fuzzy search is mostly useless, and I think that a simpler algorithm can improve the search results while still giving the feeling of a fuzzy search.

Here is my use case: Filtering down a list of countries using "Fuzzy search".

The list I was working with had two countries starting with Z: Zambia and Zimbabwe.

I was using Fusejs.

In this case, when entering the needle "zam", the result set had 19 matches, with the most relevant one for any human (Zambia) at the bottom of the list. Most of the other countries in the results did not even have the letter z in their name.

This was for a mobile app where you can pick a country from a list. It was supposed to be much like when you have to pick a contact from the phone's contacts. You can filter the contact list by entering some term in the search box.

IMHO, searching over this kind of limited content should not be handled in a way that leaves people asking "what the heck?!?".

One might suggest sorting by most relevant match. But that's out of the question in this case, because the user would then always have to visually hunt for the "item of interest" in the reduced list. Keep in mind that this is supposed to be a filtering tool, not a search engine "à la Google", so the result should be sorted in a predictable way. And since the sorting was alphabetical before filtering, the filtered list should just be an alphabetically sorted subset of the original list.

So I came up with the following algorithm ...

  1. Grab the needle ... in this case: zam
  2. Insert the .* pattern at the beginning and end of the needle
  3. Insert the .* pattern between each letter of the needle
  4. Perform a Regex search in the haystack using the new needle which is now .*z.*a.*m.*

In this case, the user gets a much more expected result: everything that has the letters z, a and m appearing somewhere in its name, in that order. All the letters in the needle will be present in each match, in the same order.

This will also match country names like Mozambique ... which is perfect.
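The whole thing fits in a few lines (the country list here is just an illustrative subset):

    import re

    countries = ["Mauritania", "Mozambique", "Zambia", "Zimbabwe"]

    def fuzzy_filter(needle, haystack):
        """Build the .*z.*a.*m.* pattern and keep matches in the original order."""
        pattern = ".*" + ".*".join(map(re.escape, needle.lower())) + ".*"
        return [item for item in haystack if re.search(pattern, item.lower())]

    print(fuzzy_filter("zam", countries))  # ['Mozambique', 'Zambia']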

I just think that sometimes, we should not try to kill a fly with a bazooka.

Glinys answered 10/9, 2018 at 15:53 Comment(1)
There are also a few regex engines that can do fuzzy matching. – Recess

You're confusing fuzzy search algorithms with implementation: a fuzzy search of a word may return 400 results, all the words within a Levenshtein distance of, say, 2. But you only have to display the top 5-10 to the user.

Implementation-wise, you'll pre-process all the words in the dictionary and save the results into a DB. The popular words (and their fuzzy likes) will be saved into a cache layer, so you won't have to hit the DB for every request.

You may also add an AI layer that learns the most common spelling mistakes and adds them to the DB. And so on.
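A toy sketch of the pre-compute-then-cache idea, with an in-memory SQLite table and a three-word list standing in for the real DB and dictionary:

    import sqlite3
    from functools import lru_cache

    def levenshtein(a, b):
        """Plain dynamic-programming edit distance."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1, cur[-1] + 1, prev[j - 1] + (ca != cb)))
            prev = cur
        return prev[-1]

    words = ["smith", "smythe", "smote"]

    # One-time pre-processing: persist each word's fuzzy neighbours.
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE fuzzy (word TEXT, candidate TEXT)")
    db.executemany("INSERT INTO fuzzy VALUES (?, ?)",
                   [(w, c) for w in words for c in words if levenshtein(w, c) <= 2])

    @lru_cache(maxsize=1024)  # cache layer so popular queries skip the DB
    def fuzzy_lookup(word):
        rows = db.execute("SELECT candidate FROM fuzzy WHERE word = ?", (word,))
        return tuple(r[0] for r in rows)

    print(fuzzy_lookup("smith"))  # ('smith', 'smythe', 'smote')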

Maddy answered 1/9, 2015 at 17:8 Comment(0)

The problem can be broken down into two parts:

1) Choosing the correct string metric.

2) Coming up with a fast implementation of the same.

Choosing the correct metric: This part largely depends on your use case. However, I would suggest using a combination of a distance-based score and a phonetic encoding for greater accuracy, i.e. initially computing a score based on Levenshtein distance and later using Metaphone or Double Metaphone to complement the results.

Again, you should base your decision on your use case. If you can do with using just the Metaphone or Double Metaphone algorithms, then you needn't worry much about the computational cost.

Implementation: One way to cut down the computational cost is to cluster your data into several small groups based on your use case and load them into a dictionary.

For example, if you can assume that your user enters the first letter of the name correctly, you can store the names keyed on this invariant in a dictionary.

So, if the user enters the name "National School", you need to compute the fuzzy matching score only for school names starting with the letter "N".
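A minimal sketch of this bucketing (the school names are made up, and difflib's ratio stands in for whichever metric you choose):

    import difflib
    from collections import defaultdict

    schools = ["National School", "Northgate Academy", "Smith College"]

    # One-time clustering: bucket the names by their first letter.
    buckets = defaultdict(list)
    for name in schools:
        buckets[name[0].lower()].append(name)

    def best_match(query):
        """Score only the bucket that shares the query's first letter."""
        bucket = buckets.get(query[0].lower(), [])
        scored = [(difflib.SequenceMatcher(None, query.lower(), n.lower()).ratio(), n)
                  for n in bucket]
        return max(scored)[1] if scored else None

    print(best_match("Nationel School"))  # 'National School'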

Footlocker answered 13/4, 2021 at 7:13 Comment(0)
