I have a use case in my project where I need to compare a key
-string with a lot many strings for similarity. If this value is greater than a certain threshold, I consider those strings "similar" to my key
and based on that list, I do some further calculations / processing.
I have been exploring fuzzy matching string similarity stuff, which use edit distance
based algorithms like "levenshtein, jaro and jaro-winkler" similarities.
Although they work fine, I want to have a higher similarity score if one string is "abbreviation" of another. Is there any algorithm/ implementation I can use for this.
Note:
language: python3
packages explored: fuzzywuzzy, jaro-winkler
Example:
using jaro_winkler similarity:
>>> jaro.jaro_winkler_metric("wtw", "willis tower watson")
0.7473684210526316
>>> jaro.jaro_winkler_metric("wtw", "willistowerwatson")
0.7529411764705883
using levenshtein similarity:
>>> fuzz.ratio("wtw", "willis tower watson")
27
>>> fuzz.ratio("wtw", "willistowerwatson")
30
>>> fuzz.partial_ratio("wtw", "willistowerwatson")
67
>>> fuzz.QRatio("wtw", "willistowerwatson")
30
In these kind of cases, I want score to be higher (>90%) if possible. I'm ok with few false positives as well, as they won't cause too much issue with my further calculations. But if we match s1 and s2 such that s1 is fully contained in s2 (or vice versa), their similarity score should be much higher.
Edit: Further Examples for my Use-Case
For me, spaces are redundant. That means, wtw
is considered abbreviation for "willistowerwatson" and "willis tower watson" alike.
Also, stove
is a valid abbreviation for "STack OVErflow" or "STandardOVErview"
A simple algo would be to start with 1st char of smaller string and see if it is present in the larger one. Then check for 2nd char and so on until the condition satisfies that 1st string is fully contained in 2nd string. This is a 100% match for me.
Further examples like wtwx
to "willistowerwatson" could give a score of, say 80% (this can be based on some edit distance logic). Even if I can find a package which gives either True
or False
for abbreviation similarity would also be helpful.