Abbreviation similarity between strings
Asked Answered
L

2

7

I have a use case in my project where I need to compare a key-string with a lot many strings for similarity. If this value is greater than a certain threshold, I consider those strings "similar" to my key and based on that list, I do some further calculations / processing.

I have been exploring fuzzy matching string similarity stuff, which use edit distance based algorithms like "levenshtein, jaro and jaro-winkler" similarities.

Although they work fine, I want to have a higher similarity score if one string is "abbreviation" of another. Is there any algorithm/ implementation I can use for this.

Note:

language: python3 
packages explored: fuzzywuzzy, jaro-winkler

Example:

using jaro_winkler similarity:

>>> jaro.jaro_winkler_metric("wtw", "willis tower watson")
0.7473684210526316
>>> jaro.jaro_winkler_metric("wtw", "willistowerwatson")
0.7529411764705883

using levenshtein similarity:

>>> fuzz.ratio("wtw", "willis tower watson")
27
>>> fuzz.ratio("wtw", "willistowerwatson")
30
>>> fuzz.partial_ratio("wtw", "willistowerwatson")
67
>>> fuzz.QRatio("wtw", "willistowerwatson")
30

In these kind of cases, I want score to be higher (>90%) if possible. I'm ok with few false positives as well, as they won't cause too much issue with my further calculations. But if we match s1 and s2 such that s1 is fully contained in s2 (or vice versa), their similarity score should be much higher.

Edit: Further Examples for my Use-Case

For me, spaces are redundant. That means, wtw is considered abbreviation for "willistowerwatson" and "willis tower watson" alike.

Also, stove is a valid abbreviation for "STack OVErflow" or "STandardOVErview"

A simple algo would be to start with 1st char of smaller string and see if it is present in the larger one. Then check for 2nd char and so on until the condition satisfies that 1st string is fully contained in 2nd string. This is a 100% match for me.

Further examples like wtwx to "willistowerwatson" could give a score of, say 80% (this can be based on some edit distance logic). Even if I can find a package which gives either True or False for abbreviation similarity would also be helpful.

Lelialelith answered 27/6, 2022 at 2:12 Comment(1)
Check the following answer at StackOverflow: #18872206Discover
J
0

You can use a recursive algorithm, similar to sequence alignment. Just don't give penalty for shifts (as they are expected in abbreviations) but give one for mismatch in first characters.

This one should work, for example:

def abbreviation(abr,word,penalty=1):
    if len(abr)==0:
        return 0
    elif len(word)==0:
        return penalty*len(abr)*-1
    elif abr[0] == word[0]:
        if len(abr)>1:
            return 1 + max(abbreviation(abr[1:],word[1:]),
                           abbreviation(abr[2:],word[1:])-penalty)
        else:
            return 1 + abbreviation(abr[1:],word[1:])
    else:
        return abbreviation(abr,word[1:])

def compute_match(abbr,word,penalty=1):
    score = abbreviation(abbr.lower(),
                         word.lower(),
                         penalty)
    if abbr[0].lower() != word[0].lower(): score-=penalty
    
    score = score/len(abbr)

    return score


print(compute_match("wtw", "willis tower watson"))
print(compute_match("wtwo", "willis tower watson"))
print(compute_match("stove", "Stackoverflow"))
print(compute_match("tov", "Stackoverflow"))
print(compute_match("wtwx", "willis tower watson"))

The output is:

1.0
1.0
1.0
0.6666666666666666
0.5

Indicating that wtw and wtwo are perfectly valid abbreviations for willistowerwatson, that stove is a valid abbreviation of Stackoverflow but not tov, which has the wrong first character. And wtwx is only partially valid abbreviation for willistowerwatson beacuse it ends with a character that does not occur in the full name.

Josh answered 5/7, 2022 at 14:19 Comment(1)
Hi @Martin, this is a great answer. I will be awarding the bounty to this as its the last day as well. This probably solves my use case behaviourally, I'll just have to check on its performance in the large dataset that I have. However, as for the output scores, this is probably exactly what I wanted. Thanks :)Lelialelith
T
2

To detect abbrevioations in string, you can still using fuzzywuzzy module with the process() function:

from fuzzywuzzy import fuzz, process

s1 = ["willis tower watson", "stack overflow", "willistowerwatson", "international business machines"]
s2 = ['wtw', "so", "wtw", "ibz"]

queries = [''.join([i[0] for i in j.split()]) for j in s1]

for query, company in zip(queries, s1):
    print(company, '-', process.extractOne(query, s2, scorer=fuzz.partial_token_sort_ratio))

Output:

willis tower watson - ('wtw', 100)
stack overflow - ('so', 100)
willistowerwatson - ('wtw', 100)
international business machines - ('ibz', 67)
Tabret answered 29/6, 2022 at 7:23 Comment(6)
For me, fuzz.partial_token_sort_ratio("wtw", "willistowerwatson") gives score of 67 and fuzz.partial_token_sort_ratio("wtw", "willis tower watson") gives 33. What is it that we are doing different here?Lelialelith
@Lelialelith You need a different way of processing the string to detect the abbreviation, depending on if it has spaces or not.Tabret
Spaces are redundant here. I would consider "wtw" to be abbreviation of "willistowerwatson" and "willis tower watson" alike. Also, in your example, 3rd record matches "wtw" with "willistowerwatson" with score of 100?!!Lelialelith
@Lelialelith Yes, but how would you make a set of steps to detect "wtw" to be abbreviation of "willistowerwatson" and "willis tower watson" at the same time?Tabret
A simple algo would be to start with 1st char of smaller string and see if it is there in larger one. Then check for 2nd char and so on until I'm able to satisfy the condition that 1st string is fully contained in 2nd string. This is a 100% match for me.Lelialelith
Another example to consider could be: stove is a valid abbreviation for "stack overflow". I hope this example would clear my use case furtherLelialelith
J
0

You can use a recursive algorithm, similar to sequence alignment. Just don't give penalty for shifts (as they are expected in abbreviations) but give one for mismatch in first characters.

This one should work, for example:

def abbreviation(abr,word,penalty=1):
    if len(abr)==0:
        return 0
    elif len(word)==0:
        return penalty*len(abr)*-1
    elif abr[0] == word[0]:
        if len(abr)>1:
            return 1 + max(abbreviation(abr[1:],word[1:]),
                           abbreviation(abr[2:],word[1:])-penalty)
        else:
            return 1 + abbreviation(abr[1:],word[1:])
    else:
        return abbreviation(abr,word[1:])

def compute_match(abbr,word,penalty=1):
    score = abbreviation(abbr.lower(),
                         word.lower(),
                         penalty)
    if abbr[0].lower() != word[0].lower(): score-=penalty
    
    score = score/len(abbr)

    return score


print(compute_match("wtw", "willis tower watson"))
print(compute_match("wtwo", "willis tower watson"))
print(compute_match("stove", "Stackoverflow"))
print(compute_match("tov", "Stackoverflow"))
print(compute_match("wtwx", "willis tower watson"))

The output is:

1.0
1.0
1.0
0.6666666666666666
0.5

Indicating that wtw and wtwo are perfectly valid abbreviations for willistowerwatson, that stove is a valid abbreviation of Stackoverflow but not tov, which has the wrong first character. And wtwx is only partially valid abbreviation for willistowerwatson beacuse it ends with a character that does not occur in the full name.

Josh answered 5/7, 2022 at 14:19 Comment(1)
Hi @Martin, this is a great answer. I will be awarding the bounty to this as its the last day as well. This probably solves my use case behaviourally, I'll just have to check on its performance in the large dataset that I have. However, as for the output scores, this is probably exactly what I wanted. Thanks :)Lelialelith

© 2022 - 2024 — McMap. All rights reserved.