How to group words whose Levenshtein distance is more than 80 percent in Python
Asked Answered
P

1

11

Suppose I have a list:-

person_name = ['zakesh', 'oldman LLC', 'bikash', 'goldman LLC', 'zikash','rakesh']

I am trying to group the list in such a way so the Levenshtein distance between two strings is maximum. For finding out the ratio between two words, I am using a python package fuzzywuzzy.

Examples :-

>>> from fuzzywuzzy import fuzz
>>> combined_list = ['rakesh', 'zakesh', 'bikash', 'zikash', 'goldman LLC', 'oldman LLC']
>>> fuzz.ratio('goldman LLC', 'oldman LLC')
95
>>> fuzz.ratio('rakesh', 'zakesh')
83
>>> fuzz.ratio('bikash', 'zikash')
83
>>> 

My end goal:

My end goal is to group the words such that Levenshtein distance between them is more than 80 percent?

My list should look something like this :-

person_name = ['bikash', 'zikash', 'rakesh', 'zakesh', 'goldman LLC', 'oldman LLC'] because the distance between `bikash` and `zikash` is very high so they should be together.

Code:

I am trying to achieve this by sorting but key function should be fuzz.ratio. Well below code is not working, but I am approaching the problem in this angle.

from fuzzywuzzy import fuzz
combined_list = ['rakesh', 'zakesh', 'bikash', 'zikash', 'goldman LLC', 'oldman LLC']
combined_list.sort(key=lambda x, y: fuzz.ratio(x, y))
print combined_list

Could anyone help me to combine the words so that Levenshtein distance between them is more than 80 percent?

Paeon answered 3/2, 2016 at 8:15 Comment(11)
"combine", "similar", "very high" and "something like this" are all not well defined. You have to clarify your specifications.Conjunctiva
I have edited my question. Group all words such that the Levenshtein distance between them is more than 80%Paeon
And if that's not possible for a given list of words, what should then happen?Conjunctiva
Then just do simple sortingPaeon
I'm confused. I see an optimization problem, i.e. "split this set of words into clusters with maximum internal distance", yet you continue to talk about "sorting", which is "putting the list into order". I'm not sure how those two mix. Do you only want the distance between two successive strings to be maximum?Kong
In python 2 you can pass a comparison function cmp to sort (and related functions). That has been removed from Python 3, but there is a workaround: see cmp_to_key.Ogdoad
I am trying to group all words such that the Levenshtein distance between them is more than 80 percent? I was initially thinking about sorting, but if you could suggest any other approach I can try thatPaeon
As for the clustering problem (if it is one): Sounds like maximum clique. The nodes are the words and you have an edge between words, if the edit distance is >= 80% between those words. The rest is NP-complete, take an approximation algorithm of your choosing, networkx implements some.Kong
I have deleted my previous comment! I don't think cmp_to_key is working for me. Could anyone suggest some approaches :/ :/Paeon
@Kong Can we use the group by function in python to achieve the result ?Paeon
@Paeon For group by, you need a function that determines the group association of a word based on the word itself. Problem here is: If you can add a word to a group depends on all the elements that are already in the group. It's a combinatorial optimization problem and as far as I can tell, maximum clique fits the bill.Kong
A
14

This groups the names

from fuzzywuzzy import fuzz

combined_list = ['rakesh', 'zakesh', 'bikash', 'zikash', 'goldman LLC', 'oldman LLC']
combined_list.append('bakesh')
print('input names:', combined_list)

grs = list() # groups of names with distance > 80
for name in combined_list:
    for g in grs:
        if all(fuzz.ratio(name, w) > 80 for w in g):
            g.append(name)
            break
    else:
        grs.append([name, ])

print('output groups:', grs)
outlist = [el for g in grs for el in g]
print('output list:', outlist)

producing

input names: ['rakesh', 'zakesh', 'bikash', 'zikash', 'goldman LLC', 'oldman LLC', 'bakesh']
output groups: [['rakesh', 'zakesh', 'bakesh'], ['bikash', 'zikash'], ['goldman LLC', 'oldman LLC']]
output list: ['rakesh', 'zakesh', 'bakesh', 'bikash', 'zikash', 'goldman LLC', 'oldman LLC']

As you can see, the names are grouped correctly, but the order may not be the one you desire.

Ameliaamelie answered 3/2, 2016 at 9:34 Comment(4)
Isn't this only one (greedy) approach and the resulting groups may not be optimal?Kong
@Kong yes of course. I never said it's optimal nor the only one. Please do provide a better/alternative solution.Ameliaamelie
I just wanted to mention it, but I'll try to give an optimal (maximum groups) when I get around. Nonetheless, your solution is of course correct and returns valid clusters.Kong
This is essentially clustering with a single cluster, and incrementally adding elements. Note that it's order-dependent; as we add more elements to the cluster (and they become spread out), it becomes harder to satisfy the 'add' condtions all(fuzz.ratio(name, w) > 80 for w in g). So if you shuffled combined_list before running, you could get different results.Epigenesis

© 2022 - 2024 — McMap. All rights reserved.