Python: group similar records (strings) in a dataset

I have an input table like this:

In [182]: data_set
Out[182]: 
       name             ID
0  stackoverflow       123      
1  stikoverflow        322      
2  stack, overflow     411      
3  internet.com        531      
4  internet            112      
5  football            001

I want to group similar strings using fuzzywuzzy: after fuzzy matching, all strings whose similarity exceeds some threshold (for example, more than 90%) should end up in the same group. The desired output is:

In [182]: output
Out[182]: 
       name             ID     group
0  stackoverflow       123       1
1  stikoverflow        322       1
2  stack, overflow     411       1
3  internet.com        531       2
4  internet            112       2
5  football            001       3

I searched through related topics and found this and this, which only do name matching and not clustering. This one returns only the best match, which doesn't help me. This page explains k-means clustering, but that requires the number of clusters to be set beforehand, which is not practical in this case.

UPDATE:

I figured out that the process module in the fuzzywuzzy package handles my problem to some extent. But its extract function only compares one string against a list, not a list against a list:

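from fuzzywuzzy import process

# Read one name per line from the data file.
with open("data-set.txt", "r") as f:
    data = f.read().splitlines()

# Compare a single query string against every entry in the list.
process.extract("stackoverflow", data, limit=3)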

Output:

[('stackoverflow', 100), ('stack, overflow', 93), ('stikoverflow', 88)]
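
For a list-to-list comparison, one option might be to score every pair of names once. The following is only a sketch: it uses `difflib.SequenceMatcher` from the standard library as a stand-in for a fuzzywuzzy scorer (an assumption, so scores may differ from what `process.extract` returns), and the `names` list and `similarity` helper are hypothetical names chosen for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Sample names copied from the input table above.
names = ["stackoverflow", "stikoverflow", "stack, overflow",
         "internet.com", "internet", "football"]

def similarity(a, b):
    """Return a 0-100 similarity score (stdlib stand-in, not fuzzywuzzy)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())

# Score every unordered pair once, instead of one string against the list.
pair_scores = {(a, b): similarity(a, b) for a, b in combinations(names, 2)}

# Show the three highest-scoring pairs.
for (a, b), score in sorted(pair_scores.items(), key=lambda kv: -kv[1])[:3]:
    print(a, "|", b, "|", score)
```

This is quadratic in the number of names, which is fine for a small table but may be slow for a large one.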

But I still don't know how to use it for clustering.
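
One direction that might work without fixing the number of clusters in advance is to link every pair of names whose score reaches the threshold and treat each connected set as a group. Below is a minimal sketch of that idea using a union-find structure. It assumes `difflib.SequenceMatcher` from the standard library as a stand-in for a fuzzywuzzy scorer, and `similarity` and `group_similar` are hypothetical helper names, not part of any library.

```python
from difflib import SequenceMatcher
from itertools import combinations

def similarity(a, b):
    """Return a 0-100 similarity score (stdlib stand-in, not fuzzywuzzy)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())

def group_similar(names, threshold):
    """Label names so that any pair scoring >= threshold shares a group."""
    parent = list(range(len(names)))  # union-find: each name starts alone

    def find(i):
        # Follow parent links to the root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Merge the groups of every pair that is similar enough.
    for i, j in combinations(range(len(names)), 2):
        if similarity(names[i], names[j]) >= threshold:
            ri, rj = find(i), find(j)
            if ri != rj:
                parent[max(ri, rj)] = min(ri, rj)

    # Number the groups 1, 2, 3, ... in order of first appearance.
    labels = {}
    return [labels.setdefault(find(i), len(labels) + 1)
            for i in range(len(names))]

names = ["stackoverflow", "stikoverflow", "stack, overflow",
         "internet.com", "internet", "football"]
groups = group_similar(names, threshold=80)
print(groups)
```

The returned list has one label per input row, so it could be assigned as the `group` column. With this stand-in scorer the sample names need a threshold of about 80 to land in the desired groups; a different scorer would need a different value. Because groups are formed by chaining, two names can share a group through an intermediate name even if they are not directly similar to each other.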
