I have an input table like this:
In [182]: data_set
Out[182]:
              name   ID
0    stackoverflow  123
1     stikoverflow  322
2  stack, overflow  411
3     internet.com  531
4         internet  112
5         football  001
I want to group similar strings using fuzzywuzzy: after fuzzy matching, all strings whose similarity exceeds some threshold (for example, more than 90%) should end up in the same group. The desired output would be:
In [182]: output
Out[182]:
              name   ID  group
0    stackoverflow  123      1
1     stikoverflow  322      1
2  stack, overflow  411      1
3     internet.com  531      2
4         internet  112      2
5         football  001      3
I searched through different topics and found this and this, which only do name matching, not clustering. This one shows only the best match, which doesn't help me. This page explains k-means clustering, but that requires setting the number of clusters beforehand, which is not practical in this case.
UPDATE:
I figured out that the process module in the fuzzywuzzy package would handle my problem to some extent. But its extract function only compares one string against a list, not a list against a list:
from fuzzywuzzy import process

with open("data-set.txt", "r") as f:
    data = f.read().split("\n")

process.extract("stackoverflow", data, limit=3)
Output:
[('stackoverflow', 100), ('stack, overflow', 93), ('stikoverflow', 88)]
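For illustration, the list-to-list comparison could be sketched as a loop over every pair of names. This is only a sketch: the helper names `similarity` and `pairwise_scores` are hypothetical, and the standard library's `difflib.SequenceMatcher` is used as a stand-in scorer so the example runs without extra packages. With fuzzywuzzy installed, `fuzz.ratio` could be swapped in, though its scores may differ slightly.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    """Score two strings from 0 to 100 (difflib stand-in, not fuzzywuzzy itself)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def pairwise_scores(names):
    """Return {(i, j): score} for every unordered pair of positions i < j."""
    return {
        (i, j): similarity(names[i], names[j])
        for i, j in combinations(range(len(names)), 2)
    }


names = ["stackoverflow", "stikoverflow", "stack, overflow",
         "internet.com", "internet", "football"]
scores = pairwise_scores(names)

# Show the scores for a few pairs that are expected to be similar.
for i, j in [(0, 1), (0, 2), (3, 4)]:
    print(names[i], "|", names[j], "|", scores[(i, j)])
```

This compares every pair once, so the work grows quadratically with the number of names. That is fine for a small table but may need blocking or pre-filtering for a large one.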
But I still don't know how to use it for clustering.
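One possible direction, sketched under assumptions rather than offered as a definitive solution: treat every pair of names that scores at or above the threshold as linked, then give each connected set of names one group number. This does not need the number of clusters in advance. The function `group_similar` and its parameters are hypothetical, and `difflib.SequenceMatcher` again stands in for a fuzzywuzzy scorer; a function such as `fuzz.ratio` could be passed as `scorer` instead.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    """Score two strings from 0 to 100 (difflib stand-in, not fuzzywuzzy itself)."""
    return 100 * SequenceMatcher(None, a, b).ratio()


def group_similar(names, threshold, scorer=similarity):
    """Assign a 1-based group number to each name.

    Two names share a group when they are connected by a chain of pairs
    that each score at or above `threshold`.
    """
    parent = list(range(len(names)))  # union-find forest, one node per name

    def find(i):
        # Follow parent links to the root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(names)), 2):
        if scorer(names[i], names[j]) >= threshold:
            root_i, root_j = find(i), find(j)
            if root_i != root_j:
                # Keep the smaller index as root so numbering follows row order.
                parent[max(root_i, root_j)] = min(root_i, root_j)

    labels = {}  # root index -> group number, in order of first appearance
    return [labels.setdefault(find(i), len(labels) + 1) for i in range(len(names))]


names = ["stackoverflow", "stikoverflow", "stack, overflow",
         "internet.com", "internet", "football"]
groups = group_similar(names, threshold=75)
print(list(zip(names, groups)))
```

With this stand-in scorer, a threshold of 90 is too strict to reproduce the desired grouping on the sample data, so the example uses 75; the right value depends on the scorer chosen. Because linking is transitive, two names can land in the same group through an intermediate name even if they do not meet the threshold directly.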