Python group similar records (strings) in dataset

I have an input table like this:

In [182]: data_set
Out[182]: 
       name             ID
0  stackoverflow       123      
1  stikoverflow        322      
2  stack, overflow     411      
3  internet.com        531      
4  internet            112      
5  football            001

I want to group similar strings using fuzzywuzzy: after fuzzy matching, all strings whose similarity exceeds some threshold (say, more than 90%) should end up in the same group. The desired output would be:

In [182]: output
Out[182]: 
       name             ID     group
0  stackoverflow       123       1
1  stikoverflow        322       1
2  stack, overflow     411       1
3  internet.com        531       2
4  internet            112       2
5  football            001       3

I searched through related topics and found this and this, which only do name matching, not clustering. This one returns only the best match, which doesn't help me. This page explains k-means clustering, but that requires setting the number of clusters beforehand, which is not practical in this case.

UPDATE:

I figured out that the process module in the fuzzywuzzy package handles my problem to some extent. But its extract function only compares one string against a list, not a list against a list:

from fuzzywuzzy import process

with open("data-set.txt", "r") as f:
    data = f.read().split("\n")

process.extract("stackoverflow", data, limit=3)

Output:

[('stackoverflow', 100), ('stack, overflow', 93), ('stikoverflow', 88)]

But I still don't know how to use it for clustering.

Bonheur answered 18/6, 2018 at 22:47 Comment(6)
This is not a clustering problem. It's more closely related to spelling correction. To an unsupervised approach, dog and fog are very close. Doggy and foggy are also close. But dog and doggy are much more different. So don't use anything unsupervised! - Morissa
I believe at some point we could consider it a clustering problem, since we have a similarity function and similar strings are grouped together based on some threshold, correct? - Bonheur
The example I gave is a counterexample to this hypothesis. You don't have a good enough similarity function. Use something supervised. - Morissa
I guess whatever function I use, I would have some false positives anyhow. I'd like to apply it to millions of records. An end user will evaluate it at the end. I just want to give them a group of similar records from which they select what they want. What similarity function do you recommend? - Bonheur
None. There is none that I know of that would work. People will likely suggest Levenshtein, but that one really won't work well. Just try it yourself. - Morissa
So fuzzywuzzy is based on Levenshtein, right? - Bonheur
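
To make the dog/fog example from the comments concrete, here is a minimal sketch of plain Levenshtein (edit) distance. It is written from scratch for illustration only and is not fuzzywuzzy's implementation; the function name `levenshtein` is a hypothetical helper.

```python
def levenshtein(a, b):
    """Edit distance via the classic dynamic-programming recurrence."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


for pair in [("dog", "fog"), ("doggy", "foggy"), ("dog", "doggy")]:
    print(pair, levenshtein(*pair))
```

Under this measure, dog and fog are one edit apart while dog and doggy are two, so a threshold on edit distance alone would merge the unrelated pair before the related one.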

This can be accomplished using string-grouper:

    from string_grouper import group_similar_strings
    group_similar_strings(data_set['name'])

string-grouper
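
For readers who want to see the threshold-based grouping from the question spelled out without an extra package, the following is a minimal sketch, not how string-grouper works internally. It assumes a pairwise similarity score and merges every pair at or above a threshold using union-find, so groups are the connected components of the "similar enough" graph. The standard library's `difflib.SequenceMatcher` is swapped in here as the scorer so the block runs on its own; `fuzz.ratio` from fuzzywuzzy (scaled to 0-100) could be substituted. The names `similarity` and `group_similar` are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    """Similarity in [0, 1]; a stand-in for a fuzzywuzzy scorer."""
    return SequenceMatcher(None, a, b).ratio()


def group_similar(names, threshold=0.8):
    """Return one group label per name; labels start at 1."""
    parent = list(range(len(names)))  # union-find forest

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Link every pair whose similarity reaches the threshold.
    for i, j in combinations(range(len(names)), 2):
        if similarity(names[i], names[j]) >= threshold:
            ri, rj = find(i), find(j)
            if ri != rj:
                parent[max(ri, rj)] = min(ri, rj)

    # Number the groups in order of first appearance.
    labels = {}
    return [labels.setdefault(find(i), len(labels) + 1)
            for i in range(len(names))]


names = ["stackoverflow", "stikoverflow", "stack, overflow",
         "internet.com", "internet", "football"]
print(group_similar(names, threshold=0.75))
```

The result could be attached to the frame with something like `data_set['group'] = group_similar(list(data_set['name']), threshold=0.75)`. The threshold needs tuning: the question's own output scores "stikoverflow" at 88 against "stackoverflow", so a strict 90 cutoff would not reproduce the desired groups. Comparing all pairs is also quadratic in the number of records, so for millions of rows some form of pre-filtering would be needed.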

Oller answered 20/2, 2021 at 19:28 Comment(0)
