How does clustering (especially String clustering) work?
I've heard about clustering as a way to group similar data. I want to know how it works in the specific case of strings.

I have a table with more than 100,000 distinct words.

I want to identify variants of the same word with small differences (e.g. house, house!!, hooouse, HoUse, @house, "house", etc.).

What is needed to measure the similarity between words and group them into clusters? Which algorithm is recommended for this?

Zashin answered 19/11, 2011 at 18:48 Comment(0)
To understand what clustering is, imagine a geographical map. You can see many distinct objects (such as houses). Some of them are close to each other, while others are far apart. Based on this, you can split all the objects into groups (such as cities). Clustering algorithms do exactly this: they let you split your data into groups without specifying the group borders beforehand.

All clustering algorithms are based on the distance (or likelihood) between two objects. On a geographical map it is the ordinary distance between two houses; in a multidimensional space it may be the Euclidean distance (in fact, the distance between two houses on a map is also a Euclidean distance). For string comparison you have to use something different. Two good choices here are the Hamming and Levenshtein distances. In your particular case the Levenshtein distance is preferable (the Hamming distance only works with strings of the same length).
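
To make this concrete, here is a minimal pure-Python sketch of the Levenshtein distance (in practice you would use an optimized library such as python-Levenshtein, but the dynamic-programming idea is the same):

    def levenshtein(a: str, b: str) -> int:
        # Classic dynamic programming: after processing row i, prev[j] holds
        # the edit distance between the first i chars of a and first j of b.
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, start=1):
            curr = [i]
            for j, cb in enumerate(b, start=1):
                cost = 0 if ca == cb else 1
                curr.append(min(prev[j] + 1,          # deletion
                                curr[j - 1] + 1,      # insertion
                                prev[j - 1] + cost))  # substitution
            prev = curr
        return prev[-1]

    print(levenshtein("house", "hooouse"))  # 2 (two insertions)
    print(levenshtein("house", "HoUse"))    # 2 (case-sensitive substitutions)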

Now you can use one of the existing clustering algorithms. There are plenty of them, but not all of them fit your needs. For example, pure k-means, already mentioned here, will hardly help you, since it requires the number of clusters up front; with a large dictionary of strings it may be 100, 200, 500, 10000 - you simply don't know it in advance. So other algorithms may be more appropriate.

One of them is the expectation-maximization (EM) algorithm. Its advantage is that it can find the number of clusters automatically. However, in practice it often gives less precise results than other algorithms, so it is common to use k-means on top of EM: first find the number of clusters and their centers with EM, and then use k-means to refine the result.

Another branch of algorithms that may be suitable for your task is hierarchical clustering. Here the result of the cluster analysis is not a set of independent groups but a tree (hierarchy), in which several smaller clusters are grouped into a bigger one, and all clusters are ultimately part of one big cluster. In your case it means that all words are similar to each other to some degree.
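
As a sketch of how this could look with SciPy, reusing the `levenshtein` function from above (the word list and the cut threshold of 2 are just illustrative choices):

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster
    from scipy.spatial.distance import squareform

    words = ["house", "house!!", "hooouse", "HoUse", "mouse", "mice"]

    # Pairwise Levenshtein distance matrix: O(n^2) comparisons,
    # which is the expensive part for a 100,000-word dictionary.
    n = len(words)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            dist[i, j] = dist[j, i] = levenshtein(words[i], words[j])

    # Build an average-linkage hierarchy, then cut the tree at
    # distance 2 to obtain flat clusters.
    Z = linkage(squareform(dist), method="average")
    labels = fcluster(Z, t=2, criterion="distance")
    print(dict(zip(words, labels)))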

Boggers answered 19/11, 2011 at 21:14 Comment(7)
Very nice explanation, thank you so much. But I have one doubt now: must the distance be calculated from each string to every other one? For example, if I have 100 words, do I compare each word with the other 99? Or does this change according to the algorithm (for example, comparing only with the cluster centers)?Zashin
Yes, it depends on the algorithm, but in general most of them compare elements to each other many times. Clustering algorithms are computationally hard in themselves (e.g. k-means is an NP-hard task), but many of them have heuristic improvements that make them much faster in practice. See the docs for the particular algorithm you are interested in.Boggers
k-means on top of EM? Never heard of that. The advice given by e.g. Bishop ("Pattern Recognition and Machine Learning", Springer 2006) is the exact opposite: EM is better but slow to start, so bootstrap it with a few rounds of k-means optimization.Sitzmark
Also, suggesting EM or k-means in conjunction with string edit distances makes no sense. k-means needs not only a distance metric but also a well-defined mean of a set of samples, which is impossible to define for strings under edit distance.Sitzmark
@larsmans: EM on top of k-means gives faster convergence and better protection against local minima; k-means on top of EM gives automatic discovery of the number of classes plus all the advantages of k-means. I don't see a contradiction here. EM is better than k-means? Sorry, but without a concrete task and dataset such a statement makes no sense to me. Anyway, I didn't mean that k-means will definitely work better for string clustering; what I wanted to stress is that one can easily combine both, and Bishop's advice confirms it.Boggers
EM is better in the sense that it produces probability distributions rather than hard class assignments (although the sample-to-nearest-center distance can be exploited to produce soft assignments after the fact).Sitzmark
Also, k-means requires a way to find the class center, i.e. an element with the smallest average distance to all other elements. The mean is a good way to find it for, say, spatial data, but not the only way. Ultimately, you can always directly estimate the average distance from each element in a cluster to all the others. Regarding your last comment: I'm pretty sure EM has a number of advantages over k-means, just as k-means has many advantages over EM. Probability is not the only valuable thing. E.g. Naive Bayes produces probabilities, but SVM is still more popular nowadays ;)Boggers
There is an R package called stringdist that allows for string comparison using several different methods. Copying from that page:

  • Hamming distance: Number of positions at which the symbols differ in the two strings. Only defined for strings of equal length.
  • Levenshtein distance: Minimal number of insertions, deletions and replacements needed for transforming string a into string b.
  • (Full) Damerau-Levenshtein distance: Like Levenshtein distance, but transposition of adjacent symbols is allowed.
  • Optimal String Alignment / restricted Damerau-Levenshtein distance: Like (full) Damerau-Levenshtein distance but each substring may only be edited once.
  • Longest Common Substring distance: Minimum number of symbols that have to be removed in both strings until resulting substrings are identical.
  • q-gram distance: Sum of absolute differences between N-gram vectors of both strings.
  • Cosine distance: 1 minus the cosine similarity of both N-gram vectors.
  • Jaccard distance: 1 minus the quotient of shared N-grams and all observed N-grams.
  • Jaro distance: The Jaro distance is a formula of 4 values and effectively a special case of the Jaro-Winkler distance with p = 0.
  • Jaro-Winkler distance: This distance is a formula of 5 parameters determined by the two compared strings (A,B,m,t,l) and p chosen from [0, 0.25].

That will give you the distance. You might not need to perform a full cluster analysis; perhaps sorting by the string distance itself is sufficient. I have created a script to provide the basic functionality here... feel free to improve it as needed.
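
stringdist is an R package, but the n-gram-based metrics in the list above are easy to sketch in Python; here is an illustrative q-gram distance with q = 2 (bigrams):

    from collections import Counter

    def qgrams(s: str, q: int = 2) -> Counter:
        # Multiset of all length-q substrings of s.
        return Counter(s[i:i + q] for i in range(len(s) - q + 1))

    def qgram_distance(a: str, b: str, q: int = 2) -> int:
        # Sum of absolute differences between the two q-gram count vectors.
        ga, gb = qgrams(a, q), qgrams(b, q)
        return sum(abs(ga[g] - gb[g]) for g in set(ga) | set(gb))

    print(qgram_distance("house", "hooouse"))  # 2 (the two extra "oo" bigrams)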

Bruit answered 30/11, 2015 at 15:48 Comment(0)
You can use an algorithm like the Levenshtein distance for the distance calculation and k-means for clustering.

The Levenshtein distance is a string metric for measuring the amount of difference between two sequences.

Do some testing to find a similarity threshold that decides your groups, as in the sketch below.
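
A minimal sketch of such threshold-based grouping, assuming the python-Levenshtein library and an illustrative threshold of 2:

    from Levenshtein import distance as levenshtein  # pip install python-Levenshtein

    def group_by_threshold(words, threshold=2):
        # Greedy grouping: put each word into the first group whose
        # representative is within the edit-distance threshold,
        # otherwise start a new group.
        groups = []  # groups[i][0] is the representative of group i
        for w in words:
            for g in groups:
                if levenshtein(w.lower(), g[0].lower()) <= threshold:
                    g.append(w)
                    break
            else:
                groups.append([w])
        return groups

    # Note: a fixed threshold can merge unrelated words
    # (e.g. "mouse" is within distance 1 of "house").
    print(group_by_threshold(["house", "house!!", "hooouse", "HoUse", "mouse"]))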

Dislocation answered 19/11, 2011 at 18:52 Comment(5)
Which algorithm is recommended for string clustering?Zashin
What do you mean by recommended?Dislocation
There are several clustering algorithms, right? Given the house example in the question, which algorithm would be most adequate to get that type of result? I want to put all those words inside the same cluster.Zashin
You could use k-means for clustering, using the Levenshtein distance for distance calculation.Dislocation
And how would you compute the means?U
You can use a clustering algorithm called "Affinity Propagation". This algorithm takes as input a similarity matrix, which you can generate by taking the negative of the Levenshtein distance, or a harmonic mean of partial_ratio and token_set_ratio from the fuzzywuzzy library if you are using Python.
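
A minimal sketch with scikit-learn's AffinityPropagation, using negative Levenshtein distances as the precomputed similarity (the word list is just an example):

    import numpy as np
    from Levenshtein import distance  # pip install python-Levenshtein
    from sklearn.cluster import AffinityPropagation

    words = np.array(["house", "house!!", "hooouse", "HoUse", "mouse", "mice"])

    # Similarity matrix: negative edit distance, so higher means more alike.
    sim = -np.array([[distance(a, b) for b in words] for a in words])

    ap = AffinityPropagation(affinity="precomputed", random_state=0).fit(sim)
    for label in np.unique(ap.labels_):
        print(label, list(words[ap.labels_ == label]))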

Hemp answered 4/8, 2022 at 17:36 Comment(0)
