Clustering words into groups
This is a Homework question. I have a huge document full of words. My challenge is to classify these words into different groups/clusters that adequately represent the words. My strategy to deal with it is using the K-Means algorithm, which as you know takes the following steps.

  1. Generate k random means for the entire group
  2. Create K clusters by associating each word with the nearest mean
  3. Compute centroid of each cluster, which becomes the new mean
  4. Repeat Step 2 and Step 3 until a certain benchmark/convergence has been reached.
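
The four steps above can be sketched in pure Python on plain numeric data (turning words into something numeric is a separate problem). The choice of `k`, the iteration cap, and the seed are illustrative, not prescribed:

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Lloyd's algorithm on 1-D numeric data: the four steps above."""
    rng = random.Random(seed)
    means = rng.sample(points, k)                 # step 1: k random means
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                          # step 2: assign to nearest mean
            nearest = min(range(k), key=lambda i: (p - means[i]) ** 2)
            clusters[nearest].append(p)
        new_means = [sum(c) / len(c) if c else means[i]   # step 3: centroid
                     for i, c in enumerate(clusters)]     # becomes the new mean
        if new_means == means:                    # step 4: stop at convergence
            break
        means = new_means
    return means, clusters
```

On well-separated data this converges in a couple of iterations, e.g. `kmeans([1.0, 1.1, 0.9, 10.0, 10.2, 9.8], 2)` ends up with means near 1.0 and 10.0.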

Theoretically I mostly follow it, but not quite. Each step raises a question:

  1. How do I decide on the k random means? Technically I could say 5, but that may not necessarily be a good number. So is this k purely random, or is it driven by heuristics such as the size of the dataset, the number of words involved, etc.?

  2. How do you associate each word with the nearest mean? Theoretically I can conclude that each word is associated by its distance to the nearest mean: if there are 3 means, the cluster a word belongs to depends on which mean it has the shortest distance to. But how is this actually computed? Given two words "group" and "textword", and a mean word "pencil", how do I build a similarity matrix?

  3. How do you calculate the centroid?

  4. When you repeat step 2 and step 3, do you treat each previous cluster as a new data set?

Lots of questions, and I am obviously not clear. If there are any resources that I can read from, it would be great. Wikipedia did not suffice :(

Litter answered 7/12, 2012 at 18:53 Comment(0)
Since you don't know the exact number of clusters, I'd suggest a kind of hierarchical clustering:

  1. Imagine that all your words are just points in a non-Euclidean space. Use the Levenshtein distance to calculate the distance between words (it works great if you want to detect clusters of lexicographically similar words)
  2. Build a minimum spanning tree which contains all of your words
  3. Remove links whose length is greater than some threshold
  4. The linked groups of words that remain are your clusters of similar words
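
The steps above can be sketched with Kruskal's algorithm: since it adds edges in order of increasing length, cutting off at the threshold and taking the surviving connected components gives the clusters directly (the threshold value in the example is an arbitrary choice for illustration):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (ca != cb))) # substitution
        prev = curr
    return prev[-1]

def mst_clusters(words, threshold):
    """Kruskal's algorithm with union-find on the complete word graph;
    edges longer than `threshold` are never added, so the surviving
    connected components are the clusters."""
    parent = {w: w for w in words}
    def find(w):
        while parent[w] != w:
            parent[w] = parent[parent[w]]  # path compression
            w = parent[w]
        return w
    edges = sorted((levenshtein(a, b), a, b)
                   for i, a in enumerate(words) for b in words[i + 1:])
    for d, a, b in edges:
        if d > threshold:
            break                          # edges are sorted; the rest are longer
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb
    clusters = {}
    for w in words:
        clusters.setdefault(find(w), []).append(w)
    return list(clusters.values())
```

For example, `mst_clusters(["cat", "cats", "cart", "dog", "dogs"], 1)` groups the cat-like words together and the dog-like words together.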


P.S. You can find many papers on the web describing clustering based on building a minimal spanning tree.

P.P.S. If you want to detect clusters of semantically similar words, you need algorithms for automatic thesaurus construction.

Gamut answered 8/12, 2012 at 12:6 Comment(2)
I don't know how Levenshtein distance will help in calculating the distance/similarity between two words – Litter
Thanks for the nudge in the right direction. What I did was create a correlation matrix with the Levenshtein distance and then, according to a threshold value, selected similar words. It's not the best approach, but the data being highly unstructured and the model being unsupervised, there's not much I can do – Anthraquinone

That you have to choose "k" is one of the biggest drawbacks of k-means. However, if you use the search function here, you will find a number of questions that deal with the known heuristic approaches to choosing k, mostly by comparing the results of running the algorithm multiple times.

As for "nearest": k-means actually does not use distances. Some people believe it uses the Euclidean distance, others say the squared Euclidean distance. Technically, what k-means is interested in is the variance. It minimizes the overall variance by assigning each object to the cluster where its variance contribution is smallest. Coincidentally, the sum of squared deviations (one object's contribution to the total variance) over all dimensions is exactly the definition of squared Euclidean distance. And since the square root is monotone, you can also use Euclidean distance instead.
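
Concretely, one object's contribution to its cluster's variance term is the same sum of squared per-dimension deviations as its squared Euclidean distance to the centroid (tiny sketch with made-up numbers):

```python
def sq_euclidean(x, mu):
    """Sum of squared per-dimension deviations = squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(x, mu))

x = (1.0, 2.0, 3.0)    # one object (made-up numbers)
mu = (0.0, 2.0, 5.0)   # its cluster centroid
contribution = sq_euclidean(x, mu)   # (1-0)^2 + (2-2)^2 + (3-5)^2 = 5.0
```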

Anyway, if you want to use k-means with words, you first need to represent the words as vectors for which the squared Euclidean distance is meaningful. I don't think this will be easy; it may not even be possible.
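
One crude way to get such a vector representation, purely as an illustration of the difficulty rather than a recommendation, is to count character bigrams and compare the sparse count vectors:

```python
from collections import Counter

def bigram_vector(word):
    """Represent a word as counts of its character bigrams, with '#'
    as a boundary marker. A crude embedding, shown only to illustrate
    the idea, not as a recommended representation."""
    padded = "#" + word + "#"
    return Counter(padded[i:i + 2] for i in range(len(padded) - 1))

def sparse_sq_dist(u, v):
    """Squared Euclidean distance between two sparse count vectors
    (missing keys count as zero)."""
    keys = set(u) | set(v)
    return sum((u[k] - v[k]) ** 2 for k in keys)
```

With this representation, words sharing no bigrams (like "group" and "pencil") simply end up at a distance equal to their total bigram count, which shows how little structure such an embedding really captures.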

Advanced answered 8/12, 2012 at 9:32 Comment(0)

About the distance: in fact, the Levenshtein (or edit) distance satisfies the triangle inequality. It also satisfies the rest of the properties needed to be a metric (not all distance functions are metrics). Therefore you can implement a clustering algorithm using this metric function, and this is the function you could use to compute your similarity matrix S:

S_{i,j} = d(x_i, x_j) = d(x_j, x_i) = S_{j,i}

It's worth mentioning that the Damerau-Levenshtein distance does not satisfy the triangle inequality, so be careful with it.
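
A minimal sketch of building the matrix S described above: because the distance is symmetric, only the upper triangle needs computing, and the diagonal is zero.

```python
def levenshtein(a: str, b: str) -> int:
    """Dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (ca != cb))) # substitution
        prev = curr
    return prev[-1]

def distance_matrix(words):
    """S[i][j] = d(x_i, x_j); filled symmetrically since d is a metric."""
    n = len(words)
    S = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            S[i][j] = S[j][i] = levenshtein(words[i], words[j])
    return S
```

For example, `distance_matrix(["cat", "cart", "dog"])` yields a symmetric 3x3 matrix with zeros on the diagonal.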

About the k-means algorithm: yes, in the basic version you must set the parameter K by hand. The rest of the algorithm is the same for any given metric.

Gilberte answered 1/7, 2014 at 19:32 Comment(0)
