Python KMeans clustering words

B

4

9

I would like to perform k-means clustering on a list of words, using Levenshtein distance as the distance measure.

1) I know there are a lot of frameworks out there, including scipy and orange, that have a k-means implementation. However, they all require some sort of vector as input data, which doesn't really fit my case.

2) I need a good clustering implementation. I looked at python-clustering and realized that it a) doesn't return the sum of all the distances to each centroid, and b) doesn't have any sort of iteration limit or cut-off that ensures the quality of the clustering. Neither python-clustering nor the clustering algorithm on daniweb really works for me.

Can someone point me to a good library? Google hasn't been my friend.

Bard answered 17/3, 2010 at 3:29 Comment(1)

I would need exactly the same thing. Have you found anything since then? – Beghard 31/10, 2013 at 21:39

B

1

Yeah, I think there isn't a good implementation of what I need.

I have some crazy requirements, like distance caching, etc.

So I think I will just write my own lib and release it as GPLv3 soon.

Bard answered 17/3, 2010 at 6:35 Comment(0)

S

0

Not really an answer to your specific question, but I recommend glancing at "Programming Collective Intelligence". At the end of each chapter, e.g., clustering, it wanders off into describing all the best reading on the subject.

Stepchild answered 17/3, 2010 at 6:18 Comment(0)

N

0

Maybe have a look at Weka. It is a Java library with some unsupervised learning implementations and nice visualization tools. It has been a while since I used it, and I am not sure if it is great for a real production environment, but it is definitely a good starting point.

Nigrosine answered 9/1, 2012 at 11:11 Comment(0)

T

0

What about this very nice answer on CrossValidated?

It uses Affinity Propagation instead of k-means, and in that case you can give a distance metric as input. I do not think any k-means-based approach could work in your case, since k-means is based on building a centroid, and to do that you have to be in a vector space.

Affinity Propagation has the bonus that it automatically selects the number of clusters, which you can tweak (to get more or fewer clusters) by altering the preference (which by default is the median of all pairwise similarities, i.e. the negated distances here, but you can choose other percentiles).
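
As a rough illustration of the approach described above, the following sketch feeds negated Levenshtein distances to scikit-learn's `AffinityPropagation` with `affinity="precomputed"`. It assumes numpy and scikit-learn are installed; the word list and the helper `levenshtein` are invented for the example and are not taken from the linked answer.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation


def levenshtein(a, b):
    """Plain dynamic-programming Levenshtein distance (illustrative helper)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + (ca != cb)))
        previous = current
    return previous[-1]


# Hypothetical sample data.
words = ["apple", "apples", "appel", "banana", "bananas", "bandana",
         "cherry", "cherries"]

# Affinity Propagation expects similarities, so the distances are negated.
similarity = -np.array(
    [[levenshtein(w1, w2) for w2 in words] for w1 in words], dtype=float
)

# preference is left at its default (the median of the input similarities);
# pass preference=... to get more or fewer clusters.
model = AffinityPropagation(affinity="precomputed", damping=0.5, random_state=0)
model.fit(similarity)
labels = model.labels_

for cluster_id, exemplar_index in enumerate(model.cluster_centers_indices_):
    members = [w for w, label in zip(words, labels) if label == cluster_id]
    print(words[exemplar_index], "->", members)
```

Each cluster is represented by an exemplar, which is one of the input words, so no vector-space centroid is ever constructed.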

If you need to specify the exact number of clusters, besides tweaking Affinity Propagation by trial and error, you could look for an implementation of k-medoids (apparently there is no implementation of it in sklearn, but people have asked for it here and there). K-medoids does not build centroids, so it does not need the concept of a vector space. An implementation might therefore accept a precomputed distance matrix as input (I haven't checked the references I give, though).
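
To make the k-medoids idea concrete, here is a minimal pure-Python sketch of the simple alternating variant (assign points to the nearest medoid, then re-pick each cluster's medoid) on a precomputed distance matrix. It is only an illustration under stated assumptions, not a reference implementation: the function `k_medoids`, its parameters, and the sample words are made up here, and the input words are assumed to be distinct. It has an iteration limit and returns the sum of distances to the medoids, the two features the question asks about.

```python
import random


def levenshtein(a, b):
    """Plain dynamic-programming Levenshtein distance (illustrative helper)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + (ca != cb)))
        previous = current
    return previous[-1]


def assign(dist, medoids):
    """Label every point with the index of its nearest medoid."""
    return [min(range(len(medoids)), key=lambda c: dist[i][medoids[c]])
            for i in range(len(dist))]


def k_medoids(dist, k, max_iter=100, seed=0):
    """Return (medoids, labels, total_cost) for a precomputed distance matrix."""
    rng = random.Random(seed)
    medoids = rng.sample(range(len(dist)), k)  # random distinct starting medoids
    for _ in range(max_iter):                  # hard iteration limit
        labels = assign(dist, medoids)
        new_medoids = []
        for c in range(k):
            members = [i for i, label in enumerate(labels) if label == c]
            if not members:                    # only possible with duplicate points
                new_medoids.append(medoids[c])
                continue
            # The new medoid minimises the total distance to its cluster members.
            new_medoids.append(
                min(members, key=lambda m: sum(dist[m][j] for j in members)))
        if new_medoids == medoids:             # converged: nothing changed
            break
        medoids = new_medoids
    labels = assign(dist, medoids)
    total_cost = sum(dist[i][medoids[labels[i]]] for i in range(len(dist)))
    return medoids, labels, total_cost


# Hypothetical sample data.
words = ["apple", "apples", "appel", "banana", "bananas", "bandana"]
dist = [[levenshtein(w1, w2) for w2 in words] for w1 in words]

medoids, labels, total_cost = k_medoids(dist, k=2)
for c, m in enumerate(medoids):
    print(words[m], "->", [w for w, label in zip(words, labels) if label == c])
print("sum of distances to medoids:", total_cost)
```

Like k-means, this procedure only finds a local optimum, so running it with several seeds and keeping the result with the lowest `total_cost` is a common precaution.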

Thoria answered 7/9, 2018 at 7:59 Comment(0)
