Using Silhouette Clustering in Spark

I want to use the silhouette to determine the optimal value of k when using KMeans clustering in Spark. Is there an optimal way to parallelize this, i.e. make it scalable?

Glace answered 6/8, 2015 at 18:24 Comment(0)

No, the silhouette by definition is not scalable.

It uses pairwise distances, which will always take O(n^2) time to compute.
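For reference, the standard silhouette of a point i in cluster C_I is defined in terms of pairwise distances (the textbook definition, stated here only to show where the quadratic cost comes from):

    a(i) = \frac{1}{|C_I| - 1} \sum_{j \in C_I,\, j \ne i} d(i, j)
    b(i) = \min_{J \ne I} \frac{1}{|C_J|} \sum_{j \in C_J} d(i, j)
    s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}

Both a(i) and b(i) require the distance from i to every other point, so scoring all n points is inherently O(n^2).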

You will need to use something different. Using the Silhouette on large data is absurd: it takes far longer to compute the evaluation measure than to run the actual k-means clustering algorithm.

Or reconsider what you are doing: does it even make sense to use the silhouette at all? You could also decide to run something faster than Spark on single nodes, compute the Silhouette there, and simply parallelize over k, without all the overhead of distributed computation. Spark may win against MapReduce/Mahout, but it will lose against a good non-distributed implementation.

Tupiguarani answered 6/8, 2015 at 19:37 Comment(9)
Perhaps recommend an alternative method for selecting K that works better in a distributed environment?Glace
I did not find any method to work well, nor is k-means applicable for most data sets. Also, all data sets where k-means can be used tend to be tiny anyway. How big is yours?Tupiguarani
~50 TB, but I'm not clustering the entire thing, no more than 1 GB at a timeGlace
Also, wouldn't any evaluation of clustering involve a similar complexity? For example the Dunn or Davies-Bouldin index?Glace
1 GB fits into memory of a single host. Use a single-node approach, it WILL be faster, trust me. Many of these indexes will have quadratic complexity, yes. But they won't yield meaningful results anyway, so who cares?Tupiguarani
Do not assume they really tell you anything about the result quality. They won't. They will return you some rather meaningless number; that is all they do. Blindly choosing the result based on that number is as good as picking one at random. (So is running k-means on a data set that you do not understand well, by the way - do you understand your data?)Tupiguarani
Yes I do; however, I want to create a real-time clustering analysis, and the naturally occurring clusters vary from case to case. This is why I need some automated method. Why do you believe these cluster evaluation metrics are meaningless? Are you of the opinion that there is no way to automatically choose the value of k (and that it must always be done by humans)?Glace
Yes. I am convinced there is no way to automate k-means, or clustering in general. It is all about fiddling with the knobs and looking at the outcomes by hand. No turnkey solution has ever worked.Tupiguarani
Have you considered using the Within Set Sum of Squared Errors already implemented in MLlib (spark.apache.org/docs/latest/ml-clustering.html#k-means), which can also help determine the number of clusters [as stated here https://mcmap.net/q/80153/-cluster-analysis-in-r-determine-the-optimal-number-of-clusters]?Zeba
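As a rough illustration of the WSSSE / elbow approach mentioned in that last comment (a minimal sketch, not part of any answer above): it assumes a DataFrame data with a "features" vector column, and uses trainingCost from the Spark ML KMeans training summary (available since Spark 2.4) as the within-set sum of squared errors.

    import org.apache.spark.ml.clustering.KMeans

    // Sweep candidate values of k and record the within-set sum of squared errors.
    // `data` is assumed to be a DataFrame with a "features" vector column.
    val wssse = (2 to 10).map { k =>
      val model = new KMeans().setK(k).setSeed(1L).fit(data)
      (k, model.summary.trainingCost)
    }

    // Look for the "elbow": the k beyond which the cost stops dropping sharply.
    wssse.foreach { case (k, cost) => println(s"k=$k  WSSSE=$cost") }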

Yes, there are ways to make the silhouette metric scalable. No, it is not published all the way as I describe it here. It is not that complex, so you can understand it too, and maybe write it. Just let me know, please, so I can use it if you write it first.

Looks like we both need to write a high-performance silhouette scorer. The input is any cluster-assignment column vector, which gives the scorer the ability to work with every clustering implementation. Use MapReduce if possible, for an easy distributed version as well as a shared-memory one. It looks possible. Page 4 shows the math: http://cran.us.r-project.org/web/packages/clValid/vignettes/clValid.pdf

An LSH would help algorithmically, since it avoids the exact distance computations that dominate the cost. A good LSH implementation would then be essential, but I have not found one. Sklearn's LSHForest is the right idea, but not implemented well enough. A simplified or approximate silhouette would be interesting too; including an LSH would make the results approximate. Use the LSH only to find the nearest point and centroid, which avoids the all-pairs computations.

Page 28 of this article has several good suggestions: https://arxiv.org/pdf/1605.01802.pdf It says to use the simplified silhouette rather than the plain silhouette: change the computation from point-to-point distances to point-to-centroid distances. That reduces the all-pairs computation over the points within the cluster and the closest neighboring cluster, which is O(n^2), down to a linear O(n) computation. Here is my understanding and translation:

Start with:
File of cluster tuples: (clusterID, cluster centroid point)
File of example point tuples: (example point, clusterID). Note that the clusterID column is the cluster-assignment column vector described above.
Processing:
For each cluster tuple, run a map(): hashmap[clusterID] = cluster centroid point
For each example point tuple, run:
map(): a = distance between the point and its own cluster centroid (looked up in the hashmap)
map(): b = distance between the point and the closest centroid of any other cluster
map(): emit SSI = (b - a) / max(a, b)
reduce(): by clusterID, emit (clusterID, sum of the cluster's SSI values / #points in the cluster)

I may end up being the implementor. It's surprising nobody has written a fast one like this before. I expect people have done it already, but they are keeping it to themselves for competitive purposes (corporate profit, Kaggle placings, etc.).

The above is formatted as code, but it is not code; it is an English outline, or pseudocode. Stack Overflow forced me to format that section as code in order to accept it.
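Here is a minimal Spark (Scala) sketch of the outline above. It is an illustration under assumptions, not a published implementation: points are plain Array[Double] vectors, cluster assignments are Ints, the centroid map is assumed small enough to broadcast, and there are at least two clusters. All names in it are made up for this example.

    import org.apache.spark.rdd.RDD

    object SimplifiedSilhouette {

      // Euclidean distance between two points of the same dimensionality.
      private def dist(a: Array[Double], b: Array[Double]): Double =
        math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

      /**
       * points:    RDD of (example point, clusterID)
       * centroids: clusterID -> cluster centroid point (small; broadcast to workers)
       * Returns (clusterID, mean simplified silhouette of that cluster).
       */
      def perClusterScore(points: RDD[(Array[Double], Int)],
                          centroids: Map[Int, Array[Double]]): RDD[(Int, Double)] = {
        val bc = points.sparkContext.broadcast(centroids)   // hashmap[clusterID] = centroid

        points
          .map { case (p, cid) =>
            val cents = bc.value
            val a = dist(p, cents(cid))                                          // to own centroid
            val b = cents.collect { case (k, c) if k != cid => dist(p, c) }.min  // nearest other centroid
            val ssi = if (math.max(a, b) == 0.0) 0.0 else (b - a) / math.max(a, b)
            (cid, (ssi, 1L))
          }
          .reduceByKey { case ((s1, n1), (s2, n2)) => (s1 + s2, n1 + n2) }  // sum SSI and count per cluster
          .mapValues { case (sum, n) => sum / n }                           // cluster mean SSI
      }
    }

To get a single score for a given k, average the per-cluster means, or take the mean SSI over all points directly.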

Nasturtium answered 20/12, 2016 at 2:11 Comment(1)
Wanted to thank Geoffrey Anderson for his suggestion. I ended up implementing something slightly more elaborate than the suggestion in the Multiple KMeans++ paper, but my investigation started from there. I have written a blog post about the implementation (using Spark RDD-based Scala code) here -- sujitpal.blogspot.com/2018/03/…Cloris

ClusteringEvaluator has been available since Spark 2.3.0; it computes the Silhouette score.
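For example (a minimal sketch, assuming a DataFrame data with a "features" vector column; the evaluator's default metric is the silhouette with squared Euclidean distance):

    import org.apache.spark.ml.clustering.KMeans
    import org.apache.spark.ml.evaluation.ClusteringEvaluator

    val evaluator = new ClusteringEvaluator()  // defaults to the silhouette metric

    // Score a range of candidate k values and keep the one with the highest silhouette.
    val scores = (2 to 10).map { k =>
      val model = new KMeans().setK(k).setSeed(1L).fit(data)
      (k, evaluator.evaluate(model.transform(data)))
    }
    val bestK = scores.maxBy(_._2)._1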

Locomobile answered 21/9, 2018 at 7:41 Comment(0)

I cannot add a comment, but I want to highlight the answer from Yulin GUO:

ClusteringEvaluator has been available since Spark 2.3.0; it computes the Silhouette score.

This is a scalable, efficient implementation introduced in SPARK-14516.

Janijania answered 12/6, 2019 at 10:49 Comment(0)
