I want to use the silhouette score to determine the optimal value of k when using KMeans clustering in Spark. Is there an optimal way to parallelize this, i.e. make it scalable?
No, the silhouette is by definition not scalable.
It uses pairwise distances, which will always take O(n^2) time to compute.
You will need to use something different. Using the silhouette on large data is absurd: it takes far longer to compute the evaluation measure than to run the actual k-means clustering algorithm.
Or reconsider what you are doing. Does it make sense to use the silhouette at all, for example? You could also decide to run something faster than Spark on single nodes, compute the silhouette there, and simply parallelize over k, without all the overhead of distributed computation. Spark may win against Mahout on MapReduce, but it will lose against a good non-distributed implementation.
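As an illustration of the "parallelize over k" suggestion, here is a minimal sketch (not part of the original answer): each candidate k is scored with a non-distributed k-means and silhouette from scikit-learn, and the candidates are fanned out across worker processes with joblib. The synthetic data, the range of k, and the silhouette sample size are all assumptions.

```python
# Sketch: evaluate each candidate k with a non-distributed k-means + silhouette
# (scikit-learn) and parallelize over the candidate values of k.
import numpy as np
from joblib import Parallel, delayed
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 8))  # stand-in for your (sampled) feature matrix

def score_k(k):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    # silhouette_score is O(n^2); sample_size keeps it tractable on larger data
    return k, silhouette_score(X, labels, sample_size=5_000, random_state=0)

results = Parallel(n_jobs=4)(delayed(score_k)(k) for k in range(2, 11))
best_k, best_score = max(results, key=lambda r: r[1])
print(f"best k = {best_k}, silhouette = {best_score:.3f}")
```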
What about the Within Set Sum of Squared Errors (spark.apache.org/docs/latest/ml-clustering.html#k-means), which can also help determine the number of clusters [as stated here https://mcmap.net/q/80153/-cluster-analysis-in-r-determine-the-optimal-number-of-clusters]? – Zeba
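For what it is worth, a minimal PySpark sketch of the elbow approach this comment suggests, using the training cost (WSSSE) exposed on the fitted model (model.summary.trainingCost, Spark 2.4+). The DataFrame df with a "features" column and the range of k are placeholders.

```python
# Sketch: compute the within-set sum of squared errors (training cost) for several
# values of k and look for an "elbow". Assumes `df` has a "features" vector column.
from pyspark.ml.clustering import KMeans

costs = {}
for k in range(2, 11):
    model = KMeans(k=k, seed=1, featuresCol="features").fit(df)
    costs[k] = model.summary.trainingCost  # WSSSE on the training data

for k in sorted(costs):
    print(k, costs[k])
```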
Yes, there are ways to make the silhouette metric scalable. No, it is not published all the way, as I describe here. It is not that complex, so you can understand it too and maybe write it. Just let me know, please, so I can use it if you write it first.
Looks like we both need to write a high-performance silhouette scorer. The input should be any cluster-assignment column vector, giving the scorer the ability to work with every clustering implementation. Use MapReduce if possible, for an easy distributed version as well as a shared-memory one. It looks possible; page 4 shows the math: http://cran.us.r-project.org/web/packages/clValid/vignettes/clValid.pdf

An LSH would help algorithmically, since it avoids the exact distance computations that dominate the math. A good LSH implementation would then be essential, but I have not found one. Sklearn's LSHForest is the right idea but not implemented well enough. A simplified or approximate silhouette would be interesting too; including an LSH would make the results approximate. Use the LSH only to find the nearest point and centroid, which avoids the all-pairs computations.

Page 28 of this article has several good suggestions: https://arxiv.org/pdf/1605.01802.pdf It seems to say: use the simplified silhouette, not the plain silhouette, as follows: change the computation from point-to-point distances to point-to-cluster-centroid distances. That reduces the all-pairs computation over the points within the cluster and the closest neighboring cluster, which is O(n^2), down to a linear O(n) computation. Here is my understanding and translation:
Start with:
File of cluster tuples: (clusterID, cluster centroid point)
File of example point tuples: (example point, clusterID). Notice the clusterID is the cluster-assignment column vector described above.
Processing:
For each cluster tuple, run map(): hashmap[clusterID] = cluster centroid point
For each example point tuple, run:
map(): a = distance between the example point and its own cluster centroid (carrying copies of the example point and its clusterID)
map(): b = distance between the example point and the closest centroid of any other cluster
map(): emit SSI = (b - a) / max(a, b)
reduce(): by clusterID, emit (clusterID, sum of the cluster's SSI values / number of points)
I may end up being the implementer. It's surprising nobody has written a fast one like this before. My expectation is that people have already done it, but they keep it to themselves for competitive reasons (corporate profit, Kaggle placings, etc.).
The above is formatted as code but is not code; it is an English outline, i.e. pseudocode. Stack Overflow forced me to format this section as code before it would accept it.
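A rough PySpark sketch of the simplified-silhouette outline above (not the answerer's code, just one possible translation). It assumes an active SparkSession spark, a fitted KMeansModel model, and predictions = model.transform(df) with "features" and "prediction" columns; the column names and the broadcast of the centroid map are choices, not requirements.

```python
# Sketch: simplified silhouette in Spark, following the outline above.
# For each point: a = distance to its own centroid, b = distance to the nearest other
# centroid, per-point score = (b - a) / max(a, b); then average per cluster.
import numpy as np
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

centroids = {i: np.asarray(c) for i, c in enumerate(model.clusterCenters())}
bc_centroids = spark.sparkContext.broadcast(centroids)

def simplified_silhouette(features, cluster_id):
    cents = bc_centroids.value
    x = features.toArray()
    a = float(np.linalg.norm(x - cents[cluster_id]))   # distance to own centroid
    b = min(float(np.linalg.norm(x - c))               # nearest other centroid
            for cid, c in cents.items() if cid != cluster_id)
    return (b - a) / max(a, b) if max(a, b) > 0 else 0.0

ssi_udf = F.udf(simplified_silhouette, DoubleType())

per_cluster = (predictions
               .withColumn("ssi", ssi_udf("features", "prediction"))
               .groupBy("prediction")
               .agg(F.avg("ssi").alias("mean_simplified_silhouette")))
per_cluster.show()
```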
ClusteringEvaluator is available since Spark 2.3.0, and it computes the Silhouette score.
I cannot add a comment, but I want to highlight the answer from Yulin GUO:
ClusteringEvaluator is available since Spark 2.3.0, and it computes the Silhouette score.
This is a scalable, efficient implementation, introduced in SPARK-14516.
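For reference, a minimal sketch of using ClusteringEvaluator to compare candidate values of k; the DataFrame df with a "features" column and the range of k are placeholders.

```python
# Sketch: pick k by the silhouette computed by ClusteringEvaluator.
# Assumes `df` is a DataFrame with a "features" vector column.
from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator

evaluator = ClusteringEvaluator(featuresCol="features",
                                predictionCol="prediction",
                                metricName="silhouette")

scores = {}
for k in range(2, 11):
    model = KMeans(k=k, seed=1, featuresCol="features").fit(df)
    scores[k] = evaluator.evaluate(model.transform(df))

best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])
```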