Looking over the source code for Bisecting K-means, it seems that it builds an internal tree representation of the cluster assignments at each level as it progresses. Is it possible to get access to that tree? The built-in methods only give the cluster assignment at the leaves, not at the internal nodes.
Follow up on this: has anyone modified the Spark ML source code to be able to store & return the hierarchical clustering tree structure?
I found a GitHub repo with intro to MLlib 1.6's implementation of Bisecting K-means Clustering: https://github.com/yu-iskw/bisecting-kmeans-blog/blob/master/blog-article.md
In the section "What's Next?", the first JIRA ticket [SPARK-11664] "Add methods to get bisecting k-means cluster structure" (https://issues.apache.org/jira/browse/SPARK-11664) seems to be the request to obtain the hierarchical cluster tree structure as a built-in effort. As of today, this ticket status is marked as "resolved".
However, in Spark MLlib's latest implementation (2.4.4), linked below, we did not find this tree structure (dendrogram) exposed as a built-in output:
PySpark MLlib 2.4.4 official documentation: https://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.clustering.BisectingKMeans https://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.clustering.BisectingKMeansModel
Scala MLlib 2.4.4 official documentation: https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.clustering.BisectingKMeans https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.clustering.BisectingKMeansModel
We also looked into the source code, and it does not seem to store the hierarchical tree structure as a built-in output.
If the hierarchical clustering tree structure is not available in Spark MLlib 2.4.4 BisectingKMeans, does anyone know of a modified version of the source code that makes the tree structure available?
Thanks!
I did not find a way to get this via the PySpark API, but if you save a `BisectingKMeansModel` to disk, it writes Parquet files that allow you to reconstruct the tree.

From Python I was not able to get this to work with the deprecated `pyspark.mllib` package; use `pyspark.ml` instead. Specifically, `pyspark.ml.clustering.BisectingKMeansModel` exposes a `.save(path)` method.
from pyspark.ml.clustering import BisectingKMeans

k = 30
bkm = BisectingKMeans(k=k, minDivisibleClusterSize=1.0)
bkm.setMaxIter(10)
model = bkm.fit(examples)  # `examples` is your training DataFrame
model.save("path/to/saved_model")
Now separately, in Python, I use PyArrow to load the serialized model from the `data` subdirectory:
from pyarrow import parquet as pq
tabl = pq.read_table("path/to/saved_model/data/")
nodesdf = tabl.to_pandas()
print(nodesdf)
index size center norm cost height children
0 0 147 {'typ... 4.573446 242.518210 0.000000 []
1 1 88 {'typ... 4.635275 151.815024 0.000000 []
2 -5 228 {'typ... 4.576475 378.479211 0.740183 [0, 1]
3 2 22 {'typ... 4.568550 312.282380 0.000000 []
4 -4 250 {'typ... 4.511299 837.245032 2.464225 [-5, 2]
Each row of the table represents a node in the dendrogram. Leaf nodes have an empty `children` list; root and inner nodes use the `children` list to reference the `index` column of their child rows. (It seems that another way to distinguish the two types of node is whether the `index` is negative or not.)
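To make this concrete, here is a minimal sketch of recovering the tree shape from the `index` and `children` columns. To keep the snippet self-contained it builds `nodesdf` inline from the values in the printed table above (in practice you would use the DataFrame loaded from Parquet); the column names are taken from that printed output and may differ across Spark versions:

```python
import pandas as pd

# Illustrative node data, copied from the printed table above.
nodesdf = pd.DataFrame({
    "index":    [0, 1, -5, 2, -4],
    "children": [[], [], [0, 1], [], [-5, 2]],
})

# Map each node's index to the indices of its children.
children = dict(zip(nodesdf["index"], nodesdf["children"]))

# The root is the only node that never appears in any children list.
all_children = {c for kids in children.values() for c in kids}
root = next(i for i in children if i not in all_children)

def print_tree(idx, depth=0):
    """Print the cluster tree as an indented outline."""
    print("  " * depth + str(idx))
    for child in children[idx]:
        print_tree(child, depth + 1)

print_tree(root)
```

For the sample rows above, the root comes out as node -4, with -5 and 2 as its children and the leaves 0 and 1 under -5.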
The leaf nodes (the rows returned by `nodesdf.loc[nodesdf["children"].str.len() == 0]`) correspond to what the model exposes through its PySpark API:

- The set of leaf nodes returned by the query above is equal to what `model.clusterCenters()` returns.
- The `index` column of a leaf node corresponds to the cluster id that the model ultimately predicts for a data point in the PySpark API (`model.transform(examples)`). So you could write some code to tie data points to their place in the dendrogram.
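As a sketch of that last step: once you invert the `children` lists into a child-to-parent map, you can walk from any predicted leaf cluster id up to the root to get its position in the dendrogram. Again, `nodesdf` is built inline from the sample values in the printed table so the snippet runs on its own; the helper name `path_to_root` is my own, not part of any Spark API:

```python
import pandas as pd

# Illustrative node data, copied from the printed table above.
nodesdf = pd.DataFrame({
    "index":    [0, 1, -5, 2, -4],
    "children": [[], [], [0, 1], [], [-5, 2]],
})

# Invert the children lists into a child -> parent map.
parent = {
    child: row["index"]
    for _, row in nodesdf.iterrows()
    for child in row["children"]
}

def path_to_root(leaf):
    """Return the node indices from a leaf cluster up to the root."""
    path = [leaf]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path

print(path_to_root(1))  # → [1, -5, -4]
```

Joining a data point's predicted cluster id (from `model.transform`) against these paths would place every point in the hierarchy.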