Looking over the source code for Bisecting K-means, it seems that it builds an internal tree representation of the cluster assignments at each level as it progresses. Is it possible to get access to that tree? The built-in methods only give the cluster assignment at the leaves, not at the internal nodes.
Follow up on this: has anyone modified the Spark ML source code to be able to store & return the hierarchical clustering tree structure?
I found a GitHub repo with intro to MLlib 1.6's implementation of Bisecting K-means Clustering: https://github.com/yu-iskw/bisecting-kmeans-blog/blob/master/blog-article.md
In the section "What's Next?", the first JIRA ticket [SPARK-11664] "Add methods to get bisecting k-means cluster structure" (https://issues.apache.org/jira/browse/SPARK-11664) seems to be the request to obtain the hierarchical cluster tree structure as a built-in effort. As of today, this ticket status is marked as "resolved".
However, in Spark MLlib's latest implementation (2.4.4), linked below, we did not find this tree structure (dendrogram) exposed as a built-in output:
PySpark MLlib 2.4.4 official documentation: https://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.clustering.BisectingKMeans https://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.clustering.BisectingKMeansModel
Scala MLlib 2.4.4 official documentation: https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.clustering.BisectingKMeans https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.clustering.BisectingKMeansModel
We also looked into the source code, and it does not seem to store the hierarchical tree structure as a built-in output.
If the hierarchical clustering tree structure is not available in Spark MLlib 2.4.4 BisectingKMeans, does anyone know of a modified version of the source code that makes the tree structure available?
Thanks!
I did not find a way to get this via the PySpark API, but if you save a `BisectingKMeansModel` to disk, it writes Parquet files that allow you to reconstruct the tree.

From Python I was not able to get this to work with the deprecated `pyspark.mllib` package; use `pyspark.ml` instead. Specifically, `pyspark.ml.clustering.BisectingKMeansModel` exposes a `.save(path)` method.
from pyspark.ml.clustering import BisectingKMeans

k = 30
bkm = BisectingKMeans(k=k, minDivisibleClusterSize=1.0)
bkm.setMaxIter(10)
model = bkm.fit(examples)  # `examples` is your training DataFrame
model.save("path/to/saved_model")
Now separately, in Python, I use PyArrow to load the serialized model from the `data` subdirectory:
from pyarrow import parquet as pq
tabl = pq.read_table("path/to/saved_model/data/")
nodesdf = tabl.to_pandas()
print(nodesdf)
index size center norm cost height children
0 0 147 {'typ... 4.573446 242.518210 0.000000 []
1 1 88 {'typ... 4.635275 151.815024 0.000000 []
2 -5 228 {'typ... 4.576475 378.479211 0.740183 [0, 1]
3 2 22 {'typ... 4.568550 312.282380 0.000000 []
4 -4 250 {'typ... 4.511299 837.245032 2.464225 [-5, 2]
Each row of the table represents a node in the dendrogram. Leaf nodes have an empty `children` list; root and inner nodes use the `children` list to reference the `index` column of their child rows. (It seems that another way to distinguish the two types of node is whether the `index` is negative or not.)
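To make this concrete, here is a minimal sketch of recovering the tree shape from the `index` and `children` columns. To keep the snippet self-contained it builds `nodesdf` inline from the values in the printed table above (in practice you would use the DataFrame loaded from Parquet); the column names are taken from that printed output and may differ across Spark versions:

```python
import pandas as pd

# Illustrative node data, copied from the printed table above.
nodesdf = pd.DataFrame({
    "index":    [0, 1, -5, 2, -4],
    "children": [[], [], [0, 1], [], [-5, 2]],
})

# Map each node's index to the indices of its children.
children = dict(zip(nodesdf["index"], nodesdf["children"]))

# The root is the only node that never appears in any children list.
all_children = {c for kids in children.values() for c in kids}
root = next(i for i in children if i not in all_children)

def print_tree(idx, depth=0):
    """Print the cluster tree as an indented outline."""
    print("  " * depth + str(idx))
    for child in children[idx]:
        print_tree(child, depth + 1)

print_tree(root)
```

For the sample rows above, the root comes out as node -4, with -5 and 2 as its children and the leaves 0 and 1 under -5.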
The leaf nodes (the rows returned by `nodesdf.loc[nodesdf["children"].str.len() == 0]`) correspond to what the model exposes through its PySpark API:

- The set of leaf nodes returned by the query above is equal to what `model.clusterCenters()` returns.
- The `index` column of a leaf node corresponds to the cluster id that the model ultimately predicts for a data point in the PySpark API (`model.transform(examples)`). So you could write some code to tie data points to their place in the dendrogram.
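As a sketch of that last step: once you invert the `children` lists into a child-to-parent map, you can walk from any predicted leaf cluster id up to the root to get its position in the dendrogram. Again, `nodesdf` is built inline from the sample values in the printed table so the snippet runs on its own; the helper name `path_to_root` is my own, not part of any Spark API:

```python
import pandas as pd

# Illustrative node data, copied from the printed table above.
nodesdf = pd.DataFrame({
    "index":    [0, 1, -5, 2, -4],
    "children": [[], [], [0, 1], [], [-5, 2]],
})

# Invert the children lists into a child -> parent map.
parent = {
    child: row["index"]
    for _, row in nodesdf.iterrows()
    for child in row["children"]
}

def path_to_root(leaf):
    """Return the node indices from a leaf cluster up to the root."""
    path = [leaf]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path

print(path_to_root(1))  # → [1, -5, -4]
```

Joining a data point's predicted cluster id (from `model.transform`) against these paths would place every point in the hierarchy.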