HDBSCAN Python choose number of clusters
Is it possible to select the number of clusters in the HDBSCAN algorithm in Python, or is the only way to play around with input parameters such as alpha and min_cluster_size?

Thanks

UPDATE: here is the code to use fcluster and hdbscan

import hdbscan
from scipy.cluster.hierarchy import fcluster

clusterer = hdbscan.HDBSCAN()
clusterer.fit(X)

# Convert HDBSCAN's single-linkage tree to a scipy-format linkage matrix,
# then cut it into exactly 2 flat clusters.
Z = clusterer.single_linkage_tree_.to_numpy()
labels = fcluster(Z, 2, criterion='maxclust')
Street answered 15/1, 2018 at 18:52 Comment(0)

Thankfully, in June 2020 a contributor on GitHub provided a commit ("Module for flat clustering") that adds code to hdbscan allowing us to choose the number of resulting clusters.

To do so:

from hdbscan import flat

# Fit HDBSCAN with a fixed number of clusters
clusterer = flat.HDBSCAN_flat(train_df, n_clusters, prediction_data=True)
# Assign new points to those same clusters
flat.approximate_predict_flat(clusterer, points_to_predict, n_clusters)

You can find the code in flat.py. Using approximate_predict_flat, you should be able to choose the number of clusters for test points as well.

In addition, a Jupyter notebook explaining how to use it has also been written; see here.

Alegar answered 12/8, 2021 at 19:26 Comment(0)

If you explicitly need a fixed number of clusters, the closest you can get is to take the cluster hierarchy and perform a flat cut through it at the level that yields the desired number of clusters. That involves working with one of the tree objects HDBSCAN exposes and getting your hands a little dirty, but it can be done.
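To illustrate the flat-cut idea without requiring a fitted HDBSCAN model, here is a minimal sketch using scipy alone on synthetic two-blob data (the data and parameters are assumptions for the demo). The key point is that clusterer.single_linkage_tree_.to_numpy() returns a matrix in scipy's linkage format, so in practice you would substitute that for the Z built below:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Synthetic stand-in: two well-separated blobs. With a fitted HDBSCAN
# model you would instead use Z = clusterer.single_linkage_tree_.to_numpy(),
# which is already a scipy-format linkage matrix.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.2, (20, 2)),
               rng.normal(5, 0.2, (20, 2))])
Z = linkage(X, method='single')

# Flat cut at whatever level of the hierarchy yields exactly 2 clusters.
labels = fcluster(Z, t=2, criterion='maxclust')
print(len(np.unique(labels)))  # -> 2
```

criterion='maxclust' asks fcluster to find the cut height that produces at most the requested number of flat clusters, which is exactly the "cut through the hierarchy" described above.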

Semipostal answered 21/1, 2018 at 13:54 Comment(6)
Thanks for your comment. Looking into your suggestion, I found that HDBSCAN can be combined with scipy, so I can pass the single_linkage_tree_ from HDBSCAN to fcluster with the criterion 'maxclust' to obtain two clusters. However, in some cases HDBSCAN does not find two clusters even when the data structure visually suggests them. I have tried to tune min_samples and min_cluster_size, but I don't get the desired result.Street
@Street can you add an answer regarding your approach? I tried supplying the single_linkage tree to fcluster, but I always get results where the first cluster contains almost all samples and the rest have exactly one sample each.Somme
@JouniHelske With HDBSCAN you can do clusterer.single_linkage_tree_.get_clusters(epsilon_value, min_cluster_size=m) to get clusters at a cut level of epsilon_value and exclude any clusters with less than m points.Semipostal
@LelandMcInnes But if I want fixed k number of clusters I should go through the different values of epsilon and see when I end up with k clusters?Somme
@JouniHelske you are referring to the same problem I reported in my first comment. It seems to be related to the nature of HDBSCAN. In my case, the dataset has one extreme vector which was causing the non-intuitive clustering structure, i.e. a single vector forming its own cluster even though visually there are clearly two dense clusters. Because of this, I moved to fastcluster and its ward linkage function. If you also have extreme points in your dataset, you could try excluding those vectors before running HDBSCAN and assigning them to the closest centroid afterwards.Street
I believe you can use the tools from scipy.cluster.hierarchy to extract a flat clustering for a fixed number of clusters. The format of the result of clusterer.single_linkage_tree_.to_numpy() can be fed directly to scipy's hierarchical clustering tools.Semipostal
