Get the cluster size in sklearn in python

Asked 11/9, 2017 at 12:17 Answered 11/9, 2017 at 17:17

Solved python machine-learning scikit-learn cluster-analysis dbscan

I am using sklearn DBSCAN to cluster my data as follows.

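from sklearn.cluster import DBSCAN

# Apply DBSCAN (sims is my data as a list of lists; with metric='precomputed'
# it is treated as a square pairwise distance matrix)
db1 = DBSCAN(min_samples=1, metric='precomputed').fit(sims)

db1_labels = db1.labels_
# Number of clusters, not counting the noise label (-1) if present
db1n_clusters_ = len(set(db1_labels)) - (1 if -1 in db1_labels else 0)
# Prints the number of clusters (e.g., 10 clusters)
print('Estimated number of clusters: %d' % db1n_clusters_)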

Now I want to get the top 3 clusters sorted by size (the number of data points in each cluster). How can I obtain the cluster sizes in sklearn?

Overabound answered 11/9, 2017 at 12:17 Comment(0)

Another option would be to use numpy.unique:

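import numpy as np

db1_labels = db1.labels_
# Count the points per cluster, skipping the noise label (-1)
labels, counts = np.unique(db1_labels[db1_labels>=0], return_counts=True)
# Labels of the 3 largest clusters, largest first
print(labels[np.argsort(-counts)[:3]])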
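To report the sizes alongside the labels, the same call can be wrapped in a small helper. This is only a sketch: the function name `top_clusters` and the toy label array are made up for illustration and stand in for `db1.labels_`; noise points (label -1) are excluded.

```python
import numpy as np

def top_clusters(cluster_labels, k=3):
    """Return (labels, sizes) of the k largest clusters, ignoring noise (-1)."""
    cluster_labels = np.asarray(cluster_labels)
    labels, counts = np.unique(cluster_labels[cluster_labels >= 0],
                               return_counts=True)
    # Sort by descending size; a stable sort keeps label order among ties
    order = np.argsort(-counts, kind="stable")[:k]
    return labels[order], counts[order]

# Hypothetical labels standing in for db1.labels_
toy_labels = np.array([0, 0, 1, 2, 2, 2, -1, 3, 3, 3, 3])
top, sizes = top_clusters(toy_labels, k=3)
print(top)    # labels of the three largest clusters
print(sizes)  # their sizes, in the same order
```

If there are fewer than `k` clusters, the slice simply returns all of them.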

Holothurian answered 11/9, 2017 at 17:17 Comment(0)

You can use NumPy's bincount function to get the frequency of each label. For example, using the DBSCAN example from the scikit-learn documentation:

#Store the labels
labels = db.labels_

#Then get the frequency count of the non-negative labels
counts = np.bincount(labels[labels>=0])

print(counts)
#Output : [243 244 245]

Then, to get the top 3 values, use argsort in NumPy. Since there are only 3 clusters in this example, I will extract the top 2 instead:

top_labels = np.argsort(-counts)[:2]

print(top_labels)
#Output : [2 1]

#To get their respective frequencies
print(counts[top_labels])
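Because DBSCAN numbers its clusters 0, 1, 2, ..., the position of each entry in `counts` is the cluster label itself, so `top_labels` can be used directly to select the points of the largest clusters. A minimal sketch, assuming a made-up label array in place of `db.labels_` (the `members` dictionary is purely illustrative):

```python
import numpy as np

# Hypothetical labels standing in for db.labels_ (-1 marks noise)
labels = np.array([1, 0, 1, 2, 1, -1, 2, 0, 1, 2])

# counts[i] is the number of points in cluster i
counts = np.bincount(labels[labels >= 0])
# Labels of the two largest clusters, largest first
top_labels = np.argsort(-counts, kind="stable")[:2]

# Indices of the data points that belong to each of those clusters
members = {int(lab): np.flatnonzero(labels == lab) for lab in top_labels}

for lab in top_labels:
    print(lab, counts[lab], members[int(lab)])
```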

Kirst answered 11/9, 2017 at 16:56 Comment(2)

Thank you for your very useful answer. Please let me know how to get the cluster labels of the 245 and 244 clusters? – Overabound 12/9, 2017 at 0:19

The variable top_labels will contain the labels of the 245 and 244 clusters, in that order. Also, if you found my answer useful, can you please mark it as the correct one :-) – Kirst 12/9, 2017 at 2:26
