How to get the centroids in DBSCAN sklearn?

Asked 5/6, 2020 at 12:58 Answered 27/7, 2022 at 1:22

Solved python scikit-learn cluster-analysis dbscan

I am using DBSCAN for clustering. I now want to pick one point from each cluster to represent it, but DBSCAN, unlike k-means, does not produce centroids.

I noticed that DBSCAN does have core points. Is it possible to use these core points, or some other approach, to obtain a representative point for each cluster?

The code I am using is below.

import numpy as np
from math import pi
from sklearn.cluster import DBSCAN

#points containing time value in minutes
points = [100, 200, 600, 659, 700]

def convert_to_radian(x):
    return((x / (24 * 60)) * 2 * pi)

rad_function = np.vectorize(convert_to_radian)
points_rad = rad_function(points)

#generate distance matrix from each point
dist = points_rad[None,:] - points_rad[:, None]

#Assign shortest distances from each point
dist[((dist > pi) & (dist <= (2*pi)))] = dist[((dist > pi) & (dist <= (2*pi)))] -(2*pi)
dist[((dist > (-2*pi)) & (dist <= (-1*pi)))] = dist[((dist > (-2*pi)) & (dist <= (-1*pi)))] + (2*pi) 
dist = abs(dist)

#check dist
print(dist)

#eps is 100 minutes expressed in radians; metric is 'precomputed' because dist is a distance matrix
db = DBSCAN(eps=((100 / (24*60)) * 2 * pi ), min_samples = 2, metric='precomputed')

#check db
print(db)

db.fit(dist)

#get labels
labels = db.labels_

#get number of clusters
no_clusters = len(set(labels)) - (1 if -1 in labels else 0)

print('No of clusters:', no_clusters)
print('Cluster 0 : ', np.nonzero(labels == 0)[0])
print('Cluster 1 : ', np.nonzero(labels == 1)[0])

print(db.core_sample_indices_)

I am happy to provide more details if needed.

Ganja answered 5/6, 2020 at 12:58 Comment(1)

Just in case you don't know: Kmeans is a centroid-based method (each cluster is just a centroid and all points belong to the nearest centroid). DBSCAN is density-based, so the resulting clusters can have any shape, as long as there are points close enough to each other. So DBSCAN could also result in a "ball"-cluster in the center with a "circle"-cluster around it. Both clusters would have the same "centroid" in that case, which is the reason why computing centroids for DBSCAN results can be highly misleading. So take care when working with those centroids (or use a centroid-based method). – Monanthous 6/6, 2020 at 8:39
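To make the comment's point concrete, here is a small sketch with synthetic data (not the asker's data): a tight "ball" at the origin with a ring around it. The point counts, radii, `eps` and `min_samples` are illustrative assumptions chosen so that DBSCAN separates the two shapes.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Inner "ball": the origin plus 12 points on a circle of radius 0.5.
inner_angles = np.linspace(0.0, 2.0 * np.pi, 12, endpoint=False)
ball = np.vstack([[0.0, 0.0],
                  0.5 * np.column_stack([np.cos(inner_angles), np.sin(inner_angles)])])

# Outer ring: 60 evenly spaced points on a circle of radius 5.
ring_angles = np.linspace(0.0, 2.0 * np.pi, 60, endpoint=False)
ring = 5.0 * np.column_stack([np.cos(ring_angles), np.sin(ring_angles)])

X = np.vstack([ball, ring])
labels = DBSCAN(eps=1.0, min_samples=3).fit_predict(X)

# Arithmetic mean of each cluster's members (noise label -1 is skipped).
centroids = {int(k): X[labels == k].mean(axis=0) for k in sorted(set(labels)) if k != -1}
for k, c in centroids.items():
    print(k, np.round(c, 6))
```

DBSCAN finds two clusters, yet both means sit at (numerically) the origin, and for the ring that mean is far from every member of the cluster.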

Why not estimate the centroids from the resulting clusters?

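# Index the original points (not the distance matrix) with the cluster labels.
# Note: np.mean is a plain arithmetic mean and ignores the wrap-around at 2*pi.
points_of_cluster_0 = points_rad[labels == 0]
centroid_of_cluster_0 = np.mean(points_of_cluster_0, axis=0)
print(centroid_of_cluster_0)

points_of_cluster_1 = points_rad[labels == 1]
centroid_of_cluster_1 = np.mean(points_of_cluster_1, axis=0)
print(centroid_of_cluster_1)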

Frohman answered 5/6, 2020 at 14:20 Comment(3)

what do you mean? in which dataset/question? – Frohman 23/11, 2021 at 15:16

Think about two points: 0.0, -179.0 and 0.0, 179.0. The centroid of these is 0.0, 0.0, which is very distant from them. – Authority 24/11, 2021 at 14:8

Oh yes, my answer assumes Euclidean coordinates. For GPS coordinates you need another approach, such as converting them to a different coordinate system first. – Frohman 25/11, 2021 at 8:47
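For angular data such as the time-of-day values in the question, one common alternative to the arithmetic mean is the circular mean: average the sine and cosine of the angles and take the angle of the result. The sketch below is not from any of the answers; the helper names and the two sample times are illustrative assumptions.

```python
import numpy as np

MINUTES_PER_DAY = 24 * 60

def circular_mean(angles):
    """Mean direction of angles given in radians, returned in [0, 2*pi)."""
    angles = np.asarray(angles, dtype=float)
    mean_angle = np.arctan2(np.sin(angles).mean(), np.cos(angles).mean())
    return mean_angle % (2.0 * np.pi)

def minutes_to_radians(minutes):
    return np.asarray(minutes, dtype=float) / MINUTES_PER_DAY * 2.0 * np.pi

def radians_to_minutes(angle):
    return angle / (2.0 * np.pi) * MINUTES_PER_DAY

# Two times that straddle midnight: 23:50 and 00:10.
cluster_minutes = [1430, 10]
naive = np.mean(cluster_minutes)  # arithmetic mean, lands at noon (720)
wrapped = radians_to_minutes(circular_mean(minutes_to_radians(cluster_minutes)))
print(naive, wrapped % MINUTES_PER_DAY)
```

The arithmetic mean puts the "centre" at noon, as far from both points as possible, while the circular mean lands at midnight (up to floating-point rounding, so it may print as a value just below 1440 rather than 0).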

One option is a crude KDE (a neighbour count) computed row by row within each cluster. With inds = np.flatnonzero(cluster_results == cluster_index), compute for each i in inds a density such as density_i = np.where(cdist(x[i:i+1], x[inds]) - cut_off < 0, 1, 0).sum(1), then take the point with the highest density in each cluster as its most representative "centroid". This can still be slow if the dataset is massive.
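A minimal sketch of one reading of this idea, working from a precomputed distance matrix like the one in the question. The function name, the tie-breaking rule (smallest total distance to the other members) and the toy data are assumptions added for illustration.

```python
import numpy as np

def densest_point_per_cluster(dist, labels, cutoff):
    """For each cluster, return the index of the member that has the most
    same-cluster neighbours within `cutoff`. Ties are broken by the smallest
    total distance to the other members of the cluster."""
    dist = np.asarray(dist, dtype=float)
    labels = np.asarray(labels)
    representatives = {}
    for k in np.unique(labels):
        if k == -1:  # DBSCAN marks noise with -1
            continue
        members = np.flatnonzero(labels == k)
        sub = dist[np.ix_(members, members)]   # distances inside the cluster
        counts = (sub <= cutoff).sum(axis=1)   # neighbour count (includes self)
        total = sub.sum(axis=1)
        # lexsort uses the last key as primary: highest count, then lowest total
        best = np.lexsort((total, -counts))[0]
        representatives[int(k)] = int(members[best])
    return representatives

# Toy 1-D data with hand-written labels (-1 is noise).
values = np.array([0, 1, 2, 10, 11, 12, 14, 50], dtype=float)
toy_labels = np.array([0, 0, 0, 1, 1, 1, 1, -1])
toy_dist = np.abs(values[None, :] - values[:, None])

reps = densest_point_per_cluster(toy_dist, toy_labels, cutoff=1.0)
print(reps)  # {0: 1, 1: 4}
```

Because the result is always an actual member of the cluster, it cannot fall into empty space the way an averaged centroid can. The per-cluster submatrix is quadratic in cluster size, which matches the caveat about large datasets.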

Libido answered 27/7, 2022 at 1:22 Comment(1)

NB: as mentioned in the comment above, a non-Euclidean dataset q must first be represented (featurized/mapped) in a Euclidean coordinate system, x := map(q), before it goes into DBSCAN. [For the two GPS coordinates, the one running around the equator is mapped to a 2D circle via [sin, cos](q[:,0]), and the north-south one probably to a semicircle via [cos](q[:,1]), so x is 3D.] – Libido 27/7, 2022 at 1:44
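Applied to the one-dimensional time-of-day data in the question, the same mapping idea could look like the sketch below: each time becomes a point on the unit circle, and DBSCAN runs with its default Euclidean metric. The helper names, the sample times and the 45-minute radius are illustrative assumptions; `eps` is given as the chord length that corresponds to that arc.

```python
import numpy as np
from sklearn.cluster import DBSCAN

MINUTES_PER_DAY = 24 * 60

def embed_on_circle(minutes):
    """Map times of day (in minutes) to points on the unit circle."""
    theta = np.asarray(minutes, dtype=float) / MINUTES_PER_DAY * 2.0 * np.pi
    return np.column_stack([np.cos(theta), np.sin(theta)])

def chord_length(minutes):
    """Euclidean distance between two unit-circle points `minutes` apart."""
    theta = minutes / MINUTES_PER_DAY * 2.0 * np.pi
    return 2.0 * np.sin(theta / 2.0)

# 23:50, 00:05 and 00:20 straddle midnight; the other three sit around 10:00.
times = [1430, 5, 20, 600, 620, 640]
x = embed_on_circle(times)

labels = DBSCAN(eps=chord_length(45), min_samples=2).fit_predict(x)
print(labels)
```

In this embedding the three times around midnight end up in one cluster without any manual wrap-around handling of the distance matrix.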
