Why k-means in scikit learn have a predict function but DBSCAN/agglomerative doesnt?
Asked Answered
P

1

7

Scikit-learn implementation of K-means has a predict() function which can be applied on unseen data. Where as DBSCAN and Agglomerative does not have a predict() function.

All the three algorithms has fit_predict() which is used to fit the model and then predict. But k-means has predict() which can be directly used on unseen data which is not the case for the other algorithm.

I am very much aware that there are clustering algorithms and as per my opinion, predict() should not be there for K-means also.

What is the possible intuition/reason behind this discrepancy? is it only because k-means performs "1NN classification", so it has a predict() function?

Pageantry answered 22/7, 2020 at 14:51 Comment(0)
C
6

My interpretation is that the difference comes from the way the cluster are computed. In the KMeans there is a native way to assign a new point to a cluster, while not in DBSCAN or Agglomerative clustering.

A) KMeans

In KMeans, during the construction of the clusters, a data point is assigned to the cluster with the closest centroid, and the centroids are updated afterwards. "Predicting" in the KMeans algorithm is actually doing the assignment step without updating the clusters.

If you assume that the new data points are drawn from the same distribution than the "training" set, and that your "training" set was representative enough, it is reasonable to think that one can assign the new data points following the heuristic of the algorithm without updating the cluster centroids, thus making predictions.

Of course, if the data points distribution is likely to be change one should rerun the KMeans clustering on the updated dataset.

B) DBSCAN

DBSCAN creates the cluster by finding high density areas of the dataset (parametrized by the parameters epsilon and min_points). This is done by computing point-level properties (whether the point is a core point, a directly reachable point, a reachable point or a noise point). Adding a new data point can modify the definition of the neighboring points, and thus make the computed clusters obsolete.

As an example, let's look at this illustration from wikipedia, copied below. On this image there is one cluster (red+yellow points) and one noise point (blue). Red points are core points and yellow points are reachable points.

DBSCAN

and consider two cases:

  • Adding a new point halfway between A and N would make N a reachable point from A and thus belonging to the cluster.
  • Adding (min_points-1) new points in the epsilon-neighborhood of N, but in no other epsilon-neighborhood (as an example at the top of the picture), would change the status of N which would become a core point, and form a new cluster with the newly added points.

Here adding new data points clearly requires to recompute the clusters.

C) Aggglomerative clustering

Agglomerative clustering iteratively builds the cluster starting from points and merges them according to a linkage measure. Similarly to DBSCAN, adding new data points can entirely modify the final clusters because it can trigger different mergings.

As an example, if the linkage strategy you choose in sklearn is "single", clusters are merged if the minimum distance between all elements of the two clusters is below a chosen threshold. You can easily figure out that a single well placed new data point can trigger a merge between two clusters that would have been separated otherwise.

Thus predicting here also requires to recompute the clusters

Cenac answered 23/7, 2020 at 11:32 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.