Can k-means clustering do classification?
Asked Answered
T

7

22

I want to know whether the k-means clustering algorithm can do classification?

If I have done a simple k-means clustering .

Assume I have many data , I use k-means clusterings, then get 2 clusters A, B. and the centroid calculating method is Euclidean distance.

Cluster A at left side.

Cluster B at right side.

So, if I have one new data. What should I do?

  1. Run k-means clustering algorithm again, and can get which cluster does the new data belong to?

  2. Record the last centroid and use Euclidean distance to calculating to decide the new data belong to?

  3. other method?

Tague answered 10/3, 2014 at 13:0 Comment(0)
T
26

The simplest method of course is 2., assign each object to the closest centroid (technically, use sum-of-squares, not Euclidean distance; this is more correct for k-means, and saves you a sqrt computation).

Method 1. is fragile, as k-means may give you a completely different solution; in particular if it didn't fit your data well in the first place (e.g. too high dimensional, clusters of too different size, too many clusters, ...)

However, the following method may be even more reasonable:

3. Train an actual classifier.

Yes, you can use k-means to produce an initial partitioning, then assume that the k-means partitions could be reasonable classes (you really should validate this at some point though), and then continue as you would if the data would have been user-labeled.

I.e. run k-means, train a SVM on the resulting clusters. Then use SVM for classification.

k-NN classification, or even assigning each object to the nearest cluster center (option 1) can be seen as very simple classifiers. The latter is a 1NN classifier, "trained" on the cluster centroids only.

Thralldom answered 10/3, 2014 at 16:58 Comment(2)
Why is sum of squares more correct than Euclidean distance? Isn't the difference only the square root, as you have mentioned?Pshaw
The least squares optimum found by k-means is not the least non-squares solution. so to be formally consistent with the k-means optimization problem use the squares.Thralldom
H
8

Yes, we can do classification.

I wouldn't say the algorithm itself (like #1) is particularly well-suited to classifying points, as incorporating data to be classified into your training data tends to be frowned upon (unless you have a real-time system, but I think elaborating on this would get a bit far from the point).

To classify a new point, simply calculate the Euclidean distance to each cluster centroid to determine the closest one, then classify it under that cluster.

There are data structures that allows you to more efficiently determine the closest centroid (like a kd-tree), but the above is the basic idea.

Hujsak answered 10/3, 2014 at 13:5 Comment(4)
so , is my option (2) right ? I need to record the last cluster centroid to determine the closet cluster the new data belong to ?Tague
If by recording the last cluster centroid, you mean you'll remember the most recent centroid for the each cluster, then yes, #2 is correct.Hujsak
You don't benefit much from a kd-tree, unless you have a very large k - you store the centroids only, not the full data set.Thralldom
@Anony-Mousse Not just if you have a very large k, but also just if you have many points to classify (assuming k is fairly large, of course).Hujsak
A
3

If you've already done k-means clustering on your data to get two clusters, then you could use k Nearest Neighbors on the new data point to find out which class it belongs to.

Averroes answered 10/3, 2014 at 13:17 Comment(1)
I agree with Dukeling. After k-means Clustering algorithm converges, it can be used for classification, with few labeled exemplars. After finding the closest centroid to the new point/sample to be classified, you only know which cluster it belongs to. Here you need a supervisory step to label each cluster. Suppose you label each cluster as C1,C2 and C3 for example. This requires few samples with known labels from these clusters. Once this is done, the closet cluster label becomes the label of the new point/sample and thus it is classified as C1, C2 or C3.Jamestown
F
3

Here another method:

I saw it on "The Elements of Statistical Learning". I'll change the notation a little bit. Let C be the number of classes and K the number of clusters. Now, follow these steps:

  1. Apply K-means clustering to the training data in each class seperately, using K clusters per class.
  2. Assign a class label to each of the C*K clusters.
  3. Classify observation x to the class of the closest cluster.

It seems like a nice approach for classification that reduces data observations by using clusters.

Fumble answered 16/10, 2018 at 0:37 Comment(2)
But how can we know the label of cluster?Blase
can use majority votingSorcim
F
1

If you are doing real-time analysis where you want to recognize new conditions during use (or adapt to a changing system), then you can choose some radius around the centroids to decide whether a new point starts a new cluster or should be included in an existing one. (That's a common need in monitoring of plant data, for instance, where it may take years after installation before some operating conditions occur.) If real-time monitoring is your case, check RTEFC or RTMAC, which are efficient, simple real-time variants of K-means. RTEFC in particular, which is non-iterative. See http://gregstanleyandassociates.com/whitepapers/BDAC/Clustering/clustering.htm

Yes, you can use that for classification. If you've decided you have collected enough data for all possible cases, you can stop updating the clusters, and just classify new points based on the nearest centroid. As in any real-time method, there will be sensitivity to outliers - e.g., a caused by sensor error or failure when using sensor data. If you create new clusters, outliers could be considered legitimate if one purpose of the clustering is identify faults in the sensors, although that the most useful when you can do some labeling of clusters.

Ferryman answered 12/10, 2017 at 16:47 Comment(0)
P
1

You are confusing the concepts of clustering and classification. When you have labeled data, you already know how the data is clustered according to the labels and there is no point in clustering the data unless if you want to find out how well your features can discriminate the classes.

If you run the k-means algorithm to find the centroid of each class and then use the distances from the centroids to classify a new data point, you in fact implement a form of the linear discriminant analysis algorithm assuming the same multiple-of-identity covariance matrix for all classes.

Pneumonectomy answered 12/9, 2019 at 5:1 Comment(0)
J
1

After k-means Clustering algorithm converges, it can be used for classification, with few labeled exemplars/training data. It is a very common approach when the number of training instances(data) with labels are very limited due to high cost of labeling.

Jamestown answered 29/4, 2020 at 18:7 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.