How do I predict new data's cluster after clustering training data?
I have already trained my clustering model using hclust:

model = hclust(distances, method = "ward")

And the result looks good:

[dendrogram produced by hclust]

Now I have some new data records, and I want to predict which cluster each of them belongs to. How do I do that?

Pigtail asked 11/1, 2014 at 15:48 Comment(4)
What you are describing sounds more like classification. See, for example, the knn(...) function in package class. - Repertoire
@MrROY how did you solve the problem using knn? Do you have an example? - Felicitous
This uses knn: rdocumentation.org/packages/arules/versions/1.5-0/topics/… - Divergency
See my answer to a similar question: https://mcmap.net/q/348561/-scikit-learn-predicting-new-points-with-dbscan - Barthol
Clustering is not supposed to "classify" new data; as the name suggests, that is the core concept of classification.

Some clustering algorithms (centroid-based ones such as k-means or k-medians) can "label" a new instance based on the model they create. Unfortunately, hierarchical clustering is not one of them: it does not partition the input space, it just "connects" some of the objects given during clustering, so you cannot assign a new point to this model.

The only "solution" for using hclust to "classify" is to build another classifier on top of the labeled data produced by hclust. For example, you can train kNN (even with k = 1) on the data with the labels from hclust and use it to assign labels to new points.

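A minimal sketch of this idea. `distances` and `model` are from the question; `train_df` (the data frame of features behind `distances`) and `new_df` (the new records, with the same columns) are hypothetical names used only for illustration:

library(class)                      # provides knn()

labels <- cutree(model, k = 3)      # cut the dendrogram into, say, 3 clusters

# 1-nearest-neighbour "classifier" trained on the hclust labels
predicted <- knn(train = train_df, test = new_df, cl = labels, k = 1)

Any classifier could replace knn here; the point is simply to extend the hclust labels to unseen points.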
Teddi answered 11/1, 2014 at 18:1 Comment(7)
Great, the kNN idea is worth trying. - Pigtail
Does using the labels from the hierarchical clustering algorithm make sense if you choose linkage methods that do not encourage spherical/globular clusters? Single and complete linkage do not seem to foster spherical/globular clusters; in contrast, average, median, and Ward's method do. You could calculate the centroid or mean of the clusters output by hierarchical clustering and take these as the targets for the classifier, right? - Vineyard
You don't need to take any centroids. Just use the entire data + labels from the clusters and run a classifier on top (e.g. kNN, but any classifier will do). This is a natural way to "extend the structure discovered by the clustering to new points". It doesn't matter how the clusters were discovered. - Teddi
Good point! Do you think that centroid-based clusters are more amenable to interpretation? I'm struggling to think of a use case where we would want clusters that represent "real" differences but are not easily interpretable. The benefits of adding a classifier only make sense if the labels (from unsupervised learning) are useful. - Vineyard
Interpretability is a separate issue. Centroids are interpretable only in the case of highly convex clusters and over low-dimensional spaces (e.g. tabular data). So it all depends. You can have interpretable decisions without centroids (e.g. by training decision trees instead of kNNs). If your data is low-dimensional and tabular but the clusters are non-convex, you can use Gaussian mixture modelling, which effectively gives you multiple templates/centroids, and so on. There are also many use cases of clustering that are not about interpretability at all :) It all depends. - Teddi
How would one test the performance of the classifier (which sits 'on top of the labeled data given by hclust')? We only have one set (instance) of labels, which were generated by the (unsupervised) clustering algorithm. It's not as if we continuously get more examples or iterations of these labels. - Vineyard
If you don't know any labels then you can't really evaluate anything, no matter how you build it. Evaluation requires knowledge/assessment of reality, period. If you just want to check whether the classifier "extends the logic of a given clusterer", then all you have to do is run the clustering on the full data and record the labels, then run it on a subset, train the classifier on that subset, and see whether the classifier's predictions match what the original clustering of the whole data would do. - Teddi
As already mentioned, you can use a classifier such as class::knn to determine which cluster a new individual belongs to.

The kNN or k-nearest neighbors algorithm is one of the simplest machine learning algorithms and is an example of instance-based learning, where new data are classified based on stored, labeled instances. More specifically, the distance between the stored data and the new instance is calculated by means of some kind of similarity measure, typically a distance measure such as the Euclidean distance.

Below is example code for the iris data.

library(scorecard)   # split_df() for a train/test split
library(factoextra)  # fviz_dend() for plotting the dendrogram
library(class)       # knn()

# split iris into 75% train / 25% test
df_iris <- split_df(iris, ratio = 0.75, seed = 123)

# distance matrix on the scaled training features (column 5 is Species)
d_iris <- dist(scale(df_iris$train[, -5]))

# hierarchical clustering with Ward's method
hc_iris <- hclust(d_iris, method = "ward.D2")
fviz_dend(hc_iris, k = 3, cex = 0.5,
          k_colors = c("#00AFBB", "#E7B800", "#FC4E07"),
          color_labels_by_k = TRUE, ggtheme = theme_minimal())

# cut the tree into 3 clusters and check the group sizes
groups <- cutree(hc_iris, k = 3)
table(groups)

[dendrogram of hc_iris cut into k = 3 clusters]

Predict new data

# 1-nearest-neighbour classification of the test set, using the hclust groups as labels
knnClust <- knn(train = df_iris$train[, -5], test = df_iris$test[, -5], k = 1, cl = groups)
knnClust
 [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 3 2 3 3 3 2 2 2 2 2 3 3 2 2 3 2 2 2 2 2 2 2 2 2
Levels: 1 2 3

# p1 <- fviz_cluster(list(data = df_iris$train[,-5], cluster = groups), stand = F) + xlim(-11.2,-4.8) + ylim(-3,3) + ggtitle("train")
# p2 <- fviz_cluster(list(data = df_iris$test[,-5], cluster = knnClust),stand = F) + xlim(-11.2,-4.8) + ylim(-3,3) + ggtitle("test")
# gridExtra::grid.arrange(p1,p2,nrow = 2)

# project train and test onto their first two principal components
pca1 <- data.frame(prcomp(df_iris$train[,-5], scale. = T)$x[,1:2], cluster = as.factor(groups), factor = "train")
pca2 <- data.frame(prcomp(df_iris$test[,-5], scale. = T)$x[,1:2], cluster = as.factor(knnClust), factor = "test")
pca <- as.data.frame(rbind(pca1, pca2))

Plot train and test data

ggplot(pca, aes(x = PC1, y = PC2, color = cluster, size = 1, alpha = factor)) +
  geom_point(shape = 19) + theme_bw()

[scatter plot of the first two principal components, train and test points coloured by cluster]

Chameleon answered 27/5, 2019 at 22:41 Comment(0)
You can take the cluster labels from hclust as class labels and then use LDA to predict which class a new point falls into.

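A minimal sketch of this LDA idea, assuming the labels come from cutree() on the hclust model and using hypothetical data frames train_df (the clustered features) and new_df (the new records with the same columns):

library(MASS)                        # provides lda()

labels <- cutree(model, k = 3)       # class labels taken from the hclust tree
fit    <- lda(cluster ~ ., data = cbind(train_df, cluster = factor(labels)))
predict(fit, newdata = new_df)$class # predicted cluster for each new record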
Leukorrhea answered 11/1, 2017 at 8:14 Comment(0)
I faced a similar problem and worked out a provisional solution.

  1. In R, the hclust function gives labels for the training data.
  2. We can then fit a supervised learning model to connect those labels to the features.
  3. New data are processed in the same way as in any supervised learning workflow.
  4. For a binary classification setup, metrics such as the KS statistic or AUC can be used to gauge how well the clustering-derived labels can be reproduced.

Similarly, we can apply PCA to the features and extract PC1 as a label.

  1. Binning this continuous score gives a discrete label suitable for classification.
  2. New data are then handled exactly as in a standard classification model.

In R, I find that PCA runs much faster than hclust (Mayank 2016). In practice this approach makes the model easy to deploy, but I am not sure whether this provisional solution introduces bias into the predictions.
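A rough sketch of the PCA variant, again using hypothetical data frames train_df and new_df of numeric features:

pca     <- prcomp(train_df, scale. = TRUE)
pc1     <- pca$x[, 1]                           # first principal component of the training data
labels  <- cut(pc1, breaks = 3, labels = FALSE) # bin PC1 into 3 pseudo-classes

# new data are projected with the same rotation, then binned or classified the same way
pc1_new <- predict(pca, newdata = new_df)[, 1]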

Ref

Mayank. 2016. "hclust() in R on large datasets." Stack Overflow.

Danny answered 5/9, 2018 at 11:25 Comment(0)
Why not compute the centroid of the points in each hclust cluster, then assign a new point to the nearest centroid using the same distance function?

knn in class only looks at the k nearest neighbours and only supports Euclidean distance.

There's no need to run a classifier.

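A sketch of what this answer proposes, with hypothetical data frames train_df / new_df and groups from cutree(); note the comments below question whether centroids are meaningful for hierarchical clusters:

# mean (centroid) of every hclust cluster
centroids <- aggregate(train_df, by = list(cluster = groups), FUN = mean)[, -1]

# assign each new point to the nearest centroid under Euclidean distance
nearest_centroid <- function(x) which.min(colSums((t(centroids) - x)^2))
new_labels <- apply(new_df, 1, nearest_centroid)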
Divergency answered 21/5, 2015 at 19:32 Comment(2)
Because hierarchical clustering does not create clusters where the centroid is a well-defined object. You are far from the truth here; a classifier is needed in such a case. 1-NN (suggested above) is the simplest and probably sufficient solution (its code is even simpler than your suggestion) and it will work, while computing centroids will not. - Teddi
The above approach is more valid for kmeans. In regards to HCA, I wonder if a tree-splitting technique could be employed based on the results of the dendrogram? - Divergency
