How to identify Cluster labels in kmeans scikit learn

I am learning scikit-learn in Python. The example linked below displays the top-occurring words in each cluster, not a cluster name.

http://scikit-learn.org/stable/auto_examples/document_clustering.html

I found that the km object has "km.labels_", which lists the centroid id for each sample, which is just a number.

I have two questions:

1. How do I generate the cluster labels?
2. How do I identify the members of the clusters for further processing?

I have a working knowledge of k-means and am aware of tf-idf concepts.

Dalmatic answered 5/2, 2015 at 13:0 Comment(1)
I met the same problem. Suppose you have a dataset made of 38 observations (rows) and 5 features (cols). You want 19 clusters. How do you know after kmeans clustering that, for example, observation 24 (row=24) falls in cluster 5? (Frosty)
  1. How do I generate the cluster labels?

I'm not sure what you mean by this. You have no cluster labels other than cluster 1, cluster 2, ..., cluster n. That is why it's called unsupervised learning, because there are no labels.

Do you mean you actually have labels and you want to see if the clustering algorithm happened to cluster the data according to your labels?

In that case, the documentation you linked to provides an example:

from sklearn import metrics

# labels: your ground-truth labels; km.labels_: cluster ids found by k-means
print("Homogeneity: %0.3f" % metrics.homogeneity_score(labels, km.labels_))
print("Completeness: %0.3f" % metrics.completeness_score(labels, km.labels_))
print("V-measure: %0.3f" % metrics.v_measure_score(labels, km.labels_))

  2. How to identify the members of the clusters for further processing.

See the documentation for KMeans. In particular, the predict method:

predict(X)

Parameters:
X : {array-like, sparse matrix}, shape = [n_samples, n_features]
    New data to predict.

Returns:
labels : array, shape [n_samples,]
    Index of the cluster each sample belongs to.

If you don't want to predict something new, km.labels_ should do that for the training data.
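
To make that concrete, here is a minimal sketch of reading cluster membership off km.labels_ and of calling predict on new points. The data and the names X, members, cluster0_rows and new_points are hypothetical, chosen only for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical data: 6 observations (rows), 2 features (columns)
X = np.array([[1.0, 1.0], [1.2, 0.9], [5.0, 5.1],
              [5.2, 4.9], [9.0, 9.1], [9.1, 8.8]])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# labels_[i] is the cluster index of row i of the training data
print("cluster of row 4:", km.labels_[4])

# Row indices of the members of each cluster
members = {k: np.where(km.labels_ == k)[0] for k in range(km.n_clusters)}
for k, idx in members.items():
    print(f"cluster {k}: rows {idx.tolist()}")

# The rows themselves, e.g. for further processing of cluster 0
cluster0_rows = X[km.labels_ == 0]

# predict() assigns previously unseen points to the nearest centroid
new_points = np.array([[1.1, 1.0], [9.0, 9.0]])
new_labels = km.predict(new_points)
print("new points fall in clusters:", new_labels.tolist())
```

Since km.labels_ is in the same order as the rows passed to fit, the same indexing works for any array or list that is aligned with X (document titles, ids and so on).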

Electrophoresis answered 5/2, 2015 at 14:8 Comment(1)
I'm facing a related issue. I'm trying to get a confusion matrix between my labels and km.labels_. My labels are strings and km.labels_ are integers, so it raises ValueError: Mix of label input types (string and number). Is there a way to get around this? (Tanishatanitansy)
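
One possible way around the mixed-type error is to encode the string labels as integers first, for instance with LabelEncoder. A small sketch under that assumption; true_labels and cluster_ids are hypothetical stand-ins for your own labels and km.labels_:

```python
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import LabelEncoder

# Hypothetical values standing in for your own labels and km.labels_
true_labels = np.array(["cat", "cat", "dog", "dog"])
cluster_ids = np.array([1, 1, 0, 0])

# Map each distinct string to an integer (sorted order: cat -> 0, dog -> 1)
encoder = LabelEncoder()
true_ids = encoder.fit_transform(true_labels)

# Rows are the encoded true labels, columns are cluster ids
cm = confusion_matrix(true_ids, cluster_ids)
print(encoder.classes_)
print(cm)
```

Keep in mind that cluster ids are arbitrary, so the diagonal of this matrix has no special meaning: in the toy data above every "cat" lands in cluster 1 even though "cat" is encoded as 0.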

Oh that's easy

My environment: scikit-learn version '0.20.0'

Just use the attribute .labels_ as in the docs: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html

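Working example:

from sklearn.cluster import KMeans
import numpy as np

# Two identical single-feature columns, joined into a (10, 2) array
x1 = [[1],[1],[2],[2],[2],[3],[3],[7],[7],[7]]
x2 = [[1],[1],[2],[2],[2],[3],[3],[7],[7],[7]]

X_2D = np.concatenate((x1, x2), axis=1)

kmeans = KMeans(n_clusters=4, init='k-means++', max_iter=300, n_init=10, random_state=0)
# fit() returns the fitted estimator itself, not the labels
kmeans.fit(X_2D)

# labels_ holds the cluster index assigned to each row of X_2D
print(kmeans.labels_)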

Output:

[2 2 3 3 3 0 0 1 1 1]

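So as you can see, we have 4 clusters, and each row of the X_2D array is assigned a cluster label from 0 to 3. The numbering itself is arbitrary, so the exact integers can differ with another random seed or scikit-learn version.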

Boxfish answered 7/4, 2020 at 19:10 Comment(0)
