How to get the samples in each cluster?
A

7

51

I am using KMeans from sklearn.cluster. Once the clustering is finished, how can I find out which values were grouped together?

Say I had 100 data points and KMeans gave me 5 clusters. Now I want to know which data points are in cluster 5. How can I do that?

Is there a function that takes a cluster id and lists all the data points in that cluster?

Alten answered 24/3, 2016 at 7:56 Comment(2)
I just provided an answer addressing your question. Let me know if this helps.Equality
you can use .labels_ to checkTonetic
D
53

I had a similar requirement, and I am using pandas to create a new DataFrame with the index of the dataset and the cluster labels as columns.

import pandas as pd
from sklearn.cluster import KMeans

data = pd.read_csv('filename')

km = KMeans(n_clusters=5).fit(data)

cluster_map = pd.DataFrame()
cluster_map['data_index'] = data.index.values
cluster_map['cluster'] = km.labels_

Once the DataFrame is available, it is easy to filter. For example, to select all data points in cluster 3:

cluster_map[cluster_map.cluster == 3]
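
A common worry is whether the labels still line up with the original rows. labels_ holds one entry per sample, in the order the samples were passed to fit, so pairing it with data.index.values keeps each label next to its own row. A minimal sketch on hypothetical toy data (the column names, the index labels, and the n_init and random_state values are assumptions for illustration), which also collects the index labels of every cluster in one step:

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical toy data with a non-default index, so that row position
# and index label are clearly different things.
data = pd.DataFrame(
    {"x": [0.0, 0.1, 5.0, 5.1, 10.0, 10.1],
     "y": [0.0, 0.2, 5.0, 5.2, 10.0, 10.2]},
    index=["a", "b", "c", "d", "e", "f"],
)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(data)

# labels_[i] is the cluster assigned to the i-th row passed to fit().
cluster_map = pd.DataFrame(
    {"data_index": data.index.values, "cluster": km.labels_}
)

# Map every cluster id to the list of index labels of its members.
members = cluster_map.groupby("cluster")["data_index"].apply(list).to_dict()
print(members)
```

The cluster ids themselves are arbitrary, so which group ends up as 0, 1, or 2 can change with the initialization.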
Diestock answered 29/4, 2017 at 14:33 Comment(3)
there is no need to use pandasEquality
When learning new models, I seem to struggle with this last part of returning the modeled data back to the original source. Most tutorials do not show that. Thank you for your answer.Urania
@Diestock Are you sure that it is going to be indexed correctly? Does your solution preserve order of rows when reconstructing dataframe from km.labels_ as it was before clustering?Predacious
V
22

If you have a large dataset and you need to extract clusters on-demand you'll see some speed-up using numpy.where. Here is an example on the iris dataset:

from sklearn.cluster import KMeans
from sklearn import datasets
import numpy as np

iris = datasets.load_iris()
X = iris.data
y = iris.target

km = KMeans(n_clusters=3)
km.fit(X)

Define a function that returns the indices of the samples in a given cluster. (Two versions are shown for benchmarking; both return the same values.)

def ClusterIndicesNumpy(clustNum, labels_array): #numpy 
    return np.where(labels_array == clustNum)[0]

def ClusterIndicesComp(clustNum, labels_array): #list comprehension
    return np.array([i for i, x in enumerate(labels_array) if x == clustNum])

Let's say you want the indices of all samples in cluster 2 (cluster numbering depends on the random initialization, so your indices may differ):

ClusterIndicesNumpy(2, km.labels_)
array([ 52,  77, 100, 102, 103, 104, 105, 107, 108, 109, 110, 111, 112,
       115, 116, 117, 118, 120, 122, 124, 125, 128, 129, 130, 131, 132,
       134, 135, 136, 137, 139, 140, 141, 143, 144, 145, 147, 148])

Numpy wins the benchmark:

%timeit ClusterIndicesNumpy(2,km.labels_)

100000 loops, best of 3: 4 µs per loop

%timeit ClusterIndicesComp(2,km.labels_)

1000 loops, best of 3: 479 µs per loop

Now you can extract all of your cluster 2 data points like so:

X[ClusterIndicesNumpy(2,km.labels_)]

array([[ 6.9,  3.1,  4.9,  1.5], 
       [ 6.7,  3. ,  5. ,  1.7],
       [ 6.3,  3.3,  6. ,  2.5], 
       ... #truncated

Double-check the first three indices from the truncated array above:

print(X[52], km.labels_[52])
print(X[77], km.labels_[77])
print(X[100], km.labels_[100])

[ 6.9  3.1  4.9  1.5] 2
[ 6.7  3.   5.   1.7] 2
[ 6.3  3.3  6.   2.5] 2
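
If you need the indices of every cluster rather than one at a time, calling np.where once per cluster scans the label array repeatedly. One possible alternative, sketched here under the assumption that the labels are a 1-D integer array like km.labels_, is to sort the labels once and split the sorted positions. The function name all_cluster_indices and the small labels array are hypothetical:

```python
import numpy as np

def all_cluster_indices(labels_array):
    """Return {cluster_id: array of sample indices} using a single sort."""
    # A stable sort keeps the indices inside each cluster in ascending order.
    order = np.argsort(labels_array, kind="stable")
    sorted_labels = labels_array[order]
    # First position of each distinct label in the sorted array.
    cluster_ids, starts = np.unique(sorted_labels, return_index=True)
    groups = np.split(order, starts[1:])
    return dict(zip(cluster_ids.tolist(), groups))

# Hypothetical labels standing in for km.labels_
labels = np.array([1, 0, 2, 1, 0, 2, 2])
groups = all_cluster_indices(labels)
print(groups[2])
```

Each entry of the result matches what np.where(labels == cluster_id)[0] would return for that cluster.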
Virge answered 25/3, 2016 at 18:27 Comment(0)
S
9

Actually, a very simple way to do this is:

clusters = KMeans(n_clusters=5).fit(df)
df[clusters.labels_ == 0]

The second line returns all rows of df that belong to cluster 0. You can get the members of the other clusters the same way.
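
To collect every cluster in one go, the same boolean mask can be applied once per distinct label. This is only a sketch: the small DataFrame and the hard-coded labels array are hypothetical stand-ins for df and clusters.labels_.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a non-default index.
df = pd.DataFrame({"x": [1.0, 1.1, 8.0, 8.2, 1.2]}, index=[10, 11, 12, 13, 14])

# Hypothetical stand-in for clusters.labels_ (one label per row of df).
labels = np.array([0, 0, 1, 1, 0])

# The mask is positional: row i is kept when labels[i] equals the cluster id.
per_cluster = {int(c): df[labels == c] for c in np.unique(labels)}

print(per_cluster[0])
```

Because the mask is a plain NumPy array, it is matched to the rows by position, so the original index values of df are carried through unchanged.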

Streak answered 19/3, 2020 at 17:7 Comment(2)
This is elegant, but I wonder if there is a way to retrieve the indexes of the elements in df that have label 0 in this case.Dumbarton
@Dumbarton df[clusters.labels_==0].index or df.index[clusters.labels_==0]Mighell
E
5

To get the IDs of the points/samples/observations that are inside each cluster, do this:

Python 2

Example using Iris data and a nice pythonic way:

import numpy as np
from sklearn.cluster import KMeans
from sklearn import datasets

np.random.seed(0)

# Use Iris data
iris = datasets.load_iris()
X = iris.data
y = iris.target

# KMeans with 3 clusters
clf = KMeans(n_clusters=3)
clf.fit(X)  # KMeans is unsupervised; a y argument would be ignored

#Coordinates of cluster centers with shape [n_clusters, n_features]
clf.cluster_centers_
#Labels of each point
clf.labels_

# Nice Pythonic way to get the indices of the points for each corresponding cluster
mydict = {i: np.where(clf.labels_ == i)[0] for i in range(clf.n_clusters)}

# Transform this dictionary into list (if you need a list as result)
dictlist = []
for key, value in mydict.iteritems():
    temp = [key,value]
    dictlist.append(temp)

RESULTS

#dict format
{0: array([ 50,  51,  53,  54,  55,  56,  57,  58,  59,  60,  61,  62,  63,
            64,  65,  66,  67,  68,  69,  70,  71,  72,  73,  74,  75,  76,
            78,  79,  80,  81,  82,  83,  84,  85,  86,  87,  88,  89,  90,
            91,  92,  93,  94,  95,  96,  97,  98,  99, 101, 106, 113, 114,
           119, 121, 123, 126, 127, 133, 138, 142, 146, 149]),
 1: array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
           17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
           34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49]),
 2: array([ 52,  77, 100, 102, 103, 104, 105, 107, 108, 109, 110, 111, 112,
           115, 116, 117, 118, 120, 122, 124, 125, 128, 129, 130, 131, 132,
           134, 135, 136, 137, 139, 140, 141, 143, 144, 145, 147, 148])}

# list format
[[0, array([ 50,  51,  53,  54,  55,  56,  57,  58,  59,  60,  61,  62,  63,
             64,  65,  66,  67,  68,  69,  70,  71,  72,  73,  74,  75,  76,
             78,  79,  80,  81,  82,  83,  84,  85,  86,  87,  88,  89,  90,
             91,  92,  93,  94,  95,  96,  97,  98,  99, 101, 106, 113, 114,
             119, 121, 123, 126, 127, 133, 138, 142, 146, 149])],
 [1, array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
            17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
            34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49])],
 [2, array([ 52,  77, 100, 102, 103, 104, 105, 107, 108, 109, 110, 111, 112,
             115, 116, 117, 118, 120, 122, 124, 125, 128, 129, 130, 131, 132,
             134, 135, 136, 137, 139, 140, 141, 143, 144, 145, 147, 148])]]

Python 3

Just change

for key, value in mydict.iteritems():

to

for key, value in mydict.items():
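
In Python 3 the loop can also be collapsed into a list comprehension. A minimal sketch, using a short hard-coded labels array as a hypothetical stand-in for clf.labels_:

```python
import numpy as np

# Hypothetical stand-in for clf.labels_ with 3 clusters.
labels = np.array([1, 1, 0, 2, 0])

# Same dictionary as above: cluster id -> indices of its points.
mydict = {i: np.where(labels == i)[0] for i in range(3)}

# List form, built in one expression instead of an explicit loop.
dictlist = [[key, value] for key, value in mydict.items()]
print(dictlist)
```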
Equality answered 11/6, 2018 at 18:38 Comment(2)
For those who are working with python3 and encountering a problem with this solution, you just need to change iteritems() to items()Sech
Indeed, my answer is in Python 2. I am going to update it now for Python 3 as well. cheersEquality
M
3

You can look at the labels_ attribute.

For example:

from sklearn.cluster import KMeans

km = KMeans(n_clusters=2)
km.fit([[1,2,3],[2,3,4],[5,6,7]])
print(repr(km.labels_))
output: array([1, 1, 0], dtype=int32)

As you can see, the first and second points are in cluster 1 and the last point is in cluster 0.

Mousseline answered 24/3, 2016 at 9:7 Comment(2)
Yes, this method would work, but when there are a lot of data points, iterating through all of them to get the labels is not efficient, right? I just want the list of data points for a given cluster. Isn't there another way to do this?Alten
@Alten see the answer that I just postedEquality
M
0

You can simply store the labels in an array and convert the array to a data frame. Then merge the data that you used to fit KMeans with the new data frame of cluster labels.

Display the merged data frame. You should now see each row with its corresponding cluster. If you want to list all the data in a specific cluster, use something like data.loc[data['cluster_label_name'] == 2], assuming 2 is the cluster you are interested in.
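
The steps described above could look roughly like this. It is a sketch under assumptions: the toy height/weight data, the n_init and random_state values, and the use of DataFrame.join as the merge step are illustrative choices, not something the answer specifies.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical data used to fit KMeans.
data = pd.DataFrame({"height": [150, 152, 180, 182],
                     "weight": [50, 52, 80, 83]})

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)

# Store the labels in a data frame that shares the index of the data.
labels_df = pd.DataFrame({"cluster_label_name": km.labels_}, index=data.index)

# Merge the labels with the original data (index-aligned join).
merged = data.join(labels_df)
print(merged)

# List all rows in the same cluster as the first row.
target = merged.loc[0, "cluster_label_name"]
same_cluster = merged.loc[merged["cluster_label_name"] == target]
print(same_cluster)
```

Giving the label frame the same index as the data is what makes the join line up row by row.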

Maryalice answered 11/6, 2018 at 14:57 Comment(0)
M
0

If you want to stay in the pandas realm, a more idiomatic solution compared to Praveen's could look like this:

from sklearn.cluster import KMeans
import pandas as pd
kmeans = KMeans(n_clusters=5, n_init="auto").fit(data)
labels = pd.Series(kmeans.labels_, index=data.index, name='cluster')
clustered_data = pd.concat([data, labels], axis=1)
clustered_data.query("cluster == 3")

Combining the original data with the new column of clustering labels is the preferred approach when using a plotting library such as plotly.express, which takes an entire dataframe and a mapping from visualization channels to column names, as in:

import plotly.express as px
px.scatter(clustered_data, x="pca0", y="pca1", color="cluster")
Mighell answered 12/1 at 22:32 Comment(0)
