scikit-learn: Finding the features that contribute to each KMeans cluster

6

22

Say you have 10 features you are using to create 3 clusters. Is there a way to see the level of contribution each of the features have for each of the clusters?

What I want to be able to say is that for cluster k1, features 1, 4, and 6 were the primary features, whereas cluster k2's primary features were 2, 5, and 7.

This is the basic setup of what I am using:

from sklearn.cluster import KMeans

k_means = KMeans(init='k-means++', n_clusters=3, n_init=10)
k_means.fit(data_features)
k_means_labels = k_means.labels_
Stendhal answered 15/12, 2014 at 19:1 Comment(0)
27

You can use

Principal Component Analysis (PCA)

PCA can be done by eigenvalue decomposition of a data covariance (or correlation) matrix or singular value decomposition of a data matrix, usually after mean centering (and normalizing or using Z-scores) the data matrix for each attribute. The results of a PCA are usually discussed in terms of component scores, sometimes called factor scores (the transformed variable values corresponding to a particular data point), and loadings (the weight by which each standardized original variable should be multiplied to get the component score).

Some essential points:

  • the eigenvalues reflect the portion of variance explained by the corresponding component. Say we have 4 features with eigenvalues 1, 4, 1, 2. These are the variances explained by the corresponding vectors. The second value belongs to the first principal component, as it explains 50 % of the overall variance, and the last value belongs to the second principal component, explaining 25 % of the overall variance (see the short sketch after this list).
  • the eigenvectors are the components' linear combinations. They give the weights for the features, so you can tell which feature has a high or low impact.
  • use PCA based on the correlation matrix instead of the empirical covariance matrix if the eigenvalues differ strongly in magnitude.
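
As a quick illustration of the first point, the share of variance explained by each component is simply its eigenvalue divided by the sum of all eigenvalues (a small sketch using the toy eigenvalues 1, 4, 1, 2 from above):

import numpy as np

eigenvalues = np.array([1.0, 4.0, 1.0, 2.0])    # one eigenvalue per component
explained = eigenvalues / eigenvalues.sum()     # portion of the overall variance per component
print(explained)                                # [0.125 0.5   0.125 0.25 ] -> 50 % and 25 % for the two largest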

Sample approach

  • do PCA on entire dataset (that's what the function below does)
    • take matrix with observations and features
    • center it to its average (average of feature values among all observations)
    • compute the empirical covariance matrix (e.g. np.cov) or the correlation matrix (see above)
    • perform decomposition
    • sort eigenvalues and eigenvectors by eigenvalues to get components with highest impact
    • use components on original data
  • examine the clusters in the transformed dataset. By checking their location on each component you can derive the features with high and low impact on distribution/variance

Sample function

You need to import numpy as np and scipy as sp. The function uses sp.linalg.eigh for the decomposition. You might also want to have a look at the scikit-learn decomposition module.

PCA is performed on a data matrix with observations (objects) in rows and features in columns.

def dim_red_pca(X, d=0, corr=False):
    r"""
    Performs principal component analysis.

    Parameters
    ----------
    X : array, (n, d)
        Original observations (n observations, d features)

    d : int
        Number of principal components (default is ``0`` => all components).

    corr : bool
        If true, the PCA is performed based on the correlation matrix.

    Notes
    -----
    All eigenvalues and eigenvectors are always returned,
    regardless of the desired number of components ``d``.

    Returns
    -------
    Xred : array, (n, m or d)
        Reduced data matrix

    e_values : array, (m)
        The eigenvalues, sorted in descending manner.

    e_vectors : array, (m, m)
        The eigenvectors (one per column), sorted corresponding to the eigenvalues.

    """
    # Center to average
    X_ = X-X.mean(0)
    # Compute correlation / covariance matrix
    if corr:
        CO = np.corrcoef(X_.T)
    else:
        CO = np.cov(X_.T)
    # Compute eigenvalues and eigenvectors
    e_values, e_vectors = sp.linalg.eigh(CO)

    # Sort the eigenvalues and the eigenvectors descending
    idx = np.argsort(e_values)[::-1]
    e_vectors = e_vectors[:, idx]
    e_values = e_values[idx]
    # Keep only the desired number of components (all if d <= 0)
    d_e_vecs = e_vectors
    if d > 0:
        d_e_vecs = e_vectors[:, :d]
    # Project the centered data onto the selected principal components
    Xred = np.dot(X_, d_e_vecs)
    return Xred, e_values, e_vectors
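
If you prefer the scikit-learn decomposition module mentioned above, here is a rough, non-authoritative sketch of an equivalent using sklearn.decomposition.PCA (it centers the data internally; components_ holds the loadings, fit_transform returns the scores, and explained_variance_ only contains the variances of the kept components):

from sklearn.decomposition import PCA

def dim_red_pca_sklearn(X, d=2):
    """Rough counterpart of dim_red_pca using scikit-learn."""
    pca = PCA(n_components=d)            # the data is centered internally
    scores = pca.fit_transform(X)        # (n, d) component scores
    e_values = pca.explained_variance_   # variances of the kept components
    e_vectors = pca.components_.T        # columns are the component loadings
    return scores, e_values, e_vectors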

Sample usage

Here's a sample script which makes use of the function above and uses scipy.cluster.vq.kmeans2 for clustering. Note that the results vary with each run, because the starting clusters are initialized randomly.

import numpy as np
import scipy as sp
from scipy.cluster.vq import kmeans2
import matplotlib.pyplot as plt

SN = np.array([ [1.325, 1.000, 1.825, 1.750],
                [2.000, 1.250, 2.675, 1.750],
                [3.000, 3.250, 3.000, 2.750],
                [1.075, 2.000, 1.675, 1.000],
                [3.425, 2.000, 3.250, 2.750],
                [1.900, 2.000, 2.400, 2.750],
                [3.325, 2.500, 3.000, 2.000],
                [3.000, 2.750, 3.075, 2.250],
                [2.075, 1.250, 2.000, 2.250],
                [2.500, 3.250, 3.075, 2.250],
                [1.675, 2.500, 2.675, 1.250],
                [2.075, 1.750, 1.900, 1.500],
                [1.750, 2.000, 1.150, 1.250],
                [2.500, 2.250, 2.425, 2.500],
                [1.675, 2.750, 2.000, 1.250],
                [3.675, 3.000, 3.325, 2.500],
                [1.250, 1.500, 1.150, 1.000]], dtype=float)
    
clust,labels_ = kmeans2(SN,3)    # cluster with 3 random initial clusters
# PCA on orig. dataset 
# Xred will have only 2 columns, the first two princ. comps.
# evals has shape (4,) and evecs (4,4). We need all eigenvalues 
# to determine the portion of variance
Xred, evals, evecs = dim_red_pca(SN,2)   

xlab = '1. PC - ExpVar = {:.2f} %'.format(evals[0]/sum(evals)*100) # determine variance portion
ylab = '2. PC - ExpVar = {:.2f} %'.format(evals[1]/sum(evals)*100)
# plot the clusters, each set separately
plt.figure()    
ax = plt.gca()
scatterHs = []
clr = ['r', 'b', 'k']
for cluster in set(labels_):
    scatterHs.append(ax.scatter(Xred[labels_ == cluster, 0], Xred[labels_ == cluster, 1], 
                   color=clr[cluster], label='Cluster {}'.format(cluster)))
plt.legend(handles=scatterHs,loc=4)
plt.setp(ax, title='First and Second Principal Components', xlabel=xlab, ylabel=ylab)
# plot also the eigenvectors for deriving the influence of each feature
fig, ax = plt.subplots(2, 1)
ax[0].bar([1, 2, 3, 4], evecs[:, 0])   # weights of the first eigenvector (one bar per feature)
plt.setp(ax[0], title="First and Second Components' Eigenvectors", ylabel='Weight')
ax[1].bar([1, 2, 3, 4], evecs[:, 1])   # weights of the second eigenvector
plt.setp(ax[1], xlabel='Features', ylabel='Weight')
plt.show()

Output

The eigenvectors show the weighting of each feature for the component

[Figure: scatter plot of the clusters on the first and second principal components]

[Figure: bar plots of the feature weights of the first and second eigenvectors]

Short Interpretation

Let's just have a look at cluster zero, the red one. We'll mostly be interested in the first component, as it explains about 3/4 of the distribution. The red cluster is in the upper area of the first component; all of its observations yield rather high values there. What does that mean? Looking at the linear combination of the first component, we see at first glance that the second feature is rather unimportant (for this component). The first and fourth features are the highest weighted and the third one has a negative weight. This means that, since all red points have a rather high score on the first PC, these observations will have high values in the first and fourth feature and, at the same time, low values in the third feature.

Concerning the second feature, we can have a look at the second PC. However, note that the overall impact is far smaller, as this component explains only roughly 16 % of the variance, compared to the ~74 % of the first PC.
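
If you want to turn this reading of the plots into code, one possibility (a sketch based on the variables from the script above, not a definitive recipe) is to map each cluster's mean component scores back to the feature space; large absolute values then mark the features on which the cluster deviates most from the overall average:

# For each cluster: mean score on the kept components, mapped back to feature space
n_keep = 2                                                 # components kept in Xred
for cluster in set(labels_):
    mean_scores = Xred[labels_ == cluster].mean(axis=0)    # (n_keep,)
    deviation = evecs[:, :n_keep].dot(mean_scores)         # per-feature deviation from the overall mean
    ranking = np.argsort(-np.abs(deviation))               # most influential features first
    print('Cluster {}: feature ranking {}'.format(cluster, ranking))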

Minim answered 21/12, 2014 at 11:6 Comment(0)
4

You can do it this way:

>>> import numpy as np
>>> import sklearn.cluster as cl
>>> data = np.array([99,1,2,103,44,63,56,110,89,7,12,37])
>>> k_means = cl.KMeans(init='k-means++', n_clusters=3, n_init=10)
>>> k_means.fit(data[:,np.newaxis]) # [:,np.newaxis] converts data from 1D to 2D
>>> k_means_labels = k_means.labels_
>>> k1,k2,k3 = [data[np.where(k_means_labels==i)] for i in range(3)] # range(3) because 3 clusters
>>> k1
array([44, 63, 56, 37])
>>> k2
array([ 99, 103, 110,  89])
>>> k3
array([ 1,  2,  7, 12])
Perfecto answered 15/12, 2014 at 19:45 Comment(0)
2

Try this,

estimator = KMeans()
estimator.fit(X)
res = estimator.__dict__
print(res['cluster_centers_'])

You will get a matrix of clusters and feature weights; from that you can conclude that the feature having more weight takes the major part in contributing to the cluster.
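
For reference, a minimal sketch that reads the attribute directly instead of going through __dict__ (X stands for whatever data you fit on; as the comment below points out, these are cluster center coordinates rather than learned feature weights):

from sklearn.cluster import KMeans

estimator = KMeans(n_clusters=3, n_init=10)
estimator.fit(X)                                    # X: (n_samples, n_features), assumed given
for i, center in enumerate(estimator.cluster_centers_):
    print('Cluster {} center coordinate per feature: {}'.format(i, center))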

Deceit answered 21/8, 2017 at 6:19 Comment(1)
cluster_centers_ does not return feature_weights but cluster positions. – Lammond
1

I assume that by saying "a primary feature" you mean one that had the biggest impact on the class. A nice exploration you can do is to look at the coordinates of the cluster centers. For example, plot, for each feature, its coordinate in each of the K centers.

Of course, any features that are on a large scale will have a much larger effect on the distance between the observations, so make sure your data is well scaled before performing any analysis.
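
A minimal sketch of that exploration (X is assumed to be your feature matrix; the data is standardized first, then each feature's coordinate in every center is plotted):

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(X)        # put all features on the same scale
km = KMeans(n_clusters=3, n_init=10).fit(X_scaled)

plt.figure()
for k, center in enumerate(km.cluster_centers_):
    plt.plot(range(X_scaled.shape[1]), center, marker='o', label='Center {}'.format(k))
plt.xlabel('Feature index')
plt.ylabel('Center coordinate (standardized)')
plt.legend()
plt.show()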

Instancy answered 16/12, 2014 at 7:20 Comment(1)
On the importance of scaling: scikit-learn.org/dev/auto_examples/preprocessing/… – Ceremonial
1

A method I came up with is calculating the standard deviation of each feature in relation to its range, basically a measure of how the data is spread across each feature.

The smaller the spread, the better the feature characterizes the cluster, basically:

1 - (std(x) / (max(x) - min(x)))

I wrote an article and a class for it:

https://github.com/GuyLou/python-stuff/blob/main/pluster.py

https://medium.com/@guylouzon/creating-clustering-feature-importance-c97ba8133c37
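
A minimal sketch of that score, as I read the formula above (not the class from the links):

import numpy as np

def spread_score(X, labels):
    """1 - std/(max - min) per cluster and feature; higher means a tighter, more defining feature."""
    scores = {}
    for k in np.unique(labels):
        Xk = X[labels == k]
        rng = Xk.max(axis=0) - Xk.min(axis=0)
        scores[k] = 1.0 - Xk.std(axis=0) / np.where(rng == 0, 1.0, rng)
    return scores   # dict: cluster label -> array of per-feature scores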

Consonantal answered 13/9, 2021 at 19:44 Comment(0)
-1

It might be difficult to talk about feature importance separately for each cluster. Rather, it could be better to talk globally about which features are most important for separating different clusters.

For this goal, a very simple method is described as follows. Note that the squared Euclidean distance between two cluster centers is a sum of squared differences between the individual features. We can then just use the squared difference as the weight for each feature.

Squared Euclidean distance between two centers c1 and c2: d(c1, c2)² = Σ_j (c1_j − c2_j)²
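
A minimal sketch of that weighting for two cluster centers (for more than two clusters you could, for example, sum the squared differences over all pairs of centers; the centers are assumed to come from a fitted KMeans, e.g. k_means.cluster_centers_):

import numpy as np

def feature_weights(c1, c2):
    """Per-feature squared difference between two cluster centers.

    The sum of these terms is the squared Euclidean distance between the
    centers, so each term can be read as that feature's share of the separation.
    """
    diff_sq = (np.asarray(c1, dtype=float) - np.asarray(c2, dtype=float)) ** 2
    return diff_sq / diff_sq.sum()     # normalized to fractions of the squared distance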

Luralurch answered 2/11, 2019 at 15:13 Comment(0)
