Python Clustering 'purity' metric

Asked 2/12, 2015 at 16:14 Answered 12/7, 2019 at 19:31

I'm using a Gaussian Mixture Model (GMM) from sklearn.mixture to perform clustering of my data set.

I could use the function score() to compute the log probability under the model.

However, I am looking for a metric called 'purity' which is defined in this article.

How can I implement it in Python? My current implementation looks like this:

from sklearn.mixture import GMM

# X is a 1000 x 2 array (1000 samples of 2 coordinates).
# It is actually a 2 dimensional PCA projection of data
# extracted from the MNIST dataset, but this random array
# is equivalent as far as the code is concerned.
X = np.random.rand(1000, 2)

clusterer = GMM(3, 'diag')
clusterer.fit(X)
cluster_labels = clusterer.predict(X)

# Now I can count the labels for each cluster..
count0 = list(cluster_labels).count(0)
count1 = list(cluster_labels).count(1)
count2 = list(cluster_labels).count(2)

But I can not loop through each cluster in order to compute the confusion matrix (according this question)

Bortman answered 2/12, 2015 at 16:14 Comment(6)

That paper is pretty opaque. This answer on crossvalidated simplifies the procedure a bit. – Kisangani 2/12, 2015 at 16:23

Please post the code you have so far, and tell us about the data structures involved. – Kisangani 2/12, 2015 at 16:29

At the moment, my code is: from sklearn.mixture import GMM clusterer = GMM(5, 'diag') clusterer.fit(X) cluster_labels = clusterer.predict(X) I see that in order to compute the purity I need the confusion matrix. Now, my problem is that I can't loop through each cluster and count how many objects were classified as each class – Bortman 2/12, 2015 at 16:41

Alright. And what is X? Is it a numpy array? If so, what are its dimensions and what data does it contain? (Notice how I edited that code into the body of your question. Please do that from now on when you have something additional to share) :) – Kisangani 2/12, 2015 at 18:16

Yes, it's a NumPy array (1000L, 2L). Data are extracted from MNIST dataset (200 examples for 5 classes) and I read them as a float type. Then, I computed the PCA in order to reduce the dimensionality and now my task is to cluster X using GMM varying the number of clusters and to compute purity for every choice of number of clusters. – Bortman 2/12, 2015 at 19:25

You say "my problem is that I can't loop through each cluster and count...", but it's difficult to help with just that information. Please show us the problematic code and describe the problem by editing it into your question. – Kisangani 2/12, 2015 at 22:4

David's answer works but here is another way to do it.

import numpy as np
from sklearn import metrics

def purity_score(y_true, y_pred):
    # compute contingency matrix (also called confusion matrix)
    contingency_matrix = metrics.cluster.contingency_matrix(y_true, y_pred)
    # return purity
    return np.sum(np.amax(contingency_matrix, axis=0)) / np.sum(contingency_matrix)

Also if you need to compute Inverse Purity, all you need to do is replace "axis=0" by "axis=1".

Arri answered 3/8, 2018 at 12:30 Comment(1)

This is a very concise answer, even better using scikit-learn own functions! – Aromatic 12/8, 2018 at 16:20

sklearn doesn't implement a cluster purity metric. You have 2 options:

Implement the measurement using sklearn data structures yourself. This and this have some python source for measuring purity, but either your data or the function bodies need to be adapted for compatibility with each other.
Use the (much less mature) PML library, which does implement cluster purity.

Kisangani answered 2/12, 2015 at 16:29 Comment(0)

A very late contribution.

You can try to implement it like this, pretty much like in this gist

def purity_score(y_true, y_pred):
    """Purity score
        Args:
            y_true(np.ndarray): n*1 matrix Ground truth labels
            y_pred(np.ndarray): n*1 matrix Predicted clusters

        Returns:
            float: Purity score
    """
    # matrix which will hold the majority-voted labels
    y_voted_labels = np.zeros(y_true.shape)
    # Ordering labels
    ## Labels might be missing e.g with set like 0,2 where 1 is missing
    ## First find the unique labels, then map the labels to an ordered set
    ## 0,2 should become 0,1
    labels = np.unique(y_true)
    ordered_labels = np.arange(labels.shape[0])
    for k in range(labels.shape[0]):
        y_true[y_true==labels[k]] = ordered_labels[k]
    # Update unique labels
    labels = np.unique(y_true)
    # We set the number of bins to be n_classes+2 so that 
    # we count the actual occurence of classes between two consecutive bins
    # the bigger being excluded [bin_i, bin_i+1[
    bins = np.concatenate((labels, [np.max(labels)+1]), axis=0)

    for cluster in np.unique(y_pred):
        hist, _ = np.histogram(y_true[y_pred==cluster], bins=bins)
        # Find the most present label in the cluster
        winner = np.argmax(hist)
        y_voted_labels[y_pred==cluster] = winner

    return accuracy_score(y_true, y_voted_labels)

Aromatic answered 6/7, 2017 at 0:21 Comment(1)

Hi @Hadij, it does not always gives zeros, but indeed it has major flaws. I was notified (see comments that it didn't work when the true labels were unordered or/and not starting by zero. I have updated the function, feedbacks are appreciated – Aromatic 19/5, 2018 at 21:58

The currently top voted answer correctly implements the purity metric, but may not be the most appropriate metric in all cases, because it does not ensure that each predicted cluster label is assigned only once to a true label.

For example, consider a dataset that is very imbalanced, with 99 examples of one label and 1 example of another label. Then any clustering (e.g: having two equal clusters of size 50) will achieve purity of at least 0.99, rendering it a useless metric.

Instead, in cases where the number of clusters is the same as the number of labels, cluster accuracy may be more appropriate. This has the advantage of mirroring classification accuracy in an unsupervised setting. To compute cluster accuracy, we need to use the Hungarian algorithm to find the optimal matching between cluster labels and true labels. The SciPy function linear_sum_assignment does this:

import numpy as np
from sklearn import metrics
from scipy.optimize import linear_sum_assignment

def cluster_accuracy(y_true, y_pred):
    # compute contingency matrix (also called confusion matrix)
    contingency_matrix = metrics.cluster.contingency_matrix(y_true, y_pred)

    # Find optimal one-to-one mapping between cluster labels and true labels
    row_ind, col_ind = linear_sum_assignment(-contingency_matrix)

    # Return cluster accuracy
    return contingency_matrix[row_ind, col_ind].sum() / np.sum(contingency_matrix)

Dysphagia answered 12/7, 2019 at 19:31 Comment(3)

I appreciate your concern. However, purity is not based on a one-to-one mapping because, generally, the numbers of clusters and classes are different. It is true that a trivial way to achieve a score of purity of 1 is by putting each data point in its own cluster. For more, please see Introduction to Information Retrieval (book) or how-to-calculate-purity – Arri 18/7, 2019 at 20:4

Sorry -- on closer reading, your code does correctly implement the purity metric described in the link – Dysphagia 18/7, 2019 at 21:40

linear_sum_assignment only works if the number of clusters is equal to the number of classes. otherwise, surplus classes/clusters are dropped. – Constructionist 21/9, 2021 at 14:6

Recommended topics

Hot tags