I'm using a Gaussian Mixture Model (GMM) from sklearn.mixture
to perform clustering of my data set.
I can use the score() function
to compute the log probability under the model.
However, I am looking for a metric called 'purity' which is defined in this article.
How can I implement it in Python? My current implementation looks like this:
import numpy as np
from sklearn.mixture import GMM
# X is a 1000 x 2 array (1000 samples of 2 coordinates).
# It is actually a 2 dimensional PCA projection of data
# extracted from the MNIST dataset, but this random array
# is equivalent as far as the code is concerned.
X = np.random.rand(1000, 2)
clusterer = GMM(n_components=3, covariance_type='diag')
clusterer.fit(X)
cluster_labels = clusterer.predict(X)
# Now I can count the labels for each cluster:
count0 = list(cluster_labels).count(0)
count1 = list(cluster_labels).count(1)
count2 = list(cluster_labels).count(2)
But I cannot loop through each cluster in order to compute the confusion matrix (according to this question).
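For what it's worth, purity needs ground-truth class labels, which the random `X` above does not have. Assuming a label array `y_true` aligned with `cluster_labels`, a minimal sketch of the purity metric (the `purity_score` helper is hypothetical, not part of sklearn) could look like this:

```python
import numpy as np

# Hypothetical helper: purity as usually defined -- for each cluster,
# take the count of its most frequent true class, sum those counts
# over all clusters, and divide by the total number of samples.
def purity_score(y_true, y_pred):
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    total = 0
    for cluster in np.unique(y_pred):
        members = y_true[y_pred == cluster]   # true labels inside this cluster
        total += np.bincount(members).max()   # size of the majority class
    return total / y_true.size

# Toy example: cluster 1 absorbs one point whose true class is 0.
y_true = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
y_pred = np.array([0, 0, 1, 1, 1, 1, 2, 2, 2])
print(purity_score(y_true, y_pred))  # 8/9 ≈ 0.889
```

Note that purity is 1.0 for a perfect clustering, but it also trivially approaches 1.0 as the number of clusters grows, so it is usually reported alongside the number of clusters.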
My code: `from sklearn.mixture import GMM; clusterer = GMM(5, 'diag'); clusterer.fit(X); cluster_labels = clusterer.predict(X)`
I see that in order to compute the purity I need the confusion matrix. Now, my problem is that I can't loop through each cluster and count how many objects were classified as each class. – BortmanX
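The per-cluster class counts the comment asks for form a contingency (confusion) matrix, and a plain loop is enough to build one. A sketch, assuming a ground-truth array `y_true` aligned with the predicted cluster labels:

```python
import numpy as np

# Assumed toy labels; in the question these would come from MNIST.
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 0, 1, 2, 2, 2])

n_classes = y_true.max() + 1
n_clusters = y_pred.max() + 1
cm = np.zeros((n_classes, n_clusters), dtype=int)
for t, p in zip(y_true, y_pred):
    cm[t, p] += 1          # rows = true classes, columns = clusters

print(cm)
# [[2 0 0]
#  [0 1 1]
#  [0 0 2]]

# Purity then falls out of the matrix: per-cluster majority / total.
print(cm.max(axis=0).sum() / cm.sum())  # 5/6 ≈ 0.833
```

sklearn also ships a ready-made builder for this matrix, `sklearn.metrics.cluster.contingency_matrix(y_true, y_pred)`, which avoids the explicit loop.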
What is `X`? Is it a numpy array? If so, what are its dimensions and what data does it contain? (Notice how I edited that code into the body of your question. Please do that from now on when you have something additional to share.) :) – Kisangani