How can I match up cluster labels to my 'ground truth' labels in Matlab
I have searched here and googled, but to no avail. When clustering in Weka there is a handy option, classes to clusters, which matches the clusters produced by the algorithm (e.g. simple k-means) to the 'ground truth' class labels you supply as the class attribute, so that you can see the cluster accuracy (% incorrect).

Now, how can I achieve this in Matlab, i.e. translate my clusterClasses vector, e.g. [1, 1, 2, 1, 3, 2, 3, 1, 1, 1], into the same numbering as the supplied ground truth labels vector, e.g. [2, 2, 2, 3, 1, 3]?

I think it is probably based on cluster centres and label centres, but I'm not sure how to implement it!

Any help would be greatly appreciated.

Vincent

Kerch answered 27/7, 2012 at 8:3 Comment(1)
Assuming your number of clusters is the same as the number of ground truth labels, then I believe you want to score your model according to a metric like the ARI (adjusted Rand index).Nightstick

I stumbled on a similar problem a couple of months ago while doing clustering. I did not search very long for built-in solutions (although I am sure they must exist) and ended up writing my own little script to best match the found labels to the ground truth. The code is very crude, but it should get you started.

It is based on trying all possible rearrangements of the labels to see which best fits the truth vector. That means that given a clustering result yte = [3 3 2 1] with ground truth y = [1 1 2 3], the script will try to match [3 3 2 1], [3 3 1 2], [2 2 3 1], [2 2 1 3], [1 1 2 3] and [1 1 3 2] against y to find the best match.

This relies on the built-in function perms(), which cannot handle more than 10 unique clusters. The code also tends to be slow for 7-10 unique clusters, as the complexity grows factorially.

function [accuracy, true_labels, CM] = calculateAccuracy(yte, y)
%# Function for calculating clustering accuracy and matching found 
%# labels with true labels. Assumes yte and y both are Nx1 vectors with
%# clustering labels. Does not support fuzzy clustering.
%#
%# Algorithm is based on trying out all reorderings of cluster labels, 
%# e.g. if yte = [1 2 2], try [1 2 2] and [2 1 1] to see which fits 
%# the truth vector the best. Since this approach makes use of perms(),
%# the code will not run for unique(yte) greater than 10, and it will slow
%# down significantly for number of clusters greater than 7.
%#
%# Input:
%#   yte - result from clustering (y-test)
%#   y   - truth vector
%#
%# Output:
%#   accuracy    -   Overall accuracy for entire clustering (OA). For
%#                   overall error, use OE = 1 - OA.
%#   true_labels -   Vector giving the label rearrangement which best 
%#                   matches the truth vector (y).
%#   CM          -   Confusion matrix. If unique(yte) = 4, produce a
%#                   4x4 matrix of the number of different errors and  
%#                   correct clusterings done.

N = length(y);

cluster_names = unique(yte);
accuracy = 0;
maxInd = 1;

perm = perms(unique(y));
[pN pM] = size(perm);

true_labels = y;

for i=1:pN
    flipped_labels = zeros(1,N);
    for cl = 1 : pM
        flipped_labels(yte==cluster_names(cl)) = perm(i,cl);
    end

    testAcc = sum(flipped_labels == y')/N;
    if testAcc > accuracy
        accuracy = testAcc;
        maxInd = i;
        true_labels = flipped_labels;
    end

end

CM = zeros(pM,pM);
for rc = 1 : pM
    for cc = 1 : pM
        CM(rc,cc) = sum( ((y'==rc) .* (true_labels==cc)) );
    end
end

Example:

[acc newLabels CM] = calculateAccuracy([3 2 2 1 2 3]',[1 2 2 3 3 3]')

acc =

0.6667


newLabels =

 1     2     2     3     2     1


CM =

 1     0     0
 0     2     0
 1     1     1
Exostosis answered 27/7, 2012 at 8:53 Comment(2)
Thank You! This seems to work a treat. I'm using 9 clusters, but the factorial doesn't seem a problem. Cheers :)Kerch
I managed to find a much faster implementation, maybe this will help someone someday! mathworks.com/matlabcentral/fileexchange/…Kerch

You might want to look into a more flexible way of evaluating clusterings, for example pair-counting metrics.

The "class = cluster" assumption is typical for people coming from machine learning into clustering. But instead you should allow for some classes consisting of multiple clusters, or multiple classes falling into the same cluster. These are the interesting situations a clustering algorithm should actually detect.
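As a minimal illustration of pair counting, here is a sketch of the (unadjusted) Rand index in pure Python; the function name rand_index is made up for this example, and for real use scikit-learn's adjusted_rand_score is the practical choice:

```python
# Minimal pair-counting sketch of the (unadjusted) Rand index.
# Two labelings agree on a pair of points if both put the pair in the
# same cluster, or both put it in different clusters -- so the score
# is invariant to how the clusters happen to be numbered, and no
# explicit label matching step is needed.
from itertools import combinations

def rand_index(labels_a, labels_b):
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)

# Identical partitions under different numbering score 1.0:
print(rand_index([1, 1, 2, 3], [3, 3, 1, 2]))  # -> 1.0
```

Because it only counts pair agreements, renumbering either labeling leaves the score unchanged, which is exactly why no "classes to clusters" matching is required with this family of metrics.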

Primordium answered 27/7, 2012 at 20:55 Comment(0)

I needed this exact thing for Python and converted the code posted by Vidar (the accepted answer). I am sharing the code for anyone interested. I renamed the variables and removed the confusion matrix (most machine-learning libraries have built-in functions for that anyway). I noticed the faster implementation linked by Vincent (http://www.mathworks.com/matlabcentral/fileexchange/32197-clustering-results-measurement) too late; it would probably be better to adapt that one for Python.

# tested with Python 3.6
import numpy as np
from itertools import permutations

def remap_labels(pred_labels, true_labels):
    """Rename prediction labels (clustered output) to best match true labels."""
    pred_labels, true_labels = np.array(pred_labels), np.array(true_labels)
    assert pred_labels.ndim == 1 == true_labels.ndim
    assert len(pred_labels) == len(true_labels)
    cluster_names = np.unique(pred_labels)
    accuracy = 0

    perms = np.array(list(permutations(np.unique(true_labels))))

    remapped_labels = true_labels
    for perm in perms:
        flipped_labels = np.zeros(len(true_labels))
        for label_index, label in enumerate(cluster_names):
            flipped_labels[pred_labels == label] = perm[label_index]

        testAcc = np.sum(flipped_labels == true_labels) / len(true_labels)
        if testAcc > accuracy:
            accuracy = testAcc
            remapped_labels = flipped_labels

    return accuracy, remapped_labels
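For many clusters the factorial search becomes infeasible. As a sketch of a faster approach (my own, not the file-exchange code linked above), one can build the contingency table and solve the label assignment with the Hungarian algorithm via scipy.optimize.linear_sum_assignment, which runs in polynomial time. The helper name remap_labels_hungarian is made up here, and it assumes the same number of predicted clusters and true classes, as in the answers above:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def remap_labels_hungarian(pred_labels, true_labels):
    """Match cluster labels to true labels without trying all permutations.

    Builds the contingency table and picks the one-to-one label mapping
    that maximizes the number of agreements (Hungarian algorithm).
    Assumes the number of predicted clusters equals the number of classes.
    """
    pred_labels = np.asarray(pred_labels)
    true_labels = np.asarray(true_labels)
    pred_names = np.unique(pred_labels)
    true_names = np.unique(true_labels)
    # contingency[i, j] = how often pred cluster i coincides with true class j
    contingency = np.array([
        [np.sum((pred_labels == p) & (true_labels == t)) for t in true_names]
        for p in pred_names
    ])
    # negate, since linear_sum_assignment minimizes cost
    row_ind, col_ind = linear_sum_assignment(-contingency)
    mapping = {pred_names[r]: true_names[c] for r, c in zip(row_ind, col_ind)}
    remapped = np.array([mapping[p] for p in pred_labels])
    accuracy = np.mean(remapped == true_labels)
    return accuracy, remapped
```

On the example from the accepted answer, remap_labels_hungarian([3, 2, 2, 1, 2, 3], [1, 2, 2, 3, 3, 3]) reproduces the same result (accuracy 0.6667, labels [1 2 2 3 2 1]), but the cost grows polynomially rather than factorially in the number of clusters.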
Pelham answered 11/1, 2020 at 1:54 Comment(0)

Following the idea of @Vidar I have written my own code for finding the best label correspondence. I assume that the labels are integers 0..K-1 and find the renaming of labels that gives the most matches. The code relies on the function itertools.permutations (suggested by this answer):

import numpy as np
from itertools import permutations

def match_labels(labels, labels0, K):
    """Find the renaming of `labels` (integers 0..K-1) that best matches `labels0`."""
    labels = np.asarray(labels)
    labels0 = np.asarray(labels0)
    newlabels, matchtable = labels, list(range(K))  # identity mapping as fallback
    nmatch = (labels == labels0).sum()
    for perm in permutations(range(K)):
        labs = np.array([perm[l] for l in labels])
        newmatch = (labs == labels0).sum()
        if newmatch > nmatch:
            nmatch = newmatch
            newlabels = labs
            matchtable = list(perm)
    return newlabels, matchtable

This thread suggests more advanced techniques for verifying the quality of clustering. The first to try is probably the Rand index, available in many libraries (e.g. adjusted_rand_score in scikit-learn).

Drastic answered 14/1, 2022 at 15:16 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.