I stumbled on a similar problem a couple of months ago while doing clustering. I did not search for built-in solutions for very long (although I am sure they must exist) and ended up writing my own little script that matches the labels I found to the ground truth as well as possible. The code is very crude, but it should get you started.
It is based on trying all possible rearrangements of the labels to see which one best fits the truth vector. That means that given a clustering result yte = [3 3 2 1] with ground truth y = [1 1 2 3], the script will try to match [3 3 2 1], [3 3 1 2], [2 2 3 1], [2 2 1 3], [1 1 2 3] and [1 1 3 2] with y to find the best match.
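To make the enumeration concrete, here is a minimal sketch that builds all six relabelings for the example above and picks the one that agrees most with y. The variable names (names, P, candidates, matches, best) are only illustrative and do not appear in the function further down.

```matlab
% Enumerate every relabeling of a small clustering result (illustrative sketch).
yte = [3 3 2 1];                  % clustering result from the example
y   = [1 1 2 3];                  % ground truth from the example
names = unique(yte);              % cluster labels present: [1 2 3]
P = perms(names);                 % one candidate relabeling per row
nP = size(P, 1);
candidates = zeros(nP, numel(yte));
for i = 1:nP
    for cl = 1:numel(names)
        % Every point in cluster names(cl) receives the label P(i,cl)
        candidates(i, yte == names(cl)) = P(i, cl);
    end
end
% Count agreements with the truth for each candidate relabeling
matches = sum(candidates == repmat(y, nP, 1), 2);
[bestCount, bestRow] = max(matches);
best = candidates(bestRow, :);    % the relabeling that fits y best
disp(candidates)
disp(best)
```

For this input, best comes out as [1 1 2 3], which agrees with y in all four positions.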
This is based on the built-in function perms(), which cannot handle more than 10 unique clusters. The code also tends to be slow for 7-10 unique clusters, since the number of rearrangements grows factorially with the number of clusters.
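As a rough illustration of that growth, the following sketch prints how many relabelings have to be tested for 1 to 10 clusters. It only counts permutations and does not measure actual running time, which will depend on the data size and the machine.

```matlab
% Number of label rearrangements to test for k unique clusters (k!).
k = 1:10;
nPerms = factorial(k);
% fprintf walks the 2-by-10 matrix column by column: k(1), nPerms(1), k(2), ...
fprintf('%2d clusters: %8d relabelings\n', [k; nPerms]);
```

At 7 clusters there are 5040 relabelings to check; at 10 there are 3628800, each of which requires a pass over all N points.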
function [accuracy, true_labels, CM] = calculateAccuracy(yte, y)
%# Function for calculating clustering accuracy and matching found
%# labels with true labels. Assumes yte and y both are Nx1 vectors with
%# clustering labels. Does not support fuzzy clustering.
%#
%# Algorithm is based on trying out all reorderings of cluster labels,
%# e.g. if yte = [1 2 2], try [1 2 2] and [2 1 1] to see which fits
%# the truth vector the best. Since this approach makes use of perms(),
%# the code will not run for more than 10 unique labels, and it will slow
%# down significantly when the number of clusters is greater than 7.
%#
%# Input:
%#   yte - result from clustering (y-test)
%#   y   - truth vector
%#
%# Output:
%#   accuracy    - Overall accuracy for the entire clustering (OA). For
%#                 overall error, use OE = 1 - OA.
%#   true_labels - Vector giving the label rearrangement which best
%#                 matches the truth vector (y).
%#   CM          - Confusion matrix. If yte has 4 unique labels, produce
%#                 a 4x4 matrix counting the different errors and
%#                 correct clusterings.
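N = length(y);
cluster_names = unique(yte);     % labels found by the clustering
truth_names = unique(y);         % labels used in the ground truth
accuracy = 0;
maxInd = 1;
perm = perms(truth_names);       % each row is one candidate relabeling
[pN, pM] = size(perm);
true_labels = y;
for i = 1:pN
    % Relabel the clusters according to the i-th permutation
    flipped_labels = zeros(1,N);
    for cl = 1:pM
        flipped_labels(yte==cluster_names(cl)) = perm(i,cl);
    end
    % Keep this relabeling if it agrees with the truth more often
    testAcc = sum(flipped_labels == y')/N;
    if testAcc > accuracy
        accuracy = testAcc;
        maxInd = i;
        true_labels = flipped_labels;
    end
end
% Confusion matrix: rows are true labels, columns are matched labels.
% Indexing through truth_names avoids assuming the labels are 1..pM.
CM = zeros(pM,pM);
for rc = 1:pM
    for cc = 1:pM
        CM(rc,cc) = sum( (y'==truth_names(rc)) & (true_labels==truth_names(cc)) );
    end
end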
Example:
[acc newLabels CM] = calculateAccuracy([3 2 2 1 2 3]',[1 2 2 3 3 3]')
acc =
0.6667
newLabels =
1 2 2 3 2 1
CM =
1 0 0
0 2 0
1 1 1
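As a quick sanity check on the example, the overall accuracy can be recovered from the confusion matrix alone: the diagonal holds the correctly matched points, so OA is the trace divided by the total count. The sketch below uses the CM printed above; OA, OE and perClass are just illustrative variable names.

```matlab
% Recover the accuracy figures from the example's confusion matrix.
CM = [1 0 0; 0 2 0; 1 1 1];          % rows: true labels, columns: matched labels
OA = trace(CM) / sum(CM(:));         % overall accuracy: 4 correct out of 6
OE = 1 - OA;                         % overall error
perClass = diag(CM) ./ sum(CM, 2);   % fraction of each true class recovered
fprintf('OA = %.4f, OE = %.4f\n', OA, OE);
disp(perClass')
```

This prints OA = 0.6667, matching acc above, and shows that the errors all come from the third true class, of which only one point in three was matched.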