10 fold cross-validation in one-against-all SVM (using LibSVM)
I want to do a 10-fold cross-validation in my one-against-all support vector machine classification in MATLAB.

I tried to combine these two related answers:

But as I'm new to MATLAB and its syntax, I haven't managed to make it work so far.

On the other hand, the LibSVM README mentions cross-validation only in the following few lines, and I couldn't find any related example there:

option -v randomly splits the data into n parts and calculates cross validation accuracy/mean squared error on them.

See libsvm FAQ for the meaning of outputs.
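(For reference, using that option looks like this; a minimal sketch, assuming the libsvm MEX interface is on the MATLAB path as svmtrain and that labels/data hold the class labels and instances:

```matlab
%# with -v, svmtrain performs 10-fold cross-validation and returns the
%# cross-validation accuracy (a scalar) instead of a model struct
acc = svmtrain(labels, data, '-s 0 -t 2 -c 1 -v 10');
```

)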

Could anyone provide me with an example of 10-fold cross-validation combined with one-against-all classification?

Gaivn answered 24/12, 2012 at 18:45 Comment(2)
as noted by carlosdc, the second link showcases the SVM functions in the Bioinformatics toolbox (not libsvm) – Dehydrogenase
FYI, starting with R2013a, MATLAB's SVM functions were moved from the Bioinformatics toolbox to the Statistics toolbox (where I think they should have been in the first place!) – Dehydrogenase
There are mainly two reasons we do cross-validation:

  • as a testing method that gives us a nearly unbiased estimate of the generalization power of our model (by avoiding overfitting)
  • as a way of doing model selection (e.g. finding the best C and gamma parameters over the training data; see this post for an example)

For the first case, which is the one we are interested in here, the process involves training k models (one per fold) and then training one final model over the entire training set. We report the average accuracy over the k folds.
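The k-fold procedure above can be sketched as follows (a minimal sketch; train_fn and test_fn are hypothetical placeholders for your training/evaluation routines, and cvpartition is from the Statistics toolbox):

```matlab
k = 10;
cv = cvpartition(labels, 'kfold',k);    %# stratified k-fold partition
acc = zeros(k,1);
for i=1:k
    %# train on k-1 folds, evaluate on the held-out fold
    mdl = train_fn(labels(cv.training(i)), data(cv.training(i),:));
    acc(i) = test_fn(labels(cv.test(i)), data(cv.test(i),:), mdl);
end
avgAcc = mean(acc);    %# reported cross-validation accuracy
```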

Now since we are using the one-vs-all approach to handle the multi-class problem, each model consists of N support vector machines (one for each class).


The following are wrapper functions implementing the one-vs-all approach:

function mdl = libsvmtrain_ova(y, X, opts)
    if nargin < 3, opts = ''; end

    %# classes
    labels = unique(y);
    numLabels = numel(labels);

    %# train one-against-all models
    models = cell(numLabels,1);
    for k=1:numLabels
        models{k} = libsvmtrain(double(y==labels(k)), X, strcat(opts,' -b 1 -q'));
    end
    mdl = struct('models',{models}, 'labels',labels);
end

function [pred,acc,prob] = libsvmpredict_ova(y, X, mdl)
    %# classes
    labels = mdl.labels;
    numLabels = numel(labels);

    %# get probability estimates of test instances using each 1-vs-all model
    prob = zeros(size(X,1), numLabels);
    for k=1:numLabels
        [~,~,p] = libsvmpredict(double(y==labels(k)), X, mdl.models{k}, '-b 1 -q');
        prob(:,k) = p(:, mdl.models{k}.Label==1);
    end

    %# predict the class with the highest probability
    [~,pred] = max(prob, [], 2);
    %# compute classification accuracy
    acc = mean(pred == y);
end

And here are functions to support cross-validation:

function acc = libsvmcrossval_ova(y, X, opts, nfold, indices)
    if nargin < 3, opts = ''; end
    if nargin < 4, nfold = 10; end
    if nargin < 5, indices = crossvalidation(y, nfold); end

    %# N-fold cross-validation testing
    acc = zeros(nfold,1);
    for i=1:nfold
        testIdx = (indices == i); trainIdx = ~testIdx;
        mdl = libsvmtrain_ova(y(trainIdx), X(trainIdx,:), opts);
        [~,acc(i)] = libsvmpredict_ova(y(testIdx), X(testIdx,:), mdl);
    end
    acc = mean(acc);    %# average accuracy
end

function indices = crossvalidation(y, nfold)
    %# stratified n-fold cross-validation
    %#indices = crossvalind('Kfold', y, nfold);  %# Bioinformatics toolbox
    cv = cvpartition(y, 'kfold',nfold);          %# Statistics toolbox
    indices = zeros(size(y));
    for i=1:nfold
        indices(cv.test(i)) = i;
    end
end

Finally, here is a simple demo to illustrate the usage:

%# load dataset
S = load('fisheriris');
data = zscore(S.meas);
labels = grp2idx(S.species);

%# cross-validate using one-vs-all approach
opts = '-s 0 -t 2 -c 1 -g 0.25';    %# libsvm training options
nfold = 10;
acc = libsvmcrossval_ova(labels, data, opts, nfold);
fprintf('Cross Validation Accuracy = %.4f%%\n', 100*mean(acc));

%# compute final model over the entire dataset
mdl = libsvmtrain_ova(labels, data, opts);

Compare that against the one-vs-one approach, which libsvm uses by default:

acc = libsvmtrain(labels, data, sprintf('%s -v %d -q',opts,nfold));
model = libsvmtrain(labels, data, strcat(opts,' -q'));
Dehydrogenase answered 26/12, 2012 at 14:36 Comment(4)
note that I have renamed the libsvm functions to libsvmtrain and libsvmpredict to avoid name collisions with functions of the same name that are part of the Bioinformatics toolbox (namely svmtrain) – Dehydrogenase
In the libsvmtrain_ova function, I get the error Undefined function or method 'libsvmtrain' for input arguments of type 'double'. at this line: models{k} = libsvmtrain(double(y==labels(k)), X, strcat(opts,' -b 1 -q')); – Gaivn
@Ezati: as I said in the comment above, I renamed the libsvm MEX functions to avoid confusion with the Bioinformatics toolbox. In your case, you can simply replace libsvmtrain with svmtrain and libsvmpredict with svmpredict in my code above. – Dehydrogenase
Excuse me, I didn't notice your comment at first... now everything is OK :) Thank you very much, I wish I could give you a +100 – Gaivn
It may be confusing you that one of the two linked questions is not about LIBSVM. You should adapt this answer and ignore the other.

You should select the folds, and do the rest exactly as in the linked question. Assume the data has been loaded into data and the labels into labels:

n = size(data,1);
numLabels = numel(unique(labels));   %# number of classes
ns = floor(n/10);
for fold=1:10
    if fold==1
        testindices = ((fold-1)*ns+1):fold*ns;
        trainindices = fold*ns+1:n;
    elseif fold==10
        testindices = ((fold-1)*ns+1):n;
        trainindices = 1:(fold-1)*ns;
    else
        testindices = ((fold-1)*ns+1):fold*ns;
        trainindices = [1:(fold-1)*ns, fold*ns+1:n];
    end
    %# use testindices only for testing and trainindices only for training
    trainLabel = labels(trainindices);
    trainData = data(trainindices,:);
    testLabel = labels(testindices);
    testData = data(testindices,:);
    %# train one-against-all models
    model = cell(numLabels,1);
    for k=1:numLabels
        model{k} = svmtrain(double(trainLabel==k), trainData, '-c 1 -g 0.2 -b 1');
    end

    %# get probability estimates of test instances using each model
    prob = zeros(size(testData,1),numLabels);
    for k=1:numLabels
        [~,~,p] = svmpredict(double(testLabel==k), testData, model{k}, '-b 1');
        prob(:,k) = p(:,model{k}.Label==1);    %# probability of class==k
    end

    %# predict the class with the highest probability
    [~,pred] = max(prob,[],2);
    acc = sum(pred == testLabel) ./ numel(testLabel)    %# accuracy
    C = confusionmat(testLabel, pred)                   %# confusion matrix
end
Parodist answered 26/12, 2012 at 7:11 Comment(8)
at the line prob = zeros(numTest,numLabels); do you mean ns by numTest? – Gaivn
no, I meant the number of data points on which you're testing. I've edited the code. – Parodist
So what about the -v option? Don't we need to use it? – Gaivn
From your question, it seems like you need one-vs-all, not one-vs-one (which is what -v implements in the case of a multiclass problem) – Parodist
But here it says that -v is used for cross-validation, not one-vs-one nor one-vs-all. Am I right? – Gaivn
Cross-validation for multiclass can be implemented in one-vs-one or in one-vs-all. LIBSVM does cross-validation in multi-class in one-vs-one. – Parodist
@carlosdc: note that it is usually preferred to do a stratified cross-validation where each fold contains roughly the same proportion of each class. You can use the CROSSVALIND function from the Bioinformatics Toolbox or the CVPARTITION class from the Statistics Toolbox. Also you might want to keep track of acc for each fold (stored in a vector), so that you can report the average accuracy at the end. – Dehydrogenase
@Parodist - Thank you for your effort :) – Gaivn
