Scaling the testing data for LIBSVM: MATLAB implementation

2

10

I currently use the MATLAB version of the LIBSVM support vector machine to classify my data. The LIBSVM documentation mentions that scaling before applying SVM is very important and we have to use the same method to scale both training and testing data.

The "same method of scaling" is explained as follows: for example, suppose that we scaled the first attribute of training data from [-10, +10] to [-1, +1]. If the first attribute of testing data lies in the range [-11, +8], we must scale the testing data to [-1.1, +0.8].
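
To make the quoted example concrete, here is a minimal sketch. The variable names (train_attr, test_attr, lo, hi) are made up for illustration and do not come from LIBSVM. It maps a training attribute from [-10, +10] to [-1, +1] and then applies the same mapping to test values in [-11, +8]:

```matlab
% Hypothetical toy attribute: training values span [-10, +10].
train_attr = [-10; 0; 10];
test_attr  = [-11; 8];

% Target range for the scaled training data.
lo = -1;
hi = 1;

% The scaling parameters are estimated from the training data only.
train_min = min(train_attr);
train_max = max(train_attr);

% One affine map, reused unchanged for both sets.
scale = @(x) lo + (hi - lo) * (x - train_min) / (train_max - train_min);

train_scaled = scale(train_attr);   % lies exactly in [-1, +1]
test_scaled  = scale(test_attr);    % approximately [-1.1; 0.8]
```

Because the test values are pushed through the training map, they may land outside the target range, which is expected.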

Scaling the training data to the range [0,1] can be done with the following MATLAB code:

(data - repmat(min(data,[],1),size(data,1),1))*spdiags(1./(max(data,[],1)-min(data,[],1))',0,size(data,2),size(data,2))

But I don't know how to scale the testing data correctly.

Thank you very much for your help.

Terzetto answered 7/4, 2012 at 14:44 Comment(1)
My question is: if the training data in the range [a,b] is normalized to the range [0,1], to which range is the test data in the range [c,d] normalized? Terzetto
16

The code you give essentially subtracts the minimum and then divides by the range. You need to store the minimum and range of each training data feature and reuse those same values when scaling the test data.

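% Scaling parameters are computed from the training data only.
minimums = min(data, [], 1);
ranges = max(data, [], 1) - minimums;

% Scale the training data to [0, 1].
data = (data - repmat(minimums, size(data, 1), 1)) ./ repmat(ranges, size(data, 1), 1);

% Apply the same minimums and ranges to the test data.
test_data = (test_data - repmat(minimums, size(test_data, 1), 1)) ./ repmat(ranges, size(test_data, 1), 1);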
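
As a rough illustration of the same idea on made-up numbers, the sketch below uses bsxfun instead of repmat (an equivalent formulation, not a change to the method). The toy matrices and the names scaled_train and scaled_test are assumptions for the example only:

```matlab
% Hypothetical toy data: rows are observations, columns are features.
data      = [0 10; 5 20; 10 30];
test_data = [2 15; 12 5];

% Scaling parameters come from the training data only.
minimums = min(data, [], 1);
ranges   = max(data, [], 1) - minimums;

% bsxfun expands the 1-by-n row vectors across all rows.
scaled_train = bsxfun(@rdivide, bsxfun(@minus, data, minimums), ranges);
scaled_test  = bsxfun(@rdivide, bsxfun(@minus, test_data, minimums), ranges);

disp(scaled_train);
disp(scaled_test);
```

Here the second test row falls outside [0, 1] in both columns, because its values lie outside the range seen in training. That is the intended behaviour when the training parameters are reused.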
Maximinamaximize answered 7/4, 2012 at 14:50 Comment(3)
@Richante: Your answer is very useful. I just want to clarify: is "data" here the training data and "test_data" the testing data? Erythrite
#43408531 Mezzosoprano
I'm sorry, but your code will output NaN for columns in which all of the observations have the same value (which may happen if your data is sparse). For example, data = [1 2 3; 5 2 8; 7 2 100] Auteur
0

Richante's code is, unfortunately, not correct when a column has the same value in every observation (which may happen if the data is sparse): the range of such a column is 0, so the division evaluates 0/0 and yields NaN. An example:

>> data = [1 2 3; 5 2 8; 7 2 100]

data =

     1     2     3
     5     2     8
     7     2   100

>> test_data = [1 2 3; 4 5 6; 7 8 9];
>> minimums = min(data,[],1);
>> ranges = max(data, [], 1) - minimums;
>> data = (data - repmat(minimums, size(data, 1), 1)) ./ repmat(ranges, size(data, 1), 1);
>> data

data =

         0       NaN         0
    0.6667       NaN    0.0515
    1.0000       NaN    1.0000

So you have to check whether any column contains only a single value. But what if a column has only one value across the entire training set while the test set contains several values for it? And what about the leave-one-out scenario, where the test set holds a single observation: what if all the values in a column of the training set are 0 and the corresponding value in the test set is 100? These are degenerate cases, but they can happen. When I checked the file svm-scale.c in the LIBSVM library, I noticed this part:

void output(int index, double value)
{
    /* skip single-valued attribute */
    if(feature_max[index] == feature_min[index])
        return;

    if(value == feature_min[index])
        value = lower;
    else if(value == feature_max[index])
        value = upper;
    else
        value = lower + (upper-lower) * 
            (value-feature_min[index])/
            (feature_max[index]-feature_min[index]);

    if(value != 0)
    {
        printf("%d:%g ",index, value);
        new_num_nonzeros++;
    }
}

So svm-scale skips single-valued attributes and writes nothing for them. Should we ignore these cases in the same way? I don't really know. I'm not an authority on this issue, so I'm going to wait for another answer, preferably from LIBSVM's authors themselves, to clear things up.
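
One possible way to mirror that skip in MATLAB is sketched below. It rests on an assumption: a skipped attribute is absent from LIBSVM's sparse output and is therefore treated as 0, so the sketch sets training-constant columns to 0 in both sets. The names constant, safe_ranges, scaled_train and scaled_test are made up for the example. This is not an authoritative answer to the question above.

```matlab
data      = [1 2 3; 5 2 8; 7 2 100];
test_data = [1 2 3; 4 5 6; 7 8 9];

% Scaling parameters come from the training data only.
minimums = min(data, [], 1);
ranges   = max(data, [], 1) - minimums;

% Columns that are constant in the training set (range == 0).
constant = (ranges == 0);

% Use a dummy divisor of 1 for those columns to avoid 0/0 = NaN.
safe_ranges = ranges;
safe_ranges(constant) = 1;

scaled_train = bsxfun(@rdivide, bsxfun(@minus, data, minimums), safe_ranges);
scaled_test  = bsxfun(@rdivide, bsxfun(@minus, test_data, minimums), safe_ranges);

% Assumption: mimic the skip by zeroing these columns in both sets.
scaled_train(:, constant) = 0;
scaled_test(:, constant)  = 0;
```

With this choice, the test values 5 and 8 in the second column are discarded, since the training set gives no information about how that feature varies. Whether that is the right trade-off depends on the data.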

Auteur answered 27/1, 2018 at 10:48 Comment(0)
