SVM scaling input values

I am using libSVM. Say my feature values are in the following format:

                         instance1 : f11, f12, f13, f14
                         instance2 : f21, f22, f23, f24
                         instance3 : f31, f32, f33, f34
                         instance4 : f41, f42, f43, f44
                         ..............................
                         instanceN : fN1, fN2, fN3, fN4

I think there are two kinds of scaling that can be applied:

  1. Scale each instance vector (each row) so that it has zero mean and unit variance:

        ((f11, f12, f13, f14) - mean(f11, f12, f13, f14)) ./ std(f11, f12, f13, f14)
    
  2. Scale each column of the above matrix to a fixed range, for example [-1, 1] (a rough sketch of both options is below).
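
To make the two options concrete, here is a minimal NumPy sketch of what I mean (the matrix X and its values are just made up for illustration):

    import numpy as np

    # Illustrative feature matrix: rows are instances, columns are features.
    X = np.array([[25.0, 30000.0, 3.2, 1.0],
                  [40.0, 90000.0, 2.1, 0.0],
                  [31.0, 52000.0, 4.7, 1.0]])

    # Option 1: scale each instance (row) to zero mean and unit variance.
    row_scaled = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)

    # Option 2: scale each column linearly into the range [-1, 1].
    col_min, col_max = X.min(axis=0), X.max(axis=0)
    col_scaled = -1.0 + 2.0 * (X - col_min) / (col_max - col_min)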

According to my experiments with the RBF kernel (libSVM), I found that the second scaling (2) improves the results by about 10%. I do not understand why (2) gives me improved results.

Could anybody explain the reason for applying scaling, and why the second option gives improved results?

Saxton answered 15/3, 2013 at 15:36 Comment(3)
Before trying to answer this... is each column in the same range? For instance, are fn1 and fnm both in [0,100]?Prepossess
No, it could be any range. For example, the first column represents age, the second represents salary, etc.Saxton
Well, depending on how you calculate the mean and the standard deviation, they could be biased by the feature with the biggest range. On the other hand, I don't think it is guaranteed that that scaling would end up in the range [-1,1], which is the numerically friendly range for RBF in libSVM.Prepossess

The standard thing to do is to make each dimension (or attribute, or column in your example) have zero mean and unit variance.

This brings each dimension of the input onto the same scale. From http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf:

The main advantage of scaling is to avoid attributes in greater numeric ranges dominating those in smaller numeric ranges. Another advantage is to avoid numerical difficulties during the calculation. Because kernel values usually depend on the inner products of feature vectors, e.g. the linear kernel and the polynomial kernel, large attribute values might cause numerical problems. We recommend linearly scaling each attribute to the range [-1,+1] or [0,1].
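
As a rough illustration (using scikit-learn here rather than libSVM's own svm-scale tool; the array names and data are made up), both per-column standardization and linear scaling to [-1, +1] are one-liners. The important detail is to compute the scaling parameters on the training data only and reuse them on the test data:

    import numpy as np
    from sklearn.preprocessing import StandardScaler, MinMaxScaler

    X_train = np.random.rand(100, 4) * [100, 1e5, 5, 1]  # made-up training data
    X_test = np.random.rand(20, 4) * [100, 1e5, 5, 1]    # made-up test data

    # Per-column zero mean / unit variance.
    std = StandardScaler().fit(X_train)  # learns per-column mean and std from the training set
    X_train_std, X_test_std = std.transform(X_train), std.transform(X_test)

    # Per-column linear scaling to [-1, +1], as the guide recommends.
    mm = MinMaxScaler(feature_range=(-1, 1)).fit(X_train)
    X_train_mm, X_test_mm = mm.transform(X_train), mm.transform(X_test)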

Uvula answered 15/3, 2013 at 20:36 Comment(1)
Yes, scaling columns is the normal way to do it. Scaling rows doesn't really make sense: if your only two features were age (in years) and salary (in thousands of dollars), then a 15-year-old making $15,000 and a 60-year-old making $60,000 would be made to appear exactly identical!Schoolgirl

I believe this depends a lot on your original data.

If your original data has SOME extreme values in some columns, then in my opinion you lose some definition when scaling linearly, for example into the range [-1,1].

Let's say you have a column where 90% of the values are between 100 and 500, and in the remaining 10% the values are as low as -2000 and as high as +2500.

If you scale this data linearly into [-1, 1], then you'll have:

    -2000 -> -1  ## <- The min in your scaled data
    +2500 -> +1  ## <- The max in your scaled data
      100 -> -0.06666666666666665
      234 -> -0.007111111111111068
      500 ->  0.11111111111111116
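
For reference, these numbers come from the usual linear (min-max) mapping from [min, max] onto [-1, 1]; a quick Python check (the helper name is just illustrative):

    # Map x linearly from [lo, hi] onto [-1, 1].
    def scale(x, lo=-2000.0, hi=2500.0):
        return -1.0 + 2.0 * (x - lo) / (hi - lo)

    for x in (100, 234, 500):
        print(x, scale(x))  # -> roughly -0.0667, -0.0071, 0.1111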

You could argue that the discernibility between what was originally 100 and what was 500 is much smaller in the scaled data than it was in the original data.

In the end, I believe it very much comes down to the specifics of your data, and I believe the 10% improvement in performance is largely coincidental; you will certainly not see a difference of this magnitude in every dataset on which you try both scaling methods.

At the same time, in the paper linked in the other answer, you can clearly see that the authors recommend scaling the data linearly.

I hope someone finds this useful!

Chondrule answered 7/9, 2016 at 18:33 Comment(1)
Yes. You can cap/floor to remove extreme values and then apply some transform which dilates the range of data in which I suspect most discrimination occurs. I have seen real examples where this improves things a lot.Depersonalization

The accepted answer speaks of "Standard Scaling" (zero mean, unit variance per column), which is not practical for high-dimensional data stored in sparse matrices (text data is a typical use case), because subtracting the per-column mean destroys sparsity; in such cases, you may resort to "Max Scaling" and its variants, which work directly with sparse matrices.
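
A minimal sketch of what that can look like with scikit-learn's MaxAbsScaler (the toy matrix is made up); it divides each column by its maximum absolute value, so it needs no centering and keeps the matrix sparse:

    import scipy.sparse as sp
    from sklearn.preprocessing import MaxAbsScaler

    # Toy sparse matrix, e.g. rows are documents and columns are term counts.
    X = sp.csr_matrix([[0.0, 3.0, 0.0, 10.0],
                       [2.0, 0.0, 0.0, 4.0],
                       [0.0, 1.0, 5.0, 0.0]])

    scaler = MaxAbsScaler().fit(X)   # per-column max absolute value; no mean subtraction
    X_scaled = scaler.transform(X)   # still sparse, values in [-1, 1]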

Keithakeithley answered 9/3, 2022 at 12:0 Comment(0)
