Scaling data in scikit-learn SVM
While libsvm provides tools for scaling data, I find no way to scale my data with scikit-learn (whose SVC classifier should be based on libsvm).

Basically I want to use 4 features, of which 3 range from 0 to 1 and the last one is a "big" highly variable number.

If I include the fourth feature in libsvm (using the easy.py script, which scales my data automatically), I get very good results (96% accuracy). If I include the fourth feature in scikit-learn, the accuracy drops to ~78% — but if I exclude it, I get the same results as in libsvm when excluding that feature. Therefore I am pretty sure the problem is missing scaling.

How do I replicate programmatically (i.e. without calling svm-scale) the scaling process of SVM?

Jeanicejeanie answered 10/11, 2012 at 17:3 Comment(0)

You have that functionality in sklearn.preprocessing:

>>> from sklearn import preprocessing
>>> X = [[ 1., -1.,  2.],
...      [ 2.,  0.,  0.],
...      [ 0.,  1., -1.]]
>>> X_scaled = preprocessing.scale(X)

>>> X_scaled                                          
array([[ 0.  ..., -1.22...,  1.33...],
       [ 1.22...,  0.  ..., -0.26...],
       [-1.22...,  1.22..., -1.06...]])

The data will then have zero mean and unit variance.

Bojorquez answered 10/11, 2012 at 17:8 Comment(4)
Good to know, thanks. Should I standardize the test data together with the training data and split them afterwards, or should I scale the test data by itself?Jeanicejeanie
That is mentioned in the documentation. I guess you should do it separately; otherwise the training data would be influenced by the test samples. With the Scaler class you can compute the mean and standard deviation of the training data and then apply the same transformation to the test data.Bojorquez
You should use a Scaler for this, not the freestanding function scale. A Scaler can be plugged into a Pipeline, e.g. scaling_svm = Pipeline([("scaler", Scaler()), ("svm", SVC(C=1000))]).Piddle
Does the Scaler do standardization separately to training and testing data in Pipeline? Or it firstly standardize the whole data set before feeding to svm?Argumentative
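The Pipeline approach from the comments can be sketched as follows. This is a minimal example, not from the original answers; the synthetic data (`X_train`, `y_train`) is invented for illustration, and `StandardScaler` is the current name for the `Scaler` class mentioned above:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data mimicking the question: three features in [0, 1]
# plus one "big" highly variable feature.
rng = np.random.RandomState(0)
X_train = rng.rand(20, 4)
X_train[:, 3] *= 1e6
y_train = (X_train[:, 0] > 0.5).astype(int)

# The scaler learns mean/std from the data it is fit on; inside a
# Pipeline used with cross-validation, it is refit on each training
# fold, so the test folds never leak into the scaling statistics.
scaling_svm = Pipeline([("scaler", StandardScaler()), ("svm", SVC(C=1000))])
scaling_svm.fit(X_train, y_train)

# predict() scales new data with the training statistics automatically
predictions = scaling_svm.predict(X_train)
```

Calling `fit` on the Pipeline standardizes only the training data; at predict time the same training-set mean and standard deviation are applied to the new samples, which answers the question above.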

You can also try StandardScaler for data scaling:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(Xtrain)                 # learn mean and std from the training data only
Xtrain = scaler.transform(Xtrain)  # standardize the training data
# later, scale the test data with the statistics learned from the training set:
# Xtest = scaler.transform(Xtest)
Sharmainesharman answered 12/10, 2020 at 16:32 Comment(0)
