data imbalance in SVM using libSVM

How should I set my gamma and cost parameters in libSVM when I am using an imbalanced dataset consisting of 75% 'true' labels and 25% 'false' labels? Due to the imbalance, I constantly get all predicted labels set to 'true'.

If the issue isn't with libSVM but with my dataset, how should I handle this imbalance from a theoretical machine learning standpoint? (The number of features I'm using is between 4 and 10, and I have a small set of 250 data points.)

Burse answered 30/9, 2013 at 8:42 Comment(1)
there is a similar question on the FAQ page which may help: Q: My data are unbalanced. Could libsvm handle such problems? csie.ntu.edu.tw/~cjlin/libsvm/faq.html#f410Lawrenson

Class imbalance has nothing to do with the selection of C and gamma. To deal with this issue you should use a class weighting scheme, which is available in, for example, the scikit-learn package (built on libsvm).
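A minimal sketch of such a class weighting scheme, assuming scikit-learn is available (the data here is synthetic, just mirroring the 75/25 split from the question):

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic imbalanced data: ~75% label 1, ~25% label 0
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = np.where(rng.random(200) < 0.75, 1, 0)

# class_weight='balanced' scales C for class k by n_samples / (n_classes * n_k),
# so misclassifying the minority class is penalized more heavily.
clf = SVC(kernel='rbf', C=1.0, gamma='scale', class_weight='balanced')
clf.fit(X, y)

# Equivalently, pass explicit per-class weights:
clf_manual = SVC(kernel='rbf', class_weight={0: 3.0, 1: 1.0}).fit(X, y)
```

The same effect is achieved on the libsvm command line with its per-class `-wi` weight flags.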

Selection of the best C and gamma is performed using grid search with cross-validation. You should try a vast range of values here: for C it is reasonable to choose values between 1 and 10^15, while a simple and good heuristic for the gamma range is to compute the pairwise distances between all your data points and select gamma values according to percentiles of this distribution. Think of it as placing on each point a Gaussian with variance equal to 1/gamma: if you select a gamma such that this distribution overlaps with many points, you get a very "smooth" model, while a small variance leads to overfitting.
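A sketch of this tuning procedure (synthetic data; the exact percentiles and the C range used here are illustrative choices, not prescriptions):

```python
import numpy as np
from scipy.spatial.distance import pdist
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 5))
y = (X[:, 0] + 0.3 * rng.normal(size=120) > 0).astype(int)

# Heuristic: gamma ~ 1 / (pairwise squared distance), taken at a few percentiles
d2 = pdist(X, metric='sqeuclidean')
gammas = [1.0 / np.percentile(d2, p) for p in (10, 50, 90)]

param_grid = {
    'C': np.logspace(0, 6, 7),  # widen toward 10^15 for real problems
    'gamma': gammas,
}
search = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=5)
search.fit(X, y)
```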

Lamellar answered 30/9, 2013 at 9:16 Comment(2)
The class weighting scheme implies that C gets changed in the actual SVM training problem, so class balance does have something to do with the selection of C, even though it happens behind the scenes.Cutoff
This is a purely linguistic matter; my intention was that tuning C alone won't fix the imbalance problem. The solutions to this problem do change C, however, so I do not see a true contradiction hereLamellar

Imbalanced data sets can be tackled in various ways. Class balance has no effect on kernel parameters such as gamma for the RBF kernel.

The two most popular approaches are:

  1. Use different misclassification penalties per class; this basically means changing C. Typically the smallest class gets weighted more heavily; a common approach is npos * wpos = nneg * wneg. LIBSVM allows you to do this via its -wX flags.
  2. Subsample the overrepresented class to obtain an equal number of positives and negatives, and proceed with training as you traditionally would for a balanced set. Take note that you basically ignore a large chunk of data this way, which is intuitively a bad idea.
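Both approaches can be sketched in a few lines (synthetic labels matching the question's 75/25 split; the `svm-train` invocation in the comment uses LIBSVM's real per-class `-wi` weight flags):

```python
import numpy as np

rng = np.random.default_rng(0)
y = np.array([1] * 75 + [0] * 25)  # 75% positive, 25% negative

# 1) Per-class penalties satisfying npos * wpos == nneg * wneg
npos, nneg = (y == 1).sum(), (y == 0).sum()
wpos, wneg = 1.0, npos / nneg  # here: 1.0 and 3.0
assert npos * wpos == nneg * wneg
# With the libsvm command-line tool: svm-train -w1 1 -w0 3 ...

# 2) Subsample the overrepresented class down to the minority size
pos_idx = np.flatnonzero(y == 1)
neg_idx = np.flatnonzero(y == 0)
keep_pos = rng.choice(pos_idx, size=len(neg_idx), replace=False)
balanced_idx = np.concatenate([keep_pos, neg_idx])
```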
Cutoff answered 1/10, 2013 at 11:58 Comment(4)
Why not oversample the smaller one instead? This won't ignore any informationLamellar
@Lamellar Most situations where the latter strategy is used are large-scale problems (e.g. millions to billions of instances), in which ignoring part of the data is actually used as a hack to lower complexity. Oversampling the smaller set is basically the former approach in an inefficient way (oversampling is exactly the same as reweighing).Cutoff
I'm fully aware of that, just wondering why you did not include this option. The main advantage of oversampling is that it is a generic approach, which can be used even with models (and their implementations) which do not let you weight samples (at cost of efficiency).Lamellar
I think the main reason not to oversample the minority class is that, unless you're very careful to control the process, it will make cross-validation or holdout testing useless since duplicated examples can turn up in multiple folds. I suppose you could extract a holdout set first, then oversample the training set, though.Recessive

I know this has been asked some time ago, but I would like to answer it since you might find my answer useful.

As others have mentioned, you might want to consider assigning different weights to the minority class or using different misclassification penalties. However, there is a cleverer way of dealing with imbalanced datasets.

You can use the SMOTE (Synthetic Minority Over-sampling Technique) algorithm to generate synthetic data for the minority class. It is a simple algorithm that deals with some imbalanced datasets pretty well.

In each iteration, SMOTE picks two random instances of the minority class and adds an artificial example of the same class somewhere between them. The algorithm keeps injecting such samples into the dataset until the two classes become balanced or some other stopping criterion is met (e.g. a certain number of examples has been added). Below you can find a picture describing what the algorithm does for a simple dataset in 2D feature space.

Associating a weight with the minority class is a special case of this algorithm: when you associate weight $w_i$ with instance i, you are basically adding $w_i - 1$ extra copies of instance i!

[Figure: SMOTE adding synthetic minority-class samples between existing ones in a 2D feature space]

  • What you need to do is augment your initial dataset with the samples created by this algorithm and train the SVM on this new dataset. You can also find many implementations online, in different languages such as Python and Matlab.

  • There have been other extensions of this algorithm; I can point you to more material if you want.

  • To test the classifier, you need to split the dataset into train and test sets, add synthetic instances to the train set only (DO NOT ADD ANY TO THE TEST SET), train the model on the train set, and finally evaluate it on the test set. If you include the generated instances when testing, you will end up with biased (and ridiculously high) accuracy and recall.
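The split-first workflow can be sketched as follows. The `smote_oversample` helper here is a simplified stand-in that interpolates random minority pairs, as the description above says; reference SMOTE implementations interpolate toward one of the k nearest minority neighbours instead:

```python
import numpy as np
from sklearn.model_selection import train_test_split

def smote_oversample(X_min, n_new, rng):
    """Create n_new synthetic points by interpolating random minority pairs."""
    i = rng.integers(0, len(X_min), size=n_new)
    j = rng.integers(0, len(X_min), size=n_new)
    t = rng.random((n_new, 1))  # interpolation factor in [0, 1)
    return X_min[i] + t * (X_min[j] - X_min[i])

rng = np.random.default_rng(0)
X = rng.normal(size=(250, 6))
y = np.where(rng.random(250) < 0.75, 1, 0)  # imbalanced labels

# Split BEFORE oversampling so the test set contains no synthetic points
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

minority = X_tr[y_tr == 0]
n_new = (y_tr == 1).sum() - (y_tr == 0).sum()
X_syn = smote_oversample(minority, n_new, rng)

# Train on the augmented set; evaluate only on (X_te, y_te)
X_aug = np.vstack([X_tr, X_syn])
y_aug = np.concatenate([y_tr, np.zeros(n_new, dtype=int)])
```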

Reuter answered 6/5, 2015 at 22:34 Comment(0)
