Using LIBSVM grid.py for unbalanced data?

I have a three-class problem with unbalanced data (90%, 5%, 5%), and I want to train a classifier using LIBSVM.

The problem is that LIBSVM's grid.py tunes the parameters gamma and Cost for the best overall accuracy, which means that 100% of the examples end up classified as class 1. That is of course not what I want.

I've tried modifying the weight parameters (-w) without much success.

So what I want is to modify grid.py so that it optimizes Cost and gamma for per-class precision and recall rather than for overall accuracy. Is there any way to do that? Or are there other scripts out there that can do something like this?

Mythify answered 10/7, 2012 at 9:10 Comment(0)
R
8

The -w parameter is what you need for unbalanced data. What have you tried so far?

If your classes are:

  • class 0: 90%
  • class 1: 5%
  • class 2: 5%

You should pass the following parameters to svm-train, so that the rarer classes carry the larger weights:

-w0 5 -w1 90 -w2 90
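
For more than three classes, one common heuristic (an assumption here, not something LIBSVM prescribes) is to make each weight inversely proportional to the class frequency. A minimal sketch in plain Python; the function names are hypothetical, and only the `-wi` option itself comes from LIBSVM, where it multiplies C for class i:

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Return {class_label: weight} with weight = n_samples / (n_classes * count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


def weight_flags(weights):
    """Format the weights as svm-train -wi options, one pair per class."""
    flags = []
    for label in sorted(weights):
        flags += [f"-w{label}", f"{weights[label]:.4g}"]
    return flags


# Toy label list with the 90% / 5% / 5% split from the question.
labels = [0] * 90 + [1] * 5 + [2] * 5
weights = inverse_frequency_weights(labels)
print(" ".join(weight_flags(weights)))
```

The resulting weights keep the same 1:18:18 ratio as `-w0 5 -w1 90 -w2 90`. Because each weight only rescales C for its class, the ratio between weights is what matters once C itself is tuned.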
Ryan answered 10/7, 2012 at 15:8 Comment(3)
Thanks, but I think it should be the other way round: -w0 5 -w1 90 -w2 90, since the smaller classes should have higher costs associated with them. This one helped! - Mythify
Yeah, I think you're right. I just edited my answer. Thanks! - Ryan
And when you have more than 3 classes, how do you choose the value of each w? - Bacteriophage
A
4

If you want to try an alternative, one of the programs in the svmlight family, http://www.cs.cornell.edu/people/tj/svm_light/svm_rank.html, directly optimizes the area under the ROC curve (AUC).

Maximizing the AUC may give better results than re-weighting training examples.

Adame answered 14/7, 2012 at 13:9 Comment(1)
svmlight is commercial-unfriendly; it's only free for academic use. - Teteak
S
0

You can optimize precision, recall, F-score, or AUC using grid.py. The trick is that you have to change the cross-validation evaluation measure used by svm-train in LIBSVM. Follow the procedure given on the LIBSVM website.
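
To see why the evaluation measure matters, here is a minimal sketch in plain Python of per-class precision, recall, and F-score, plus their macro average. It is not the LIBSVM procedure itself, and the function names are hypothetical. The toy predictions assume the degenerate classifier from the question, which assigns everything to the majority class:

```python
def per_class_scores(y_true, y_pred):
    """Return {class: (precision, recall, f1)} computed one class at a time."""
    scores = {}
    for c in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        # Define a score as 0.0 when its denominator is zero.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores[c] = (precision, recall, f1)
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F-scores."""
    scores = per_class_scores(y_true, y_pred)
    return sum(f1 for _, _, f1 in scores.values()) / len(scores)


# 90% / 5% / 5% data, with every example predicted as the majority class.
y_true = [0] * 90 + [1] * 5 + [2] * 5
y_pred = [0] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print("accuracy:", accuracy)
print("macro F1:", round(macro_f1(y_true, y_pred), 3))
```

Accuracy rewards the all-majority prediction, while the macro-averaged F-score penalizes it because the two minority classes each score zero.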

Selfdefense answered 6/3, 2017 at 13:54 Comment(0)
S
0

If you have unbalanced data, you probably shouldn't be optimizing accuracy. Instead, optimize the F-score (or recall, if that's more important to you). You can change the evaluation function as described here.

I think you should also optimize gamma and Cost, while using different class weight configurations. I modified the "get_cmd" function in grid.py by passing different class weights for that purpose (-wi weight). In my experience, class weighting doesn't always help.
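
The idea of searching over class weights together with Cost and gamma can be sketched as follows. This is not the actual `get_cmd` code from grid.py, only a standalone illustration that builds svm-train command lines. The grid values, the file name `train.scale`, and the function name are assumptions; the options `-c`, `-g`, `-v`, and `-wi` are standard svm-train flags:

```python
import itertools


def build_commands(train_file, log2c_values, log2g_values, weight_configs, folds=5):
    """Build one svm-train argument list per (C, gamma, class-weight) combination."""
    commands = []
    for log2c, log2g, weights in itertools.product(
            log2c_values, log2g_values, weight_configs):
        cmd = ["svm-train",
               "-c", str(2.0 ** log2c),   # cost C
               "-g", str(2.0 ** log2g),   # RBF kernel gamma
               "-v", str(folds)]          # n-fold cross-validation
        for label in sorted(weights):
            cmd += [f"-w{label}", str(weights[label])]
        cmd.append(train_file)
        commands.append(cmd)
    return commands


# Illustrative (hypothetical) grid: 3 values of C, 3 of gamma, 2 weight settings.
weight_configs = [
    {0: 1, 1: 1, 2: 1},      # no re-weighting
    {0: 5, 1: 90, 2: 90},    # weights from the answer above
]
commands = build_commands("train.scale", [-1, 1, 3], [-3, -1, 1], weight_configs)

for cmd in commands[:3]:
    print(" ".join(cmd))
```

Each command could then be run with `subprocess.run` and its cross-validation output parsed. Note that an unmodified svm-train reports accuracy for `-v`, so the evaluation measure still has to be changed separately.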

Solarize answered 22/3, 2017 at 15:51 Comment(0)
