LibSVM turns all my training vectors into support vectors, why?
I am trying to use SVM for News article classification.

I created a table that contains the features (unique words found in the documents) as rows, and built weight vectors against these features: if an article contains a word from the feature table, that position is set to 1; otherwise it is 0.

For example, here is a generated training sample:

1 1:1 2:1 3:1 4:1 5:1 6:1 7:1 8:1 9:1 10:1 11:1 12:1 13:1 14:1 15:1 16:1 17:1 18:1 19:1 20:1 21:1 22:1 23:1 24:1 25:1 26:1 27:1 28:1 29:1 30:1

As this is the first document, all the features are present.
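For reference, this encoding can be sketched in a few lines of Python. The `vocabulary` mapping and the function name below are illustrative, not from the actual project; note that LibSVM's sparse format requires feature indices in ascending order, so the sketch sorts them before formatting.

```python
# Minimal sketch: encode a document as a binary LibSVM-format line.
# `vocabulary` maps each unique word to a 1-based feature index.
def to_libsvm_line(label, words, vocabulary):
    # Collect the indices of features present in the document; LibSVM
    # expects index:value pairs with ascending indices, so sort them.
    indices = sorted({vocabulary[w] for w in words if w in vocabulary})
    return str(label) + " " + " ".join(f"{i}:1" for i in indices)

vocabulary = {"stock": 1, "market": 2, "goal": 3}
print(to_libsvm_line(1, ["market", "stock"], vocabulary))  # 1 1:1 2:1
```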

I am using 1, 0 as class labels.

I am using svm.Net for classification.

I gave 300 weight vectors, manually classified, as training data, and the generated model takes all of the vectors as support vectors, which is surely overfitting.

My total features (unique words/row count in feature vector DB table) is 7610.

What could be the reason?

Because of this overfitting, my project is now in pretty bad shape: it is classifying every available article as a positive article.

In LibSVM binary classification, is there any restriction on the class labels?

I am using 0 and 1 instead of -1 and +1. Is that a problem?

Marylouisemaryly answered 20/4, 2011 at 13:34 Comment(0)
As pointed out, a parameter search is probably a good idea before doing anything else.

I would also investigate the different kernels available to you. The fact that your input data is binary might be problematic for the RBF kernel (or might render its use sub-optimal compared to another kernel). I have no idea which kernel would be better suited, though. Try a linear kernel, and look around for more suggestions/ideas :)

For more information and perhaps better answers, look on stats.stackexchange.com.

Homoiousian answered 22/4, 2011 at 15:50 Comment(0)

You need to do some type of parameter search. Also, if the classes are unbalanced, the classifier can achieve artificially high accuracy without doing much. This guide is good at teaching basic, practical things; you should probably read it.
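A parameter search of the kind recommended here can be sketched with scikit-learn's `GridSearchCV` over `SVC` (a wrapper around LibSVM), standing in for svm.Net. The dataset below is synthetic; substitute the real 300 labelled vectors.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in data: 200 samples, 30 features.
X, y = make_classification(n_samples=200, n_features=30, random_state=0)

# Cross-validated grid search over C and gamma for the RBF kernel.
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.001, 0.01, 0.1]}
grid = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
grid.fit(X, y)
print(grid.best_params_)
```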

Dealt answered 20/4, 2011 at 18:18 Comment(0)

I would definitely try using -1 and +1 for your labels, that's the standard way to do it.

Also, how much data do you have? Since you're working in a 7610-dimensional space, you could potentially have that many support vectors, with a different vector "supporting" the hyperplane in each dimension.

With that many features, you might want to try some type of feature selection or dimensionality reduction, such as principal component analysis.
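The reduction idea can be sketched with scikit-learn's `PCA`: project the high-dimensional binary term vectors down to a handful of components before training. The random matrix and sizes below are illustrative only.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Stand-in data: 50 documents over 100 binary presence/absence features.
X = (rng.random((50, 100)) > 0.5).astype(float)

# Keep the 10 directions of greatest variance.
X_reduced = PCA(n_components=10).fit_transform(X)
print(X_reduced.shape)  # (50, 10)
```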

Cruciform answered 22/4, 2011 at 3:23 Comment(1)
Found the reason: this is happening because SVM.net does not check the validity of the training data. In my training data the feature numbers were not sorted, and as a result it was generating weird results. After sorting the weight vectors by feature number and then generating the model, things are far better... 74% accuracy. Thank you. Marylouisemaryly
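The fix described in this comment can be sketched as a small preprocessing step: sort each training line's index:value pairs by feature index before handing the data to the library. The function name is illustrative.

```python
# Repair a LibSVM-format line whose index:value pairs are out of order,
# since LibSVM expects feature indices in ascending order.
def sort_libsvm_line(line):
    label, *pairs = line.split()
    pairs.sort(key=lambda p: int(p.split(":", 1)[0]))
    return " ".join([label] + pairs)

print(sort_libsvm_line("1 3:1 1:1 2:1"))  # 1 1:1 2:1 3:1
```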

© 2022 - 2024 — McMap. All rights reserved.