What is the relation between the number of Support Vectors and training data and classifiers performance? [closed]

I am using LibSVM to classify some documents. The documents seem to be a bit difficult to classify, as the final results show. However, I noticed something while training my models: if my training set contains, for example, 1000 samples, around 800 of them are selected as support vectors. I have looked everywhere to find out whether this is a good thing or a bad thing. Is there a relation between the number of support vectors and the classifier's performance? I have read this previous post, but I am performing parameter selection, and I am also sure that the attributes in the feature vectors are all ordered. I just need to know the relation. Thanks. P.S.: I use a linear kernel.

Beaver answered 28/2, 2012 at 10:57 Comment(1)
Fwiw, RBF-SVM on 50000 MNIST digits (784d) gives 14398 support vectors, 29 %.Thrombosis

Training a Support Vector Machine is an optimization problem: it attempts to find the hyperplane that divides the two classes with the largest margin. The support vectors are the points that fall on or within this margin. It's easiest to understand if you build it up from the simple case to the more complex ones.

Hard-Margin Linear SVM

In a training set where the data is linearly separable, and you are using a hard margin (no slack allowed), the support vectors are the points which lie along the supporting hyperplanes (the hyperplanes parallel to the dividing hyperplane, at the edges of the margin).

[Figure: hard-margin linear SVM]

All of the support vectors lie exactly on the margin. Regardless of the number of dimensions or the size of the data set, the number of support vectors could be as few as 2.

Soft-Margin Linear SVM

But what if our dataset isn't linearly separable? We introduce the soft-margin SVM. We no longer require that our data points lie outside the margin; we allow some of them to stray over the line into the margin. We use the parameter C to control this (nu in nu-SVM). This gives us a wider margin and greater error on the training dataset, but it improves generalization and/or allows us to find a linear separator for data that is not linearly separable.

[Figure: soft-margin linear SVM]

Now, the number of support vectors depends on how much slack we allow and on the distribution of the data. If we allow a large amount of slack, we will have a large number of support vectors. If we allow very little slack, we will have very few support vectors. The accuracy depends on finding the right level of slack for the data being analyzed. For some data it will not be possible to get a high level of accuracy; we must simply find the best fit we can.
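
To see this effect concretely, here is a minimal sketch (assuming scikit-learn, whose SVC is built on the same libsvm library the question mentions; the synthetic dataset and the C values are arbitrary illustrations) that fits a linear-kernel SVM for several values of C and prints how many training points end up as support vectors:

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# A small, slightly noisy two-class problem; purely illustrative.
X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=5, flip_y=0.1, random_state=0)

for C in (0.01, 0.1, 1, 10, 100):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    n_sv = clf.support_vectors_.shape[0]
    print(f"C={C:<6} support vectors: {n_sv}/{len(X)}")
```

In libsvm's parameterization, smaller C means slack is penalized less, so you should see the support-vector count grow as C shrinks.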

Non-Linear SVM

This brings us to non-linear SVM. We are still trying to linearly divide the data, but we are now trying to do it in a higher dimensional space. This is done via a kernel function, which of course has its own set of parameters. When we translate this back to the original feature space, the result is non-linear:

[Figure: non-linear decision boundary in the original feature space]

Now, the number of support vectors still depends on how much slack we allow, but it also depends on the complexity of our model. Each twist and turn in the final model in our input space requires one or more support vectors to define. Ultimately, the output of an SVM is the set of support vectors and, for each one, an alpha, which in essence defines how much influence that specific support vector has on the final decision.

Here, accuracy depends on the trade-off between a high-complexity model, which may over-fit the data, and a large margin, which will incorrectly classify some of the training data in the interest of better generalization. The number of support vectors can range from very few to every single data point if you completely over-fit your data. This trade-off is controlled via C and through the choice of kernel and kernel parameters.
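
The kernel-parameter side of that trade-off can be probed the same way. Here is a small sketch (again assuming scikit-learn; the dataset and the gamma values are arbitrary illustrations) that fits an RBF-kernel SVM at several values of gamma and reports the support-vector count, so you can watch it change along with model complexity:

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

# Two interleaved half-moons: a classic non-linearly-separable toy set.
X, y = make_moons(n_samples=500, noise=0.2, random_state=0)

for gamma in (0.01, 0.1, 1, 10, 100):
    clf = SVC(kernel="rbf", C=1.0, gamma=gamma).fit(X, y)
    print(f"gamma={gamma:<6} support vectors: {len(clf.support_)}/{len(X)}")
```

Plotting the resulting decision boundaries next to these counts makes the connection between model complexity and support vectors easy to see.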

I assume when you said performance you were referring to accuracy, but I thought I would also speak to performance in terms of computational complexity. In order to classify a test point using an SVM model, you need to compute the dot product (kernel evaluation) of each support vector with the test point. Therefore the computational complexity of classification is linear in the number of support vectors. Fewer support vectors means faster classification of test points.
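
To make that cost concrete, here is a sketch (assuming scikit-learn and a linear kernel; the toy data is arbitrary) that reproduces the SVM's decision value by looping over the support vectors, doing one kernel evaluation (here a plain dot product) per support vector, which is exactly the work done for every test point:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
clf = SVC(kernel="linear", C=1.0).fit(X, y)

x_test = X[0]

# decision(x) = sum_i (alpha_i * y_i) * <sv_i, x> + b
# dual_coef_ already stores alpha_i * y_i for each support vector.
manual = sum(coef * np.dot(sv, x_test)
             for coef, sv in zip(clf.dual_coef_[0], clf.support_vectors_))
manual += clf.intercept_[0]

# Should agree with the library's own decision value (up to float error).
print(manual, clf.decision_function([x_test])[0])
```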

A good resource: A Tutorial on Support Vector Machines for Pattern Recognition

Tegular answered 28/2, 2012 at 15:33 Comment(3)
great answer! but the link is not working anymore...can you please update it?Dextrocular
A Tutorial on Support Vector Machines for Pattern Recognition (Archived from Cached version).Voracious
"The number of support vectors can range from very few to every single data point if you completely over-fit your data." To summarise, large number of support vectors is not good. So the question is, is 800 SVs out of 1000 training samples "large" ?Muzhik

800 out of 1000 basically tells you that the SVM needs to use almost every single training sample to encode the training set, which in turn tells you that there isn't much regularity in your data.

Sounds like you have major issues with not enough training data. Also, maybe think about some specific features that separate this data better.

Cassiodorus answered 28/2, 2012 at 15:58 Comment(2)
I'd selected this as the answer. The long answers before are just irrelevant C&P SVM explanationsPacifically
I agree. Although the other answers tried to give a good summary, this is the most relevant to the o.p. If the portion of S.V. is large it indicates memorizing, not learning, and this means bad generalization => out of sample error (test set error) will be large.Kennel

Both the number of samples and the number of attributes may influence the number of support vectors, making the model more complex. I believe you use words or even n-grams as attributes, so there are quite a lot of them, and natural language models are very complex in themselves. So, 800 support vectors out of 1000 samples seems to be OK. (Also pay attention to @karenu's comments about the C/nu parameters, which also have a large effect on the number of SVs.)

To get some intuition about this, recall the main idea of the SVM. The SVM works in a multidimensional feature space and tries to find the hyperplane that separates all the given samples. If you have a lot of samples and only 2 features (2 dimensions), the data and the hyperplane may look like this:

[Figure: 2-D data separated by a hyperplane defined by 3 support vectors]

Here there are only 3 support vectors; all the others lie behind them and thus don't play any role. Note that these support vectors are defined by only 2 coordinates.

Now imagine that you have a 3-dimensional space, so the support vectors are defined by 3 coordinates.

[Figure: 3-D data separated by a plane]

This means that there's one more parameter (coordinate) to be adjusted, and this adjustment may need more samples to find the optimal hyperplane. In other words, in the worst case the SVM finds only 1 hyperplane coordinate per sample.

When the data is well-structured (i.e. holds its patterns quite well), only a few support vectors may be needed - all the others will stay behind them. But text is very, very badly structured data. The SVM does its best, trying to fit the samples as well as possible, and thus takes even more samples as support vectors than it drops. With an increasing number of samples this "anomaly" is reduced (more insignificant samples appear), but the absolute number of support vectors stays very high.
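
For reference, here is roughly what such a text setup looks like in code (a sketch assuming scikit-learn, with a tiny made-up corpus standing in for the real documents), reporting what fraction of the training samples end up as support vectors:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

# Tiny made-up corpus; replace with your real documents and labels.
docs = ["buy cheap viagra now", "money back guarantee, buy now",
        "hello, see you at home tonight", "kind regards and best wishes",
        "cheap money, act now", "hello from home, regards"]
labels = [1, 1, 0, 0, 1, 0]  # 1 = spam, 0 = ham

X = TfidfVectorizer().fit_transform(docs)   # sparse bag-of-words features
clf = SVC(kernel="linear", C=1.0).fit(X, labels)

n_sv = len(clf.support_)
print(f"{n_sv} of {X.shape[0]} samples are support vectors ({n_sv / X.shape[0]:.0%})")
```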

Cobbie answered 28/2, 2012 at 12:32 Comment(8)
Thanks for your answer! Do you have any references for what you mentioned in the last paragraph? "When the data is well-structured (i.e. holds its patterns quite well), only a few support vectors may be needed - all the others will stay behind them. But text is very, very badly structured data. The SVM does its best, trying to fit the samples as well as possible, and thus takes even more samples as support vectors than it drops." thxBeaver
This is not correct - you can have a 3 dimensional data set with only 2 support vectors, if the dataset is linearly separable and has the right distribution. You can also have the same exact dataset and end up with 80% support vectors. It all depends on how you set C. In fact, in nu-svm you can control the number of support vectors by setting nu very low (.1)Tegular
@karenu: I didn't say that growth in the number of attributes always leads to growth in the number of support vectors; I just said that even with fixed C/nu parameters the number of support vectors depends on the number of feature dimensions and the number of samples. And for text data, which is very badly structured by its nature, the number of support vectors inside the margin (hard-margin SVM is inapplicable to text classification even with higher-order kernels) will always be high.Cobbie
@Hossein: I mean linear separability. Imagine the task of spam classification. If your spam messages almost always contain words like "Viagra", "buy", "money" and your ham messages contain only "home", "hello", "regards", then your data is well structured and can easily be separated on these word vectors. However, in practice you have a mix of good and bad words and thus your data doesn't have any obvious patterns. If you have 3 words from the spam dict and 3 from the ham dict, how should you classify the message? You need more features, and that's one of the reasons why the SVM uses more support vectors.Cobbie
@Cobbie I find that misleading. Saying it depends to me makes it sound like if your dataset increases your # of sv will increase, that there is somehow a relationship between # of samples (or # of dimensions) and # of support vectors. There is a relationship between the complexity of the model and SVs, and bigger datasets with higher dimensions do have a tendency to have more complex models, but the size of the dataset or dimensionality doesn't directly dictate the # of SVs.Tegular
@karenu: I agree that this is just tendency, and I never said that # of dimensions or # of samples has direct impact on the number of support vectors. You are trying to prove that my answer is incorrect because it doesn't cover all details of SVM. But I'm not trying to do that, because question is not about it. Author was confused about high number of SVs, so I tried to explain this phenomena with text documents in the most simple way. There's plenty papers on SVM, so I believe if author was interested in all details, he could read them himself.Cobbie
Maybe you could reword your answer then? When you say "Number of support vectors depends on both - number of samples and number of attributes." it sounds to me like you are saying exactly that, perhaps that is why we are misunderstanding each other.Tegular
@karenu: ah, sure, I'll reword it and add some notes.Cobbie

SVM classification time is linear in the number of support vectors (SVs). The number of SVs is, in the worst case, equal to the number of training samples, so 800/1000 is not yet the worst case, but it's still pretty bad.

Then again, 1000 training documents is a small training set. You should check what happens when you scale up to 10000s or more documents. If things don't improve, consider using linear SVMs, trained with LibLinear, for document classification; those scale up much better (model size and classification time are linear in the number of features and independent of the number of training samples).
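
A rough sketch of that comparison (assuming scikit-learn, whose LinearSVC wraps LibLinear and whose SVC with a linear kernel uses libsvm; the synthetic data is arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC, LinearSVC

X, y = make_classification(n_samples=2000, n_features=300, random_state=0)

kernel_svm = SVC(kernel="linear").fit(X, y)   # libsvm: stores the support vectors
linear_svm = LinearSVC().fit(X, y)            # liblinear: stores one weight vector

print("SVC model:", kernel_svm.support_vectors_.shape)   # (n_SV, n_features)
print("LinearSVC model:", linear_svm.coef_.shape)        # (1, n_features)
```

LinearSVC's prediction is a single dot product with coef_, so its cost depends only on the number of features, whereas the libsvm-based model stores every support vector and evaluates the kernel against each of them at prediction time.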

Aorta answered 28/2, 2012 at 11:3 Comment(10)
Just curious, what makes you assume the OP isn't already using a linear SVM? I must have missed it if was using some non-linear kernel.Cassiodorus
@ChrisA.: they might be using a linear SVM, but then one implemented in LibSVM, which is suboptimal compared to the ones in LibLinear.Aorta
@Hossein: then use LibLinear instead of LibSVM. It's much faster in training and its classification does not use support vectors but a single coefficient vector.Aorta
It's suboptimal in terms of speed only, right? I thought all SVMs were QPs, convex optimization problems with a unique minimum.Cassiodorus
@ChrisA.: yes, I'm talking about speed only. The accuracy should be roughly the same when the same settings are used (although both LibSVM and LibLinear use some randomization, so it's not even guaranteed to be the same across multiple runs of the same training algorithm).Aorta
Wait, does this randomization make its way through to the final classifier? I haven't looked at the code for either library, but this violates my whole understanding of this being a convex optimization problem with a unique minimum.Cassiodorus
This randomization is only done in the training stage as a speed-up. The optimization problem is indeed convex.Aorta
Randomization just helps it converge to the optimal solution faster.Tegular
In theory, yes, but I've seen it cause small changes in the final accuracy.Aorta
@larsmans "1000 training documents is a small training set." I'm interested why you think 1000 is "small" here. Are there some results somewhere on the number of training points needed vs the number of attributes?Woolfolk

There is some confusion between sources. In the textbook ISLR 6th Ed, for instance, C is described as a "boundary violation budget", from which it follows that a higher C allows more boundary violations and hence more support vectors. But in the SVM implementations in R and Python the parameter C is implemented as a violation penalty, which is the opposite, and there you will observe that higher values of C yield fewer support vectors.

Phuongphycology answered 15/2, 2018 at 14:22 Comment(0)
