SVM - relation between the number of training samples and the number of features

What should be the relation between the number of training samples and the dimensionality of the training data?

For example, I have a case with 20,000 training samples and 16,000 features. I am considering using PCA for dimensionality reduction, but I do not know how many dimensions I should reduce my training data to. Is there a relation between these? I am using a Support Vector Machine classifier with 2 classes and a linear kernel.

Testy answered 6/11, 2013 at 13:33 Comment(4)
For two-class discrimination with a multivariate Gaussian distribution of independent input features, the number of training samples per class should be greater than three times the number of features (ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=1054863). Cover has shown that if the total number of training samples is less than twice the number of features, then there exists a hyperplane which can separate the training data perfectly even if the two classes are generated by the same distribution (dtic.mil/dtic/tr/fulltext/u2/a229035.pdf) (pg 46).Java
I chanced upon this during my neural network studies. I am not sure how relevant the results are to the SVM formulation, but they may be useful as a rule of thumb.Java
This question appears to be off-topic because it is not about programming. This belongs on stats.stackexchange.com.Surber
Did any of the answers help you? If not, why not?Anthracite

The solution found by the SVM is automatically restricted to the space spanned by the samples, so using PCA to just get rid of dimensions with zero variance won't change the solution. And, as damienfrancois wrote, if you reduce beyond that you risk destroying relevant information. To avoid that you have two options:

1) Trust that structural risk minimization is not just an interesting theoretical concept, but also does the right thing for your application, and just use the data as they are.

2) Use a feature selection algorithm to find out which features / combinations of features are actually informative. However, finding the optimal combination of features is clearly not feasible for that many features, so you can probably just sort the features by individual performance (in the linear case: a t-score) and then test how many of the best features you need to get a good result; a sketch of this is given below.
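A minimal sketch of this rank-then-test idea, assuming scikit-learn is available; `X` and `y` are placeholder random data standing in for the real training set, and the specific values of `k` are arbitrary:

```python
# Rank features by a univariate score, then test how many of the
# top-ranked features are needed, scoring each subset by cross-validation.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 500))     # placeholder for the real feature matrix
y = rng.integers(0, 2, size=200)    # placeholder binary labels

for k in [10, 50, 100, 200]:
    # f_classif is an ANOVA F-test; for two classes this is equivalent
    # to ranking features by their squared t-score.
    model = make_pipeline(SelectKBest(f_classif, k=k),
                          LinearSVC(max_iter=5000))
    scores = cross_val_score(model, X, y, cv=5)
    print(f"top {k:>3} features: mean CV accuracy {scores.mean():.3f}")
```

Putting the selection step inside the pipeline means it is refit on each cross-validation fold, so the reported accuracy is not biased by selecting features on the test folds.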

Zaw Lin's comment is of course correct: you can always separate classes in such a high-dimensional space. But, equally of course, classifier performance should be assessed not on the training data but e.g. using cross-validation.

Anthracite answered 6/11, 2013 at 18:26 Comment(0)

It is all a matter of the intrinsic dimensionality of the data and the VC dimension. But those theoretical concepts won't be of much help in practice.

In practice, with a properly regularized linear SVM, you might simply work with the 16k features.
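As a rough sketch of that approach (assuming scikit-learn; the data here are placeholders for the actual 20000 x 16000 matrix): keep all the features and let cross-validated tuning of the regularization parameter C control complexity.

```python
# Work directly with all features; complexity is controlled by the
# C parameter of a linear SVM, chosen by cross-validated grid search.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 1000))    # stand-in for the real training data
y = rng.integers(0, 2, size=300)

search = GridSearchCV(LinearSVC(max_iter=5000),
                      param_grid={"C": [0.01, 0.1, 1, 10]},
                      cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```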

If you want to use PCA, look at the scree plot to get the percentage of variance kept when using the n principal components with the highest corresponding eigenvalues. But PCA-based feature extraction will 'destroy' information if the relationship between the label and the data is highly nonlinear.
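One way to inspect that numerically (a sketch, assuming scikit-learn; `X` is placeholder data):

```python
# Fit PCA and look at the cumulative explained-variance ratio
# (the numeric counterpart of a scree plot) to choose the number of components.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 200))     # placeholder for the real training data

pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)
n_95 = int(np.searchsorted(cumulative, 0.95)) + 1
print(f"{n_95} components keep 95% of the variance")
# Equivalently, PCA(n_components=0.95) selects that number automatically.
```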

Another option is to use an L1-regularized model such as the LASSO, which drives the coefficients of uninformative features to zero and thus performs implicit feature selection; a sketch follows.
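A small sketch of that idea (assuming scikit-learn; since the task here is two-class classification, L1-penalized logistic regression is used as the classification counterpart of the LASSO, and the data are placeholders):

```python
# L1-regularized logistic regression: the L1 penalty zeroes out the
# coefficients of uninformative features, acting as feature selection.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 1000))    # placeholder high-dimensional data
y = rng.integers(0, 2, size=300)

clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
clf.fit(X, y)
print("non-zero coefficients:", int(np.count_nonzero(clf.coef_)))
```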

Scandal answered 6/11, 2013 at 13:55 Comment(0)
