libsvm Shrinking Heuristics
I'm using libsvm in C-SVC mode with a polynomial kernel of degree 2 and I'm required to train multiple SVMs. During training, I am getting either one or even both of these warnings for some of the SVMs that I train:

WARNING: using -h 0 may be faster
*
WARNING: reaching max number of iterations
optimization finished, #iter = 10000000

I've found the description for the h parameter:

-h shrinking : whether to use the shrinking heuristics, 0 or 1 (default 1)

and I've tried to read the explanation from the libsvm documentation, but it's a bit too high level for me. Can anyone please provide a layman's explanation and, perhaps, some suggestions along the lines of "setting this would be beneficial because..."? Also, it would be helpful to know whether setting this parameter for all the SVMs that I train might have a negative impact on accuracy for those SVMs that do not explicitly give this warning.

I'm not sure what to make of the other warning.

Just to give more details: my training sets have 10 attributes (features) and they consist of 5000 vectors.
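For context, my setup can be reproduced with a short sketch. I'm using scikit-learn's SVC here (a wrapper around libsvm) instead of the svm-train CLI, so the parameter mapping is an assumption on my part: shrinking=False corresponds to libsvm's -h 0, and kernel="poly", degree=2 to -t 1 -d 2. The data below is synthetic, just to have something runnable:

```python
# Hypothetical sketch of the setup: degree-2 polynomial C-SVC via
# scikit-learn's SVC (a libsvm wrapper), on synthetic data shaped like
# mine (10 features; fewer vectors here to keep the sketch fast).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 10))      # vectors with 10 features
y = (X[:, 0] * X[:, 1] > 0).astype(int)  # synthetic quadratic labels

# shrinking=False is the equivalent of libsvm's -h 0 flag
clf = SVC(kernel="poly", degree=2, coef0=1.0, C=1.0, shrinking=False)
clf.fit(X, y)
print(clf.score(X, y))  # training accuracy
```

Toggling shrinking=True/False here changes only the optimization path, not the model being solved for, which is why it shows up as a speed warning rather than an accuracy one.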


Update:

In case anybody else is getting the "reaching max number of iterations" warning, it seems to be caused by numeric stability issues. It will also make training very slow. Polynomial kernels do benefit from using cross-validation to determine the best value for regularization (the C parameter), and, in the case of polynomial kernels, for me it helped to keep it smaller than 8. Also, if the kernel is inhomogeneous, (\gamma \sum_i x_i s_i + coef0)^d (sorry, LaTeX is not supported on SO), where coef0 != 0, then cross-validation can be implemented with a grid search over both gamma and C, since, in this case, the default value for gamma (1 / number_of_features) might not be the best choice. Still, from my experiments, you probably do not want gamma to be too big, since it will cause numeric issues (I am trying a maximum value of 8 for it).

For further inspiration on the possible values for gamma and C one should try poking in grid.py.
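The grid search described above can be sketched like this; scikit-learn's GridSearchCV is standing in for grid.py (an assumption on my part), and the value ranges follow the advice of keeping both C and gamma at 8 or below:

```python
# Sketch of a cross-validated grid search over C, gamma and coef0 for a
# degree-2 polynomial kernel, using scikit-learn's libsvm wrapper.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.standard_normal((400, 10))   # synthetic stand-in data
y = (X[:, 0] > 0).astype(int)

param_grid = {
    "C": [0.5, 1, 2, 4, 8],            # regularization, kept <= 8
    "gamma": [0.01, 0.1, 1, 4, 8],     # 0.1 is the default 1/num_features
    "coef0": [0, 1],                   # kernel is inhomogeneous when != 0
}
search = GridSearchCV(SVC(kernel="poly", degree=2), param_grid, cv=3)
search.fit(X, y)
print(search.best_params_)
```

grid.py does essentially the same thing for C and gamma on a log-scaled grid, with the results plotted as a contour map.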

Herpetology answered 19/9, 2012 at 17:55 Comment(3)
Please explain how you arrived at gamma equal to 1 over the number of features, and at an upper limit of eight for gamma. Thanks. – Vastha
@CloudCho It has been quite a few years since then and I can't recall precisely, but I believe I started with the default value (1/num_features – see here) and increased it gradually until I started getting that max iterations warning. If you want some good starting values for gamma and C, you'll need to trace how these values get transformed until they're fed to svmtrain. – Herpetology
@CloudCho Also, it's super important to scale your training data before trying to train a model, because otherwise you'll run into numerical issues and your model will perform poorly. libsvm provides a tool called svm-scale for this purpose. See here. – Herpetology

The shrinking heuristics are there to speed up the optimization. As it says in the FAQ, they sometimes help, and sometimes they do not. I believe it's a matter of runtime, rather than convergence.

The fact that the optimization reaches the maximum number of iterations is interesting, though. You might want to play with the stopping tolerance (-e) or the cost parameter (-c), or have a look at the individual problems that cause this. Are the datasets large?
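A minimal sketch of those knobs, assuming scikit-learn's SVC (a libsvm wrapper), where tol corresponds to libsvm's -e (default 1e-3) and C to -c:

```python
# Sketch: loosen the stopping tolerance and cap the iteration count for
# a run that is struggling to converge. Uses scikit-learn's libsvm wrapper.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 10))
y = (X[:, 0] > 0).astype(int)

clf = SVC(kernel="poly", degree=2, C=1.0, tol=1e-2, max_iter=1_000_000)
clf.fit(X, y)
print(clf.fit_status_)  # 0 means the solver converged within max_iter
```

Loosening tol trades a slightly less precise solution for fewer iterations; lowering C tends to make the problem better conditioned as well.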

Camper answered 20/9, 2012 at 10:30 Comment(14)
Thanks for the answer! I think you are right regarding the shrinking heuristics. They just help train the models faster. – Herpetology
Regarding the maximum iterations, my datasets have 5000 items each. The training takes less than one minute. What is the cost parameter? Is it the regularization? Right now I'm just setting it to 1, the default value in libsvm... – Herpetology
Oh, I should clarify this: my training sets have 10 attributes / features and they consist of 5000 vectors. – Herpetology
@MihaiTodor That should not present a problem for SVM, I think, unless you have many points with different labels and exactly the same feature vectors. The cost parameter is -c in LIBSVM; it defines how much you penalize classification errors. If it's too high, and the dataset isn't linearly separable in your kernel space, it might cause trouble. – Camper
@MihaiTodor Did you mean 5000 training instances and a 10-dimensional feature space? – Camper
Yes, that is what I have, and right now I'm seeing the iterations warning with c set to 1 (the default value). I plan to tweak both c and gamma to get better accuracy using cross-validation and grid search, but what should I do when I get this warning? – Herpetology
@MihaiTodor Check the datasets that cause trouble. It really shouldn't take so long with only 10 dimensions. Any particular reason for using a polynomial kernel, by the way? – Camper
Well, the short story is that we want to build a secure recommender system that receives an encrypted vector from a client and exploits the homomorphic addition property of certain public-key encryption schemes in order to compute the prediction. Because the scheme is not fully homomorphic, we can only use certain kernels, like the polynomial one. How should I "check" the datasets and which method should I use? Eye inspection does not reveal much. Here is a sample: part1 and part2 – Herpetology
I see... Make sure you scale the data prior to training. See svm-scale for details. I think that might be the problem. – Camper
Yes, I was also thinking about this, but, unfortunately, due to the nature of our system, that will not be an option, because we won't be able to match the same feature scaling on the client data. So, based on your experience and the data I provided, what should I expect from the model when I get this warning? – Herpetology
Why not? You don't have to adjust the scaling for the test data, just reapply a given one, determined from the training data: csie.ntu.edu.tw/~cjlin/libsvm/faq.html#f407 LIBSVM expects scaled data, at least roughly in the [-1;1] interval, and it seems to solve the problem with the test data you posted above. – Camper
As to the warning, if the model did not converge, you shouldn't make any assumptions about its performance on the test data. I.e., it might just yield random results. – Camper
I see... Well, the problem is that the test data needs to be fed to 30+ SVMs, so either I have to ask the client to scale the data for each particular SVM, encrypt it and send it (not acceptable), or I have to ask the server to scale the encrypted data (not feasible / interactive secure division is way too expensive). I'll have to discuss with my colleagues and see if we can find a way to get around this issue. Thank you very much. – Herpetology
@Qnan High values for C don't cause problems for soft-margin SVM, whether the data is separable or not. Training time may be higher, but the optimization problem is always feasible for any positive, finite value of C. – Outdoors
