SVC classifier taking too much time for training

I am using an SVC classifier with a linear kernel to train my model. Training data: 42,000 records.

    from sklearn.svm import SVC

    model = SVC(kernel='linear', probability=True)
    model.fit(self.features_train, self.labels_train)
    y_pred = model.predict(self.features_test)
    train_accuracy = model.score(self.features_train, self.labels_train)
    test_accuracy = model.score(self.features_test, self.labels_test)

It takes more than 2 hours to train the model. Am I doing something wrong? Also, what can be done to reduce the training time?

Thanks in advance

Woolpack answered 27/12, 2018 at 5:43 Comment(4)
How many features are there per training example? - Rosemonde
Actually, the data is text. Per record, the length varies from 100 to 200 words. - Woolpack
Are you using some kind of word2vec? If yes, check the embedding dimension. - Glasshouse
No, using TfidfVectorizer, not word2vec. - Woolpack

There are several possibilities to speed up your SVM training. Let n be the number of records, and d the embedding dimensionality. I assume you use scikit-learn.

  • Reducing training set size. Quoting the docs:

    The fit time complexity is more than quadratic with the number of samples which makes it hard to scale to dataset with more than a couple of 10000 samples.

    O(n^2) complexity will most likely dominate the other factors, so sampling fewer records for training will have the largest impact on runtime (see the first sketch after this list). Besides random sampling, you could also try instance selection methods; for example, principal sample analysis has been proposed recently.

  • Reducing dimensionality. As others have hinted in the comments, the embedding dimension also impacts runtime: computing the inner product for the linear kernel is O(d), so dimensionality reduction also reduces runtime. In another question, latent semantic indexing was suggested specifically for TF-IDF representations (see the second sketch after this list).

  • Parameters. Use SVC(probability=False) unless you need the probabilities, because they "will slow down that method" (from the docs).
  • Implementation. To the best of my knowledge, scikit-learn just wraps around LIBSVM and LIBLINEAR. I am speculating here, but you may be able to speed this up by using efficient BLAS libraries, such as in Intel's MKL.
  • Different classifier. You may try sklearn.svm.LinearSVC, which is...

    [s]imilar to SVC with parameter kernel='linear', but implemented in terms of liblinear rather than libsvm, so it has more flexibility in the choice of penalties and loss functions and should scale better to large numbers of samples.

    Moreover, a scikit-learn dev suggested the kernel_approximation module in a similar question; the second sketch below uses it as well.
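
To make the sampling and LinearSVC points concrete, here is a minimal sketch. It assumes the variable names from the question (features_train, labels_train, and so on, without the self. prefix) and an arbitrary cap of 10,000 samples; both are illustrative.

    import numpy as np
    from sklearn.svm import LinearSVC

    # Randomly subsample the training set; the 10,000 cap is arbitrary.
    rng = np.random.default_rng(0)
    n = features_train.shape[0]
    idx = rng.choice(n, size=min(10_000, n), replace=False)

    # LinearSVC is liblinear-based and scales much better with n than
    # SVC(kernel='linear'); note it does not provide predict_proba.
    model = LinearSVC()
    model.fit(features_train[idx], np.asarray(labels_train)[idx])
    test_accuracy = model.score(features_test, labels_test)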
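
A second sketch covers the dimensionality-reduction and kernel-approximation suggestions: TruncatedSVD performs latent semantic indexing on a TF-IDF matrix, and Nystroem feeds an approximate RBF feature map into a fast linear model. The component counts are assumptions to tune, not recommendations.

    from sklearn.decomposition import TruncatedSVD
    from sklearn.kernel_approximation import Nystroem
    from sklearn.linear_model import SGDClassifier
    from sklearn.pipeline import make_pipeline

    pipeline = make_pipeline(
        TruncatedSVD(n_components=300),             # LSI: reduce d
        Nystroem(kernel='rbf', n_components=500),   # approximate kernel map
        SGDClassifier(loss='hinge'),                # hinge loss = linear SVM
    )
    pipeline.fit(features_train, labels_train)
    test_accuracy = pipeline.score(features_test, labels_test)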

Dhiren answered 2/1, 2019 at 9:37 Comment(1)
@Deepankar Please consider upvoting my answer in addition to accepting it, if you find it useful. Thanks :) - Dhiren

I had the same issue, but scaling the data solved the problem:

    # Feature scaling
    from sklearn.preprocessing import StandardScaler

    sc = StandardScaler()
    X_train = sc.fit_transform(X_train)
    X_test = sc.transform(X_test)
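
Note that if the features are sparse matrices from TfidfVectorizer, as in the question, the scaler must be created as StandardScaler(with_mean=False): centering would densify the sparse matrix, and scikit-learn raises an error otherwise.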
Viosterol answered 16/8, 2021 at 7:57 Comment(0)

You can try an accelerated implementation of the algorithms, such as scikit-learn-intelex: https://github.com/intel/scikit-learn-intelex

For SVM in particular, you should see noticeably higher compute efficiency.

First, install the package:

    pip install scikit-learn-intelex

Then add the following to your Python script:

    from sklearnex import patch_sklearn
    patch_sklearn()

Note that: "You have to import scikit-learn after these lines. Otherwise, the patching will not affect the original scikit-learn estimators." (from docs)
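
A minimal sketch of the required ordering (the variable names are just illustrative):

    from sklearnex import patch_sklearn
    patch_sklearn()

    # Import scikit-learn only after patching, so SVC resolves to the
    # accelerated implementation.
    from sklearn.svm import SVC

    model = SVC(kernel='linear')
    model.fit(features_train, labels_train)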

Patino answered 12/2, 2022 at 9:34 Comment(0)

Try using the following code. I had a similar issue with training data of a similar size; I changed it to the following and it ran much faster:

model = SVC(gamma='auto') 
Shirberg answered 8/10, 2019 at 10:21 Comment(0)
