Converting LinearSVC's decision function to probabilities (scikit-learn, Python)

I use a linear SVM from scikit-learn (LinearSVC) for a binary classification problem. I understand that LinearSVC can give me the predicted labels and the decision scores, but I want probability estimates (confidence in the label). I want to continue using LinearSVC because of speed (compared to sklearn.svm.SVC with a linear kernel). Is it reasonable to use a logistic function to convert the decision scores to probabilities?

import sklearn.svm as suppmach
# Fit model (penalty='l1' requires dual=False with the default squared hinge loss):
svmmodel = suppmach.LinearSVC(penalty='l1', dual=False, C=1)
svmmodel.fit(x_train, y_train)
predicted_test = svmmodel.predict(x_test)
predicted_test_scores = svmmodel.decision_function(x_test)

I want to check if it makes sense to obtain probability estimates simply as 1 / (1 + exp(-x)), where x is the decision score.
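For concreteness, the naive transform I have in mind would just be a sigmoid applied to the scores from the code above, something like:

import numpy as np
# naive, uncalibrated squashing of the raw decision scores into (0, 1)
naive_proba = 1.0 / (1.0 + np.exp(-predicted_test_scores))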

Alternatively, are there other classifiers I could use to do this efficiently?

Thanks.

Thickening answered 21/10, 2014 at 2:29 Comment(0)

I took a look at the APIs in the sklearn.svm.* family. All of the models below,

  • sklearn.svm.SVC
  • sklearn.svm.NuSVC
  • sklearn.svm.SVR
  • sklearn.svm.NuSVR

have a common interface that supplies a

probability: boolean, optional (default=False) 

parameter to the model. If this parameter is set to True, libsvm will train a probability transformation model on top of the SVM's outputs, based on the idea of Platt scaling. The form of the transformation is similar to the logistic function you mentioned; however, two constants A and B are learned in a post-processing step, so the estimate becomes P(y=1 | f(x)) = 1 / (1 + exp(A*f(x) + B)), where f(x) is the decision score. Also see this Stack Overflow post for more details.


I actually don't know why this post-processing is not available for LinearSVC. If it were, you would just call predict_proba(X) to get the probability estimates.

Of course, if you just apply a naive logistic transform, it will not perform as well as a calibrated approach like Platt scaling. If you understand the underlying algorithm of Platt scaling, you could probably write your own or contribute to the scikit-learn svm family. :) Also feel free to use the above four SVM variations that support predict_proba.
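For example, with SVC and a linear kernel that would look roughly like this (a minimal sketch; X_train, y_train and X_test stand in for your own data):

from sklearn.svm import SVC

# probability=True makes libsvm fit the Platt scaling constants A and B
# in an extra internal cross-validation step after the SVM itself is trained
model = SVC(kernel='linear', probability=True)
model.fit(X_train, y_train)
proba = model.predict_proba(X_test)  # one column of probabilities per class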

Comte answered 21/10, 2014 at 3:32 Comment(4)
Thank you @Comte for the response. All that you said above makes complete sense and I've accepted it as the answer. However, the reason I'm not using any other classifier is that they are usually much slower than sklearn.svm.LinearSVC. I'll keep looking for a while more and will update here if I find something.Thickening
It's not available because it isn't built into Liblinear, which implements LinearSVC, and also because LogisticRegression is already available (though linear SVM + Platt scaling might have some benefits over straight LR, I never tried that). The Platt scaling in SVC comes from LibSVM.Dimarco
Another possible issue is that using LinearSVC allows choosing a different penalty than the default 'l2'. SVC does not allow this, since I guess LibSVM does not allow this.Tuchun
I used both SVC(kernel='linear', **kwargs) and CalibratedClassifierCV(LinearSVC(**kwargs)), but I got different results...Metagnathous

scikit-learn provides CalibratedClassifierCV, which can be used to solve this problem: it lets you add probability output to LinearSVC or any other classifier that implements the decision_function method:

from sklearn.calibration import CalibratedClassifierCV
from sklearn.svm import LinearSVC

svm = LinearSVC()
clf = CalibratedClassifierCV(svm)
clf.fit(X_train, y_train)
y_proba = clf.predict_proba(X_test)

The user guide has a nice section on that. By default, CalibratedClassifierCV + LinearSVC will give you Platt scaling, but it also provides other options (an isotonic regression method), and it is not limited to SVM classifiers.
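For instance, switching from the default sigmoid (Platt) calibration to isotonic regression is just a keyword change (a sketch; cv=5 is an arbitrary choice):

from sklearn.calibration import CalibratedClassifierCV
from sklearn.svm import LinearSVC

# method='isotonic' replaces Platt's sigmoid fit with isotonic regression;
# cv controls the internal cross-validation used to fit the calibrator
clf_iso = CalibratedClassifierCV(LinearSVC(), method='isotonic', cv=5)
clf_iso.fit(X_train, y_train)
y_proba_iso = clf_iso.predict_proba(X_test)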

Comeon answered 26/9, 2016 at 21:19 Comment(4)
Any idea how this can be used in grid search? Trying to set the parameters e.g. base_estimator__C but GridSearchCV doesn't swallow that.Lelialelith
base_estimator__C looks correct. I suggest providing a complete example and opening a new SO question.Comeon
Not fitting the svm before I fit the clf gives me an error, so I have to train both. I think nothing changes, is that right?Odericus
Oh my god, this is sooooo much faster (and similar performance in my case)Eld

If you want speed, then just replace the SVM with sklearn.linear_model.LogisticRegression. That uses the exact same training algorithm as LinearSVC, but with log-loss instead of hinge loss.

Using [1 / (1 + exp(-x))] will produce probabilities, in a formal sense (numbers between zero and one), but they won't adhere to any justifiable probability model.
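A minimal sketch of that swap (the solver choice and data names are just placeholders):

from sklearn.linear_model import LogisticRegression

# liblinear is the same optimization backend that LinearSVC uses
lr = LogisticRegression(C=1.0, solver='liblinear')
lr.fit(X_train, y_train)
proba = lr.predict_proba(X_test)  # probabilities backed by the log-loss model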

Dimarco answered 21/10, 2014 at 21:5 Comment(5)
This one should be the real answer. I replaced my sklearn.svm.SVC with sklearn.linear_model.LogisticRegression and not only got similar ROC curves but the time difference is so huge for my dataset (seconds vs. hours) that it's not even worth a timeit. It's worth noting too that you can specify your solver to be 'liblinear' which really would make it exactly the same as LinearSVC.Inaccuracy
what would be the x value in the equation [1 / (1 + exp(-x))]?Servomotor
I don't consider this as an appropriate solution to getting probabilities with SVM as Fred did note. LR is intended for probability estimation of independent signals via logistic function. SVM is intended to provide better accuracy and attempt to not overfit, but the probability estimates you would get are less accurate via hinge function. It penalizes mispredictions. Readers, please understand the tradeoff and select the most appropriate function for your learning objective. I'm going with LinearSVC+CalibratedClassifierCV personally.Stigma
@thefourtheye: LinearSVC states: "Similar to SVC with parameter kernel=’linear’, but implemented in terms of liblinear rather than libsvm, so it has more flexibility in the choice of penalties and loss functions and should scale better to large numbers of samples." So if you had used LinearSVC, as OP did, you would have used liblinear (just as your LogisticRegression) and it would also have been fast. So it is not the method that makes it fast: You used the wrong implementational backend.Pani
@Fred: First, if you use the same algorithm but a different loss, you will get a difference - so, no, they are not the same. Second, a probability is not simply "a number between zero and one", so if your suggestion does not "adhere to any justifiable probability model", then it is precisely NOT a probability (in a formal sense or any other sense). It is just a number between zero and one. However, you are also wrong that it does not adhere to any justifiable probability model - it does (youtube.com/watch?v=BfKanl1aSG0) - so it is true that it is a probability after all.Pani

If what you really want is a measure of confidence rather than actual probabilities, you can use the method LinearSVC.decision_function(). See the documentation.
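A quick sketch of that:

from sklearn.svm import LinearSVC

svm = LinearSVC().fit(X_train, y_train)
# signed distance to the separating hyperplane; larger magnitude = more confident
scores = svm.decision_function(X_test)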

Ic answered 12/6, 2018 at 13:43 Comment(0)

Just as an extension for binary classification with SVMs: you could also take a look at SGDClassifier, which fits a linear SVM via stochastic gradient descent by default. To get binary probability estimates you can train it with the modified Huber loss, in which case predict_proba is computed as

(clip(decision_function(X), -1, 1) + 1) / 2

An example would look like:

from sklearn.linear_model import SGDClassifier

# loss="modified_huber" enables predict_proba on this linear model
svm = SGDClassifier(loss="modified_huber")
svm.fit(X_train, y_train)
proba = svm.predict_proba(X_test)
Fogarty answered 29/4, 2022 at 9:23 Comment(0)
