Controlling the threshold in Logistic Regression in Scikit Learn

I am using LogisticRegression() from scikit-learn on a highly imbalanced data set, and I have even set the class_weight parameter to auto.

I know that in logistic regression it should be possible to know the threshold value for a particular pair of classes.

Is it possible to know the threshold value for each of the one-vs-all classifiers that LogisticRegression() builds?

I did not find anything about this in the documentation.

Does it apply a threshold of 0.5 by default for all classes, regardless of the parameter values?

Interlinear answered 25/2, 2015 at 10:11 Comment(1)
Well, since LR is a probabilistic classifier, that is, it returns the probability of a class, it makes sense to use 0.5 as a threshold.Colophon

Yes, scikit-learn uses a threshold of P >= 0.5 for binary classification. I am going to build on some of the answers already posted with two options to check this:

One simple option is to extract the predicted probabilities from model.predict_proba(test_x) along with the class predictions from model.predict(test_x) (both appear in the code below). Then append the class predictions and their probabilities to your test dataframe as a check; a small sketch of this follows.
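
A rough sketch of that first option, assuming test_x is a pandas DataFrame and reusing the fitted model log from the code further down (the new column names are just illustrative):

# Append hard predictions and class-1 probabilities to the test frame as a check
check_df = test_x.copy()
check_df['predicted_class'] = log.predict(test_x)
check_df['prob_class_1'] = log.predict_proba(test_x)[:, 1]
print(check_df.head())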

As another option, one can graphically view precision vs. recall at various thresholds using the following code.

import matplotlib.pyplot as plt
from sklearn.metrics import auc, precision_recall_curve

# Predict test_y values and probabilities based on the fitted logistic
# regression model (assumes log = LogisticRegression().fit(train_x, train_y))
pred_y = log.predict(test_x)

# probs_y is a 2-D array: probability of being labeled 0 (first column)
# vs 1 (second column)
probs_y = log.predict_proba(test_x)

# Retrieve the probability of being 1 (second column of probs_y)
precision, recall, thresholds = precision_recall_curve(test_y, probs_y[:, 1])
pr_auc = auc(recall, precision)

plt.title("Precision-Recall vs Threshold Chart")
plt.plot(thresholds, precision[:-1], "b--", label="Precision")
plt.plot(thresholds, recall[:-1], "r--", label="Recall")
plt.ylabel("Precision, Recall")
plt.xlabel("Threshold")
plt.legend(loc="lower left")
plt.ylim([0, 1])
Abbe answered 30/12, 2018 at 2:2 Comment(3)
Instantiate a logistic regression in sklearn, make sure you have train and test datasets partitioned and labeled as test_x, test_y, then run (fit) the logistic regression model on this data; the rest should follow from here.Abbe
You can save a bit of coding by using sklearn.metrics.plot_precision_recall_curve.Mcglone
Function plot_precision_recall_curve is deprecated in 1.0 and will be removed in 1.2.Tsingyuan
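
Since plot_precision_recall_curve is gone in newer releases, a roughly equivalent sketch with the PrecisionRecallDisplay API (scikit-learn >= 1.0, reusing the log, test_x and test_y names from the answer above) would be:

import matplotlib.pyplot as plt
from sklearn.metrics import PrecisionRecallDisplay

# Computes predict_proba internally and draws the precision-recall curve
PrecisionRecallDisplay.from_estimator(log, test_x, test_y)
plt.show()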

There is a little trick that I use: instead of model.predict(test_data), use model.predict_proba(test_data). Then use a range of threshold values to analyze the effect on the predictions:

import pandas as pd
from sklearn import metrics
from sklearn.metrics import confusion_matrix

pred_proba_df = pd.DataFrame(model.predict_proba(x_test))
threshold_list = [0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5,
                  0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95, 0.99]
for i in threshold_list:
    print('\n******** For i = {} ******'.format(i))
    # Label as 1 whenever the predicted probability exceeds the threshold i
    Y_test_pred = pred_proba_df.applymap(lambda x: 1 if x > i else 0)
    test_accuracy = metrics.accuracy_score(Y_test.values.reshape(Y_test.values.size, 1),
                                           Y_test_pred.iloc[:, 1].values.reshape(Y_test_pred.iloc[:, 1].values.size, 1))
    print('Our testing accuracy is {}'.format(test_accuracy))

    print(confusion_matrix(Y_test.values.reshape(Y_test.values.size, 1),
                           Y_test_pred.iloc[:, 1].values.reshape(Y_test_pred.iloc[:, 1].values.size, 1)))

Best!

Lucialucian answered 15/5, 2018 at 8:15 Comment(2)
I like this answer. What I am struggling to understand is how would one tie this into GridSearchCV? When I am running GridSearchCV, I am finding the best model among many. Presumably, the default threshold for Logistic Regression of 0.5 is being used internally and so then how would I override this default threshold when scoring takes place to pick the best model.Jadejaded
@Jadejaded you can use a threshold-independent metric like roc_auc to find the best parameters through GridSearch, and then set the threshold manually after having identified the best parametersJutta
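
A minimal sketch of that workflow, with a made-up parameter grid and X_train/y_train/X_test names; scoring='roc_auc' keeps model selection threshold-independent, and the cutoff is applied manually afterwards:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

param_grid = {'C': [0.01, 0.1, 1, 10]}
search = GridSearchCV(LogisticRegression(class_weight='balanced', max_iter=1000),
                      param_grid, scoring='roc_auc', cv=5)
search.fit(X_train, y_train)

# Apply a manually chosen threshold to the best model's probabilities
threshold = 0.3
y_pred = (search.best_estimator_.predict_proba(X_test)[:, 1] >= threshold).astype(int)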

Logistic regression chooses the class that has the biggest probability. In the case of 2 classes, the threshold is 0.5: if P(Y=0) > 0.5 then obviously P(Y=0) > P(Y=1). The same holds for the multiclass setting: again, it chooses the class with the biggest probability (see e.g. Ng's lectures, the bottom lines).

Introducing a custom threshold only affects the proportion of false positives/false negatives (and thus the precision/recall tradeoff), but it is not a parameter of the LR model. A quick way to verify this is sketched below. See also this similar question.
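
As a quick check (clf and X are placeholder names for any fitted LogisticRegression and feature matrix), predict is just the argmax over predict_proba, which in the binary case reduces to thresholding at 0.5:

import numpy as np

probs = clf.predict_proba(X)  # shape (n_samples, n_classes)

# Multiclass: predict picks the column with the highest probability
assert (clf.predict(X) == clf.classes_[np.argmax(probs, axis=1)]).all()

# Binary: the argmax over two columns is the same as a 0.5 cutoff on column 1
if probs.shape[1] == 2:
    assert (np.argmax(probs, axis=1) == (probs[:, 1] > 0.5)).all()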

Jayson answered 25/2, 2015 at 16:48 Comment(0)

We can use a wrapper as follows:

from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X, y)

def custom_predict(X, threshold):
    # Label as 1 whenever the probability of class 1 exceeds the threshold
    probs = model.predict_proba(X)
    return (probs[:, 1] > threshold).astype(int)

new_preds = custom_predict(X=X, threshold=0.4)
Prelusive answered 9/11, 2022 at 10:57 Comment(1)
Neat and clean!Moderation

If using @jazib jamil's and @Halee's solution in Pandas version 0.23.0+, replace .as_matrix() with .values (documentation).

pred_proba_df = pd.DataFrame(model.predict_proba(x_test))
threshold_list = [0.05,0.1,0.15,0.2,0.25,0.3,0.35,0.4,0.45,0.5,0.55,0.6,0.65,.7,.75,.8,.85,.9,.95,.99]
for i in threshold_list:
    print ('\n******** For i = {} ******'.format(i))
    Y_test_pred = pred_proba_df.applymap(lambda x: 1 if x>i else 0)
    test_accuracy = metrics.accuracy_score(Y_test.values.reshape(Y_test.values.size,1),
                                           Y_test_pred.iloc[:,1].values.reshape(Y_test_pred.iloc[:,1].values.size,1))
    print('Our testing accuracy is {}'.format(test_accuracy))

    print(confusion_matrix(Y_test.values.reshape(Y_test.values.size,1),
                           Y_test_pred.iloc[:,1].values.reshape(Y_test_pred.iloc[:,1].values.size,1)))
Selfrestraint answered 23/6, 2023 at 17:45 Comment(0)

For probabilistic classifiers such as logistic regression, the optimal Bayes estimator is

ŷ = argmax_y P(Y = y | X)

that is, the predicted class with the highest probability. For binary classification, this is equivalent to using probability 0.5 as a threshold.

As Nikita Astrakhantsev says, the threshold can be adjusted to control false positives and false negatives, and therefore sensitivity/specificity, depending on the business requirements of the model (which are determined outside of statistics). For highly imbalanced datasets, logistic regression may benefit from oversampling, as sketched below.
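
For example, a simple random-oversampling sketch with sklearn.utils.resample (train_df and its 'label' column are hypothetical names; class_weight='balanced' on LogisticRegression is the built-in alternative):

import pandas as pd
from sklearn.utils import resample

# Split the training frame by class (assumes a binary 'label' column)
majority = train_df[train_df['label'] == 0]
minority = train_df[train_df['label'] == 1]

# Randomly duplicate minority rows until both classes have equal counts
minority_upsampled = resample(minority, replace=True,
                              n_samples=len(majority), random_state=42)
balanced_train_df = pd.concat([majority, minority_upsampled])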

Dynamo answered 9/10, 2023 at 15:49 Comment(0)
