Controlling the threshold in Logistic Regression in Scikit Learn

I am using the LogisticRegression() method in scikit-learn on a highly unbalanced data set. I have even set the class_weight parameter to auto.

I know that in logistic regression it should be possible to find out the threshold value used for a particular pair of classes.

Is it possible to find out the threshold value for each of the One-vs-All classifiers that LogisticRegression() builds?

I did not find anything in the documentation page.

Does it by default apply the 0.5 value as threshold for all the classes regardless of the parameter values?

Interlinear answered 25/2, 2015 at 10:11 Comment(1)
Well, since LR is a probabilistic classifier, that is, it returns probability of a class, it makes sense to use 0.5 as a threshold.Colophon

Yes, scikit-learn uses a threshold of 0.5 for binary classification: predict() returns the positive class when its predicted probability is above 0.5. I am going to build on some of the answers already posted with two options to check this:

One simple option is to extract the probabilities of each classification using the output from model.predict_proba(test_x) segment of the code below along with class predictions (output from model.predict(test_x) segment of code below). Then, append class predictions and their probabilities to your test dataframe as a check.

As another option, one can graphically view precision vs. recall at various thresholds using the following code.

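# Predict test_y values and probabilities based on the fitted logistic regression model

import matplotlib.pyplot as plt
from sklearn import metrics
from sklearn.metrics import precision_recall_curve

# model is the fitted LogisticRegression; test_x and test_y are the held-out data
pred_y = model.predict(test_x)
probs_y = model.predict_proba(test_x)
  # probs_y is a 2-D array of probability of being labeled as 0
  # (first column of array) vs 1 (2nd column in array)

precision, recall, thresholds = precision_recall_curve(test_y, probs_y[:, 1])
   # probs_y[:, 1] retrieves the probability of being 1 (second column of probs_y)
pr_auc = metrics.auc(recall, precision)

plt.title("Precision-Recall vs Threshold Chart")
# precision and recall have one more element than thresholds, so drop the last one
plt.plot(thresholds, precision[:-1], "b--", label="Precision")
plt.plot(thresholds, recall[:-1], "r--", label="Recall")
plt.ylabel("Precision, Recall")
plt.xlabel("Threshold")
plt.legend(loc="lower left")
plt.show()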
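
The first option above (comparing class predictions with their probabilities) could be sketched roughly as follows. This is only an illustration on synthetic data: the dataset, the column names and the `check` dataframe are assumptions made for the example, not part of the original answer.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced binary data (an assumption for illustration only).
X, y = make_classification(n_samples=1000, n_features=10,
                           weights=[0.9, 0.1], random_state=0)
train_x, test_x, train_y, test_y = train_test_split(
    X, y, stratify=y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(train_x, train_y)

# Put actual labels, predicted labels and positive-class probabilities side by side.
check = pd.DataFrame({
    "actual": test_y,
    "predicted": model.predict(test_x),
    "prob_1": model.predict_proba(test_x)[:, 1],
})

# Label implied by a 0.5 cut-off on the positive-class probability.
check["prob_above_half"] = (check["prob_1"] > 0.5).astype(int)
agrees = bool((check["predicted"] == check["prob_above_half"]).all())

print(check.head())
print("predict() matches a 0.5 threshold:", agrees)
```

If the two columns agree on every row, predict() is behaving like a 0.5 cut-off on predict_proba() for this binary problem.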
Abbe answered 30/12, 2018 at 2:2 Comment(3)
Instantiate a logistic regression in sklearn, make sure you have train and test datasets partitioned (with the test set labeled as test_x, test_y), and fit the logistic regression model on the training data; the rest should follow from here.Abbe
You can save a bit of coding by using sklearn.metrics.plot_precision_recall_curve.Mcglone
Function plot_precision_recall_curve is deprecated in 1.0 and will be removed in 1.2.Tsingyuan

There is a little trick that I use: instead of using model.predict(test_data), use model.predict_proba(test_data). Then use a range of threshold values to analyze the effect on the predictions:

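import pandas as pd
from sklearn import metrics

# model is a fitted classifier; x_test and Y_test are the held-out features and labels
pred_proba_df = pd.DataFrame(model.predict_proba(x_test))
threshold_list = [0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5,
                  0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95, 0.99]
for i in threshold_list:
    print('\n******** For i = {} ******'.format(i))
    # 1 where the probability exceeds the threshold, else 0 (applied to every column)
    Y_test_pred = (pred_proba_df > i).astype(int)
    # Column 1 holds the thresholded positive-class predictions
    test_accuracy = metrics.accuracy_score(Y_test.values.ravel(),
                                           Y_test_pred.iloc[:, 1].values)
    print('Our testing accuracy is {}'.format(test_accuracy))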



Lucialucian answered 15/5, 2018 at 8:15 Comment(2)
I like this answer. What I am struggling to understand is how would one tie this into GridSearchCV? When I am running GridSearchCV, I am finding the best model among many. Presumably, the default threshold for Logistic Regression of 0.5 is being used internally and so then how would I override this default threshold when scoring takes place to pick the best model.Jadejaded
@Jadejaded you can use threshold-independent metric like roc_auc to find the best parameters through GridSearch, and then set the threshold manually after having identified the best parametersJutta
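
A minimal sketch of the approach suggested in the comment above: tune hyperparameters with a threshold-independent score, then pick the threshold afterwards on held-out data. The synthetic dataset, the parameter grid and the choice of "maximize F1" as the threshold criterion are all assumptions for illustration; any other criterion could be substituted.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic, imbalanced binary data (an assumption for illustration only).
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.9, 0.1], random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, stratify=y, random_state=0)

# Step 1: tune hyperparameters with a threshold-independent score (ROC AUC).
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},  # hypothetical grid
    scoring="roc_auc",
    cv=5,
)
search.fit(X_train, y_train)

# Step 2: choose a threshold on held-out data; here, the one maximizing F1.
val_probs = search.predict_proba(X_val)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_val, val_probs)
# precision/recall have one more entry than thresholds, so drop the last one.
denom = np.clip(precision[:-1] + recall[:-1], 1e-12, None)
f1 = 2 * precision[:-1] * recall[:-1] / denom
best_threshold = thresholds[np.argmax(f1)]

# Step 3: apply the chosen threshold instead of calling predict().
val_pred = (val_probs >= best_threshold).astype(int)

print("best C:", search.best_params_["C"])
print("chosen threshold:", best_threshold)
```

Note that a threshold chosen on a given validation set will look optimistic when evaluated on that same set, so the final performance estimate is better taken from separate data.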

Logistic regression chooses the class that has the highest probability. In the case of 2 classes, the threshold is 0.5: if P(Y=0) > 0.5 then obviously P(Y=0) > P(Y=1). The same holds for the multiclass setting: again, it chooses the class with the highest probability (see e.g. Ng's lectures, the bottom lines).

Introducing special thresholds only affects the proportion of false positives/false negatives (and thus the precision/recall tradeoff); the threshold is not a parameter of the LR model. See also the similar question.
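
The "highest probability wins" behaviour in the multiclass case can be checked with a small sketch like the one below. It uses the iris dataset and default LogisticRegression settings purely as an assumed example; the default multiclass scheme depends on the scikit-learn version and solver, so this is not a statement about One-vs-All specifically.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Three-class example data (an assumption for illustration only).
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)

# One probability column per class, in the order given by model.classes_.
probs = model.predict_proba(X)

# Pick the class with the highest probability for each sample.
argmax_labels = model.classes_[np.argmax(probs, axis=1)]

# Compare with what predict() returns.
same = bool(np.array_equal(argmax_labels, model.predict(X)))
print("predict() equals argmax of predict_proba():", same)
```

So there is no separate per-class threshold stored in the model; the prediction is simply the argmax over the class probabilities.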

Jayson answered 25/2, 2015 at 16:48 Comment(0)

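We can use a wrapper as follows:

from sklearn.linear_model import LogisticRegression

# X, y are the training features and binary labels
model = LogisticRegression().fit(X, y)

def custom_predict(X, threshold):
    # The second column of predict_proba holds the positive-class probability
    probs = model.predict_proba(X)
    return (probs[:, 1] > threshold).astype(int)

new_preds = custom_predict(X=X, threshold=0.4)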
Prelusive answered 9/11, 2022 at 10:57 Comment(1)
Neat and clean !Moderation

If using @jazib jamil's and @Halee's solution in Pandas version 0.23.0+, replace .as_matrix() with .values (documentation).

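import pandas as pd
from sklearn import metrics

# model is a fitted classifier; x_test and Y_test are the held-out features and labels
pred_proba_df = pd.DataFrame(model.predict_proba(x_test))
threshold_list = [0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5,
                  0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95, 0.99]
for i in threshold_list:
    print('\n******** For i = {} ******'.format(i))
    # 1 where the probability exceeds the threshold, else 0 (applied to every column)
    Y_test_pred = (pred_proba_df > i).astype(int)
    # .values replaces the removed .as_matrix(); column 1 is the positive class
    test_accuracy = metrics.accuracy_score(Y_test.values.ravel(),
                                           Y_test_pred.iloc[:, 1].values)
    print('Our testing accuracy is {}'.format(test_accuracy))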

Selfrestraint answered 23/6, 2023 at 17:45 Comment(0)

For probabilistic classifiers such as logistic regression, the optimal Bayes estimator is

ŷ = argmax_y P(Y = y | X)

that is, the predicted class with the highest probability. For binary classification, this is equivalent to using probability 0.5 as a threshold.

As Nikita Astrakhantsev says, the threshold can be adjusted to control false positives and false negatives, and therefore sensitivity and specificity, depending on the business requirements of the model (which are determined outside of statistics). For highly unbalanced datasets, logistic regression may also benefit from oversampling.
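
As a rough sketch of adjusting the threshold to meet a sensitivity requirement, one could read the threshold off the ROC curve. The synthetic data and the 90% target sensitivity below are hypothetical choices for illustration; the real target would come from the business requirements.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve
from sklearn.model_selection import train_test_split

# Synthetic, highly imbalanced binary data (an assumption for illustration only).
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, stratify=y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
val_probs = model.predict_proba(X_val)[:, 1]

# roc_curve returns thresholds in decreasing order, with tpr non-decreasing.
fpr, tpr, thresholds = roc_curve(y_val, val_probs)

target_sensitivity = 0.90  # hypothetical requirement
# First (i.e. highest) threshold whose sensitivity reaches the target.
idx = int(np.argmax(tpr >= target_sensitivity))
chosen_threshold = thresholds[idx]

# Apply the threshold; roc_curve treats scores >= threshold as positive.
val_pred = (val_probs >= chosen_threshold).astype(int)
sensitivity = val_pred[y_val == 1].mean()
specificity = 1.0 - fpr[idx]

print("threshold:", chosen_threshold)
print("sensitivity:", sensitivity, "specificity:", specificity)
```

Lowering the threshold this way trades specificity for sensitivity; the fitted model itself is unchanged.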

Dynamo answered 9/10, 2023 at 15:49 Comment(0)
