Scikit-learn classifiers generally choose the predicted class by taking the argmax of the scores/probabilities (see LogisticRegression and DecisionTreeClassifier). For binary classification, the argmax is equivalent to applying a 0.5 threshold to the positive-class probability, so varying that threshold changes how confident the model must be before assigning a sample to the positive class.
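A quick sketch (synthetic data, LogisticRegression only as an example) shows the equivalence between the default predict and a manual 0.5 threshold on the positive-class probability:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# toy binary problem, only for illustration
X, y = make_classification(n_samples=500, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X, y)

# default predictions (argmax over the two class probabilities)
pred_default = clf.predict(X)
# manual 0.5 threshold on the positive-class probability
pred_manual = (clf.predict_proba(X)[:, 1] > 0.5).astype(int)

print(np.array_equal(pred_default, pred_manual))  # True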
You can tune/change the threshold according to your goals (e.g. maximizing precision or recall). The concept is clearly explained in this post. It's possible to automate finding the optimal threshold of any classifier by retrieving the predicted probabilities and optimizing a metric of interest on a validation set, as done by the ThresholdClassifier below:
import numpy as np
from sklearn.metrics import fbeta_score
from sklearn.model_selection import train_test_split
from sklearn.base import clone, BaseEstimator, ClassifierMixin


class ThresholdClassifier(BaseEstimator, ClassifierMixin):

    def __init__(self, estimator, refit=True, val_size=0.3):
        self.estimator = estimator
        self.refit = refit
        self.val_size = val_size

    def fit(self, X, y):

        def scoring(th, y, prob):
            # score a candidate threshold (negated fbeta_score, so lower is better)
            pred = (prob > th).astype(int)
            return 0 if not pred.any() else \
                -fbeta_score(y, pred, beta=0.1)

        # hold out a stratified validation set for threshold search
        X_train, X_val, y_train, y_val = train_test_split(
            X, y, stratify=y, test_size=self.val_size,
            shuffle=True, random_state=1234
        )

        self.estimator_ = clone(self.estimator)
        self.estimator_.fit(X_train, y_train)

        # positive-class probabilities on the validation set
        prob_val = self.estimator_.predict_proba(X_val)[:, 1]
        thresholds = np.linspace(0, 1, 200)[1:-1]
        scores = [scoring(th, y_val, prob_val)
                  for th in thresholds]
        self.score_ = np.min(scores)
        self.th_ = thresholds[np.argmin(scores)]

        # optionally refit the underlying estimator on all the received data
        if self.refit:
            self.estimator_.fit(X, y)
        if hasattr(self.estimator_, 'classes_'):
            self.classes_ = self.estimator_.classes_

        return self

    def predict(self, X):
        proba = self.estimator_.predict_proba(X)[:, 1]
        return (proba > self.th_).astype(int)

    def predict_proba(self, X):
        return self.estimator_.predict_proba(X)
When calling fit:

- a validation set (X_val and y_val) is randomly generated from the received data;
- the estimator is fitted on X_train and y_train;
- probabilities (prob_val) for the positive class are retrieved on X_val;
- an optimal threshold value is found on X_val by optimizing a metric of choice (fbeta_score in our case).

When calling predict, probabilities for the positive class are generated and cast into binary classes using the optimal threshold value found.
from sklearn.ensemble import RandomForestClassifier

model = ThresholdClassifier(RandomForestClassifier()).fit(X_train, y_train)
pred_clas = model.predict(X_test)
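After fitting, the chosen cut-off and the validation score that selected it are stored on the instance (the th_ and score_ attributes set in fit above), so they can be inspected, e.g.:

print(model.th_)      # tuned decision threshold
print(-model.score_)  # best fbeta_score reached on the validation split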
ThresholdClassifier can be used with any sklearn classifier that produces probabilities. It can be easily customized according to different needs. It's very useful in conjunction with GridSearchCV/RandomizedSearchCV to combine the hyperparameter search with the tuning of the classification threshold.
from scipy import stats
from sklearn.model_selection import RandomizedSearchCV

model = RandomizedSearchCV(
    ThresholdClassifier(RandomForestClassifier()),
    # n_estimators belongs to the wrapped estimator, hence the estimator__ prefix
    dict(estimator__n_estimators=stats.randint(50, 300)),
    n_iter=20, random_state=1234,
    cv=5, n_jobs=-1,
).fit(X_train, y_train)
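As with any search, the best sampled parameters and the refitted best ThresholdClassifier (carrying its own tuned threshold) can then be retrieved, for example:

print(model.best_params_)         # e.g. {'estimator__n_estimators': ...}
print(model.best_estimator_.th_)  # threshold tuned by the winning ThresholdClassifier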