This post is about the differences between LogisticRegressionCV, GridSearchCV and cross_val_score. Consider the following setup:
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.model_selection import (train_test_split, GridSearchCV,
                                     StratifiedKFold, cross_val_score)
from sklearn.metrics import confusion_matrix

digits = load_digits()
X, y = digits.data, digits.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
In penalized logistic regression, we need to set the parameter C, which controls the regularization strength (in scikit-learn, C is the inverse of the regularization strength, so smaller C means stronger regularization). There are three ways in scikit-learn to find the best C by cross-validation.
LogisticRegressionCV
clf = LogisticRegressionCV(Cs=10, penalty="l1",
                           solver="saga", scoring="f1_macro")
clf.fit(X_train, y_train)
confusion_matrix(y_test, clf.predict(X_test))
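For reference, the fitted LogisticRegressionCV exposes the selected C and the cross-validation score grid; a quick look (using the clf fitted just above):
print(clf.C_)          # C selected by cross-validation, one entry per class
print(clf.scores_[0])  # (n_folds, n_Cs) grid of f1_macro scores for class 0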
Side note: the documentation states that saga and liblinear are the only solvers supporting the L1 penalty, and that saga is faster on large datasets. As far as I can tell from the LogisticRegression docs, warm starting has no effect with liblinear but works with the other solvers.
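As an aside on what warm starting does (a minimal sketch of my understanding, separate from the three approaches below): with warm_start=True, refitting the same estimator reuses the previous coefficients as the starting point, so a manual sweep over C looks like this; the max_iter value is only an assumption on my part to quiet convergence warnings.
clf = LogisticRegression(penalty="l1", solver="saga",
                         warm_start=True, max_iter=5000)
for C in np.logspace(-4, 4, 10):
    clf.set_params(C=C)
    clf.fit(X_train, y_train)  # starts from the previous fit's coefficients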
GridSearchCV
clf = LogisticRegression(penalty="l1", solver="saga", warm_start=True)
clf = GridSearchCV(clf, param_grid={"C": np.logspace(-4, 4, 10)},
                   scoring="f1_macro")
clf.fit(X_train, y_train)
confusion_matrix(y_test, clf.predict(X_test))
result = clf.cv_results_
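The fitted GridSearchCV also reports the winning parameter and keeps a refit estimator (again using the clf fitted above):
print(clf.best_params_, clf.best_score_)  # best C and its mean CV f1_macro
best_model = clf.best_estimator_          # refit on all of X_train, since refit=True by default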
cross_val_score
cv_scores = {}
for val in np.logspace(-4, 4, 10):
    clf = LogisticRegression(C=val, penalty="l1",
                             solver="saga", warm_start=True)
    cv_scores[val] = cross_val_score(clf, X_train, y_train,
                                     cv=StratifiedKFold(),
                                     scoring="f1_macro").mean()

clf = LogisticRegression(C=max(cv_scores, key=cv_scores.get),
                         penalty="l1", solver="saga", warm_start=True)
clf.fit(X_train, y_train)
confusion_matrix(y_test, clf.predict(X_test))
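One detail I am unsure about: the three variants above do not necessarily see the same folds, so their scores may not be directly comparable. A sketch of how I would pin the folds everywhere (the n_splits and random_state values are arbitrary assumptions on my part):
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
# then pass cv=skf to LogisticRegressionCV(...), GridSearchCV(...) and cross_val_score(...)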
Questions
- Have I performed cross-validation correctly in all three ways?
- Are the three ways equivalent? If not, can they be made equivalent by changing the code?
- Which way is best in terms of elegance, speed, or any other criterion? (In other words, why does scikit-learn provide three ways of doing cross-validation?)
Non-trivial answers to any one of these questions are welcome; I realize the questions are a bit long, but hopefully they add up to a good summary of hyperparameter selection in scikit-learn.