This post is about the differences between LogisticRegressionCV, GridSearchCV and cross_val_score. Consider the following setup:
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.model_selection import (train_test_split, GridSearchCV,
                                     StratifiedKFold, cross_val_score)
from sklearn.metrics import confusion_matrix

digits = load_digits()
X, y = digits.data, digits.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
In penalized logistic regression, we need to set the parameter C, which controls the regularization strength (in scikit-learn, C is the inverse of the regularization strength, so smaller C means stronger regularization). There are three ways in scikit-learn to find the best C by cross-validation.
LogisticRegressionCV
clf = LogisticRegressionCV(Cs=10, penalty="l1",
                           solver="saga", scoring="f1_macro")
clf.fit(X_train, y_train)
confusion_matrix(y_test, clf.predict(X_test))
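For reference, the fitted LogisticRegressionCV exposes the selected C and the cross-validation score grid; a quick look (using the clf fitted just above):
print(clf.C_)          # C selected by cross-validation, one entry per class
print(clf.scores_[0])  # (n_folds, n_Cs) grid of f1_macro scores for class 0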
Side note: the documentation states that saga and liblinear are the only solvers supporting the L1 penalty, and that saga is faster on large datasets. As far as I can tell from the LogisticRegression docs, warm starting has no effect with liblinear but works with the other solvers.
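As an aside on what warm starting does (a minimal sketch of my understanding, separate from the three approaches below): with warm_start=True, refitting the same estimator reuses the previous coefficients as the starting point, so a manual sweep over C looks like this; the max_iter value is only an assumption on my part to quiet convergence warnings.
clf = LogisticRegression(penalty="l1", solver="saga",
                         warm_start=True, max_iter=5000)
for C in np.logspace(-4, 4, 10):
    clf.set_params(C=C)
    clf.fit(X_train, y_train)  # starts from the previous fit's coefficients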
GridSearchCV
clf = LogisticRegression(penalty="l1", solver="saga", warm_start=True)
clf = GridSearchCV(clf, param_grid={"C": np.logspace(-4, 4, 10)},
                   scoring="f1_macro")
clf.fit(X_train, y_train)
confusion_matrix(y_test, clf.predict(X_test))
result = clf.cv_results_
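The fitted GridSearchCV also reports the winning parameter and keeps a refit estimator (again using the clf fitted above):
print(clf.best_params_, clf.best_score_)  # best C and its mean CV f1_macro
best_model = clf.best_estimator_          # refit on all of X_train, since refit=True by default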
cross_val_score
cv_scores = {}
for val in np.logspace(-4, 4, 10):
    clf = LogisticRegression(C=val, penalty="l1",
                             solver="saga", warm_start=True)
    cv_scores[val] = cross_val_score(clf, X_train, y_train,
                                     cv=StratifiedKFold(),
                                     scoring="f1_macro").mean()

clf = LogisticRegression(C=max(cv_scores, key=cv_scores.get),
                         penalty="l1", solver="saga", warm_start=True)
clf.fit(X_train, y_train)
confusion_matrix(y_test, clf.predict(X_test))
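One detail I am unsure about: the three variants above do not necessarily see the same folds, so their scores may not be directly comparable. A sketch of how I would pin the folds everywhere (the n_splits and random_state values are arbitrary assumptions on my part):
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
# then pass cv=skf to LogisticRegressionCV(...), GridSearchCV(...) and cross_val_score(...)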
Questions
- Have I performed cross-validation correctly in all three ways?
- Are the three ways equivalent? If not, can they be made equivalent by changing the code?
- Which way is best in terms of elegance, speed, or any other criterion? (In other words, why does scikit-learn provide three ways of doing cross-validation?)
Non-trivial answers to any one of these questions are welcome; I realize the questions are a bit long, but hopefully they add up to a good summary of hyperparameter selection in scikit-learn.