Nested cross-validation example on Scikit-learn

I'm trying to work my head around the example of Nested vs. Non-Nested CV in Sklearn. I checked multiple answers but I am still confused on the example. To my knowledge, a nested CV aims to use a different subset of data to select the best parameters of a classifier (e.g. C in SVM) and validate its performance. Therefore, from a dataset X, the outer 10-folds CV (for simplicity n=10) creates 10 training sets and 10 test sets:

(Tr0, Te0), ..., (Tr9, Te9)

Then, the inner 10-CV splits EACH outer training set into 10 training and 10 test sets:

From Tr0: (Tr0_0, Te0_0), ..., (Tr0_9, Te0_9)
...
From Tr9: (Tr9_0, Te9_0), ..., (Tr9_9, Te9_9)

Now, using the inner CV, we can find the best values of C for every single outer Training set. This is done by testing all the possible values of C with the inner CV. The value providing the highest performance (e.g. accuracy) is chosen for that specific outer Training set. Finally, having discovered the best C values for every outer Training set, we can calculate an unbiased accuracy using the outer Test sets. With this procedure, the samples used to identify the best parameter (i.e. C) are not used to compute the performance of the classifier, hence we have a totally unbiased validation.
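
If I were to write this procedure out by hand, it would look roughly like this (the dataset, grid and number of folds below are just placeholders for illustration, not the values from the example):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
param_grid = {"C": [0.1, 1, 10, 100]}

outer_cv = KFold(n_splits=10, shuffle=True, random_state=0)
outer_scores = []

for train_idx, test_idx in outer_cv.split(X):   # outer split (Tr_k, Te_k)
    X_tr, y_tr = X[train_idx], y[train_idx]
    X_te, y_te = X[test_idx], y[test_idx]

    # inner CV: choose the best C using only the outer training set Tr_k
    inner_cv = KFold(n_splits=10, shuffle=True, random_state=0)
    search = GridSearchCV(SVC(), param_grid, cv=inner_cv)
    search.fit(X_tr, y_tr)

    # score the refit best model on the untouched outer test set Te_k
    outer_scores.append(search.score(X_te, y_te))

print(np.mean(outer_scores))   # estimate of generalization accuracy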

The example provided in the Sklearn page is:

inner_cv = KFold(n_splits=4, shuffle=True, random_state=i)
outer_cv = KFold(n_splits=4, shuffle=True, random_state=i)

# Non_nested parameter search and scoring
clf = GridSearchCV(estimator=svm, param_grid=p_grid, cv=inner_cv)
clf.fit(X_iris, y_iris)
non_nested_scores[i] = clf.best_score_

# Nested CV with parameter optimization
nested_score = cross_val_score(clf, X=X_iris, y=y_iris, cv=outer_cv)
nested_scores[i] = nested_score.mean()
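
(For context, this snippet runs inside a loop over trials indexed by i, and the other names come from the setup earlier on the docs page; roughly something like the following, although the exact grid values may differ:)

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC

NUM_TRIALS = 30
X_iris, y_iris = load_iris(return_X_y=True)
p_grid = {"C": [1, 10, 100], "gamma": [0.01, 0.1]}
svm = SVC(kernel="rbf")
non_nested_scores = np.zeros(NUM_TRIALS)
nested_scores = np.zeros(NUM_TRIALS)
# for i in range(NUM_TRIALS):  the two blocks above are executed inside this loop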

From what I understand, the code simply calculates the scores using two different cross-validations (i.e. different splits into training and test sets). Both of them use the entire dataset. GridSearchCV identifies the best parameters using one of the two CVs, then cross_val_score calculates, with the second CV, the performance when using the best parameters.

Am I interpreting a Nested CV in the wrong way? What am I missing from the example?

Fatback asked 6/10, 2017 at 10:18
You can take a look at my answer here to get a step-by-step analysis. – Soluble
I got really confused by the names and the order, as I expected outer_cv to be used "before" inner_cv. So, the nesting occurs because we pass clf, which is an instance of GridSearchCV, to cross_val_score? Hence, in simple words, cross_val_score first splits X into X_tr and X_te, then X_tr is passed to clf which, because it is an instance of GridSearchCV, will further split X_tr into X_tr_tr and X_tr_te? – Fatback
Yes, you are correct. One X_tr is split into X_tr_tr and X_tr_te for the number of folds defined in inner_cv. Then, according to outer_cv, some other part of the data becomes X_tr, which is again sent to inner_cv. Hope it makes sense. – Soluble
Yes, thanks Vivek. So, if we directly pass clf = SVM() to cross_val_score we obtain a "traditional" cross-validation. – Fatback
Yes, the whole nested cross-validation takes place because of the cross-validation done inside GridSearchCV. If you use a simple estimator, this becomes simple cross-validation. – Soluble
Does this answer your question? scikit-learn GridSearchCV with multiple repetitions – Hoyos
Does this answer your question? Confusing example of nested cross validation in scikit-learn – Dolora

In the Sklearn example, there are two cross-validation loops:

  1. Inner CV (GridSearchCV): this is where the model's hyperparameters are tuned. GridSearchCV performs an exhaustive search over the specified parameter values for an estimator, in this case an SVM with the parameters defined in p_grid. The search uses the inner_cv splits to find the best hyperparameters (here, C for the SVM).
  2. Outer CV (cross_val_score): this is the outer loop that evaluates the model's performance. cross_val_score splits the entire dataset according to outer_cv; for each outer split it fits the whole GridSearchCV (i.e. re-runs the inner search) on the outer training data and then scores the resulting best model on the outer test data.

So, you're correct in observing that the example uses two different cross-validation splits, but the crucial aspect is that the hyperparameters are tuned using one split (inner CV) and the model performance is evaluated using another split (outer CV), which ensures a more reliable estimate of the model's performance.
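
A rough sketch of that distinction (the grid values here are invented for illustration): passing a plain estimator to cross_val_score gives ordinary cross-validation, while passing a GridSearchCV gives the nested scheme.

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
inner_cv = KFold(n_splits=4, shuffle=True, random_state=0)
outer_cv = KFold(n_splits=4, shuffle=True, random_state=0)

# ordinary ("flat") CV: C is fixed, nothing is tuned
flat_scores = cross_val_score(SVC(C=1), X, y, cv=outer_cv)

# nested CV: every outer training fold runs its own grid search over inner_cv
tuned = GridSearchCV(SVC(), {"C": [1, 10, 100]}, cv=inner_cv)
nested = cross_val_score(tuned, X, y, cv=outer_cv)

print(flat_scores.mean(), nested.mean())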

Voronezh answered 11/4, 2024 at 5:57

When I first read the documentation and most of the answers, I got very confused. But now I think I get it, so I will try to explain what I get, hope it helps :)

inner_cv = KFold(n_splits=4, shuffle=True, random_state=i)
outer_cv = KFold(n_splits=4, shuffle=True, random_state=i)

This part is quite straightforward. In the lines above, we are creating two cross validators called inner_cv and outer_cv.

# Non_nested parameter search and scoring
clf = GridSearchCV(estimator=svm, param_grid=p_grid, cv=inner_cv)
clf.fit(X_iris, y_iris)
non_nested_scores[i] = clf.best_score_

This part is quite straightforward again. We ignore the outer_cv cross-validator and only use inner_cv: GridSearchCV searches for the optimal hyperparameters on the whole dataset, using the splits defined by inner_cv. The first line only creates the search object; the actual work happens in the second line, where clf.fit(X_iris, y_iris) splits the whole dataset according to inner_cv, evaluates every parameter combination, and then refits the best one on all of the data. In the last step we record clf.best_score_, which is the best mean cross-validated score found during that search (the score of the winning parameters on the same folds that selected them).
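
A quick way to convince yourself of what best_score_ is: it is nothing more than the largest mean cross-validated score over the grid, measured on the very inner_cv folds that selected the winning parameters. A small sanity check, assuming the same svm, p_grid, inner_cv and iris data as above and the default single-metric scoring:

import numpy as np

clf = GridSearchCV(estimator=svm, param_grid=p_grid, cv=inner_cv)
clf.fit(X_iris, y_iris)

# best_score_ equals the maximum mean test score across the whole grid,
# computed on the same folds that were used to choose that combination
assert np.isclose(clf.best_score_, clf.cv_results_["mean_test_score"].max())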

This is non-nested cross-validation. Since both the hyperparameter tuning and the reported score come from the same splits of the same data, the score is optimistically biased: the data used to choose the hyperparameters is also the data used to report how well they work (a form of data leakage). So, what the page suggests instead is the following:

# Nested CV with parameter optimization
clf = GridSearchCV(estimator=svm, param_grid=p_grid, cv=inner_cv)
nested_score = cross_val_score(clf, X=X_iris, y=y_iris, cv=outer_cv)
nested_scores[i] = nested_score.mean()

In the first line of code here, we are instantiating the GridSearchCV object with the inner_cv cross-validator (but not fitting it). In the second line, a lot happens. First, cross_val_score uses outer_cv to break the initial data into four splits; let's call them (x_tr_0, x_ts_0), (x_tr_1, x_ts_1), (x_tr_2, x_ts_2), (x_tr_3, x_ts_3), where each split has its own training and test data. In each fold, only the training part is passed to the GridSearchCV estimator, so the inner_cv cross-validator works on the training data coming from the outer_cv cross-validator. Inside GridSearchCV, x_tr_0 is in turn broken down into the inner splits x_tr_0_0, x_tr_0_1, x_tr_0_2, x_tr_0_3, and it is on these "inner" splits that the hyperparameter tuning is done.

Once the optimal hyperparameters are calculated, we use these in the "outer" split for calculating the model performance. In this case, the model evaluation is done on the outer_cv split test data (which is unseen by the hyperparameter tuning "inner" split). This ensures that the performance values we are getting are more generalizable and there is no data leakage.
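
If you want to see this fold-by-fold tuning happen, cross_validate with return_estimator=True keeps the fitted GridSearchCV from every outer fold, so you can inspect which hyperparameters each outer training split ended up choosing (a sketch, assuming the same clf, outer_cv and iris data as above):

from sklearn.model_selection import cross_validate

res = cross_validate(clf, X_iris, y_iris, cv=outer_cv, return_estimator=True)

# one fitted GridSearchCV per outer fold; the chosen parameters can differ between folds
for fold, est in enumerate(res["estimator"]):
    print(fold, est.best_params_, res["test_score"][fold])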

Dolora answered 28/5, 2024 at 12:18

Your understanding of nested cross-validation (CV) is correct, but it seems there might be a misunderstanding in how the example code is structured.

In the example provided by scikit-learn, nested cross-validation is indeed being utilized. Let's break down the code to understand how it fits into the concept of nested CV:

Initialization of CV Splits:

inner_cv = KFold(n_splits=4, shuffle=True, random_state=i)
outer_cv = KFold(n_splits=4, shuffle=True, random_state=i)

Two sets of cross-validation iterators are initialized: inner_cv and outer_cv. Both use 4-fold cross-validation, but they serve different purposes.

Non-Nested Parameter Search and Scoring:

clf = GridSearchCV(estimator=svm, param_grid=p_grid, cv=inner_cv)
clf.fit(X_iris, y_iris)
non_nested_scores[i] = clf.best_score_

In this part, GridSearchCV is used with the inner cross-validation (inner_cv). It searches over the parameter grid p_grid for the best hyperparameters using cross-validation on (X_iris, y_iris). The best parameter combination is then refit on the entire dataset, and the recorded value, clf.best_score_, is the best mean cross-validated score (e.g., accuracy) obtained during the search.

Nested CV with Parameter Optimization:

nested_score = cross_val_score(clf, X=X_iris, y=y_iris, cv=outer_cv)
nested_scores[i] = nested_score.mean()

Here, cross_val_score is used with the outer cross-validation (outer_cv). For each outer fold, clf (the GridSearchCV object) is cloned and refit on that fold's training data, which means the inner grid search is repeated from scratch on that training data; the resulting best model is then scored on the fold's test data. Finally, the mean of the scores obtained across the outer folds is calculated and recorded.

So, in summary, the nested cross-validation in this example involves:

  1. Inner CV: used for hyperparameter tuning (GridSearchCV) to find the best model.
  2. Outer CV: used to evaluate the performance of the best model found by the inner CV.

This structure ensures that the evaluation of the model's performance is unbiased, as the hyperparameters are optimized on a separate set of data within each fold of the outer CV.

Your interpretation seems to be correct, but the example does indeed showcase the use of nested cross-validation. Each fold in the outer CV is used to assess the generalization performance of the model trained on the corresponding training data, and the hyperparameters are selected using the inner CV to avoid data leakage and overfitting.

Brittaney answered 28/3, 2024 at 7:26
