GridSearchCV: How to specify test set?
I have a question regarding GridSearchCV:

by using this:

gs_clf = GridSearchCV(pipeline, parameters, n_jobs=-1, cv=6, scoring="f1")

I specify that k-fold cross-validation should be used with 6 folds, right?

So that means my corpus is split into a training set and a test set 6 times.

Doesn't that mean that for the GridSearchCV I need to use my entire corpus, like so:

gs_clf = gs_clf.fit(corpus.data, corpus.target)

And if so, how would I then get my training set from there to use for the predict method?

predictions = gs_clf.predict(??)

I have seen code where the corpus is split into a test set and a training set using train_test_split, and then X_train and Y_train are passed to gs_clf.fit.

But that doesn't make sense to me: if I split the corpus beforehand, why use cross-validation again in GridSearchCV?

Thanks for some clarification!!

Pathological answered 11/11, 2016 at 10:37 Comment(1)
This is a great question; thanks for putting it out there! – Jeweljeweler
  1. GridSearchCV is not designed to measure the performance of your model, but to optimize the hyper-parameters of your classifier during training. When you call gs_clf.fit, you are actually trying different models on your entire data (but on different folds) in pursuit of the best hyper-parameters. For example, if you have n different values of C and m different values of gamma for an SVM model, then you have n x m models, and you search (grid search) through them to see which one works best on your data.
  2. Once you have found the best model via gs_clf.best_params_, you can use your test data to get its actual performance (e.g., accuracy, precision, ...).
  3. Of course, only then is it time to test the model. Your test data must not overlap with the data you trained the model on. For instance, you should have something like corpus.train and corpus.test, and you should reserve corpus.test for the last round, when you are done with training and only want to test the final model (see the sketch below).

As we all know, any use of test data in the process of training the model (where training data should be used) or tuning the hyper-parameters (where the validation data should be used) is considered cheating and results in an unrealistically optimistic performance estimate.
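Putting the three points above together, a minimal sketch of the workflow could look like this (reusing pipeline, parameters and corpus from the question; the test_size and random_state values are arbitrary assumptions):

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import classification_report

# Hold out a test set that GridSearchCV never sees.
X_train, X_test, y_train, y_test = train_test_split(
    corpus.data, corpus.target, test_size=0.2, random_state=0)

# 6-fold cross-validation happens on the training portion only.
gs_clf = GridSearchCV(pipeline, parameters, n_jobs=-1, cv=6, scoring="f1")
gs_clf.fit(X_train, y_train)
print(gs_clf.best_params_)

# Final, unbiased evaluation on the held-out test set.
predictions = gs_clf.predict(X_test)
print(classification_report(y_test, predictions))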

Romans answered 5/12, 2018 at 21:3 Comment(0)

Cross-validation and a held-out test split are different ways to measure the algorithm's accuracy. Cross-validation does what you have said. Then, you must give all the data to the classifier. Splitting the data when using cross-validation simply makes no sense.

If you want to measure precision or recall with GridSearchCV, you must create a scorer and assign it to the scoring parameter of GridSearchCV, as in this example:

>>> from sklearn.metrics import fbeta_score, make_scorer
>>> from sklearn.model_selection import GridSearchCV
>>> from sklearn.svm import LinearSVC
>>> # F-beta with beta=2 weights recall more heavily than precision
>>> ftwo_scorer = make_scorer(fbeta_score, beta=2)
>>> grid = GridSearchCV(LinearSVC(), param_grid={'C': [1, 10]}, scoring=ftwo_scorer)
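(As a follow-up sketch, not part of the original example: fitting that grid and reading off the result could look like the lines below, assuming X_train and y_train already exist.)

>>> grid.fit(X_train, y_train)   # runs cross-validation for each value of C
>>> grid.best_params_            # the winning value from the grid, {'C': 1} or {'C': 10}
>>> grid.best_score_             # mean cross-validated F2 score of the best setting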
Leveloff answered 11/11, 2016 at 11:3 Comment(6)
But isn't that what's happening here? civisanalytics.com/blog/data-science/2016/01/06/… And if I give all the data to the classifier, what do I use for the predict function? – Pathological
Yes, it seems to be that, and it is used when they create a report showing accuracy, precision and recall. By the way, if they do the measurement this way, it makes no sense to use cross-validation. It is some kind of educational mixing, but with no sense in the real world. You must use the predict function when you have a new input without a label and you want to predict the label for this input, that is, when you want to make predictions. – Leveloff
Okay, so use the entire corpus and use predict for new data. And how would I get precision, accuracy and recall from the GridSearchCV? I can't tell how to do this from the sklearn documentation. – Pathological
You must create a scorer and assign it to the scoring parameter of GridSearchCV, as in the example in the answer above. – Leveloff
I am using the f1 parameter, as described here: scikit-learn.org/stable/modules/… But that's just to define the scoring, right? How do I get the actual values for accuracy, recall and precision? – Pathological
Basically, I would like a classification_report, but using the GridSearchCV and not the predictions afterwards. – Pathological
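For the last two comments, a hedged sketch: GridSearchCV also supports multi-metric scoring (scikit-learn 0.19 and later), which reports several cross-validated metrics without a separate predict step. The names pipeline, parameters, X_train and y_train are reused from above and assumed to exist:

from sklearn.model_selection import GridSearchCV

# Track several metrics at once; refit must then name the metric
# used to pick the final model.
gs_clf = GridSearchCV(pipeline, parameters, cv=6, n_jobs=-1,
                      scoring=["accuracy", "precision", "recall", "f1"],
                      refit="f1")
gs_clf.fit(X_train, y_train)

# Mean cross-validated scores of the best parameter setting.
best = gs_clf.best_index_
for metric in ["accuracy", "precision", "recall", "f1"]:
    print(metric, gs_clf.cv_results_["mean_test_" + metric][best])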
