I have a question regarding GridSearchCV. By using this:
gs_clf = GridSearchCV(pipeline, parameters, n_jobs=-1, cv=6, scoring="f1")
I specify that k-fold cross-validation should be used with 6 folds, right?
So that means that my corpus is split into a training set and a test set 6 times.
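To make sure I am describing the splitting correctly, here is a small sketch of what I understand by 6 folds. The 12-sample toy array is made up for illustration, and I use a plain KFold here; as far as I know, an integer cv with a classifier uses StratifiedKFold instead, but the idea of 6 train/held-out splits is the same.

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy stand-in for the corpus: 12 samples with one feature each.
X = np.arange(12).reshape(12, 1)

# 6 folds: each sample lands in the held-out part exactly once.
kf = KFold(n_splits=6)
splits = list(kf.split(X))

for fold, (train_idx, held_out_idx) in enumerate(splits):
    print(fold, "train:", train_idx, "held out:", held_out_idx)
```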
Doesn't that mean that for GridSearchCV I need to use my entire corpus, like so:
gs_clf = gs_clf.fit(corpus.data, corpus.target)
And if so, how would I then get my training set from there to use with the predict method?
predictions = gs_clf.predict(??)
I have seen code where the corpus is split into a test set and a training set using train_test_split, and then X_train and Y_train are passed to gs_clf.fit.
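Roughly, the code I have seen looks like the sketch below. The toy dataset from make_classification, the two-step pipeline, and the clf__C grid are placeholders I made up so the example runs on its own; they are not my real corpus or pipeline. I also left out n_jobs to keep it simple.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Placeholder data standing in for corpus.data / corpus.target.
X, Y = make_classification(n_samples=300, n_features=10, random_state=0)

# Hold out a test set before any parameter search happens.
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.25, random_state=0, stratify=Y
)

# Placeholder pipeline and parameter grid.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
parameters = {"clf__C": [0.1, 1.0, 10.0]}

# The 6-fold cross-validation runs only inside the training portion.
gs_clf = GridSearchCV(pipeline, parameters, cv=6, scoring="f1")
gs_clf = gs_clf.fit(X_train, Y_train)

# Predictions are then made on the held-out test set.
predictions = gs_clf.predict(X_test)
print(gs_clf.best_params_)
```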
But that doesn't make sense to me: if I split the corpus beforehand, why use cross-validation again in GridSearchCV?
Thanks for any clarification!