Like the traceback says: each step in your pipeline needs to have a fit() and transform() method (except the last, which just needs fit()). This is because a pipeline chains together transformations of your data at each step.
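To make the chaining concrete, here is a minimal pure-Python sketch of the idea. It is a toy, not scikit-learn's actual implementation, and MiniPipeline, AddOne and RememberInput are hypothetical names made up for illustration.

```python
class MiniPipeline:
    """Toy stand-in for a pipeline; not scikit-learn's implementation."""

    def __init__(self, steps):
        self.steps = steps  # list of (name, object) pairs

    def fit(self, X, y=None):
        # Every step except the last is fitted and then used to transform
        # the data that is handed to the next step.
        for name, step in self.steps[:-1]:
            if not (hasattr(step, "fit") and hasattr(step, "transform")):
                raise TypeError(
                    f"Intermediate step {name!r} must implement fit and transform"
                )
            step.fit(X, y)
            X = step.transform(X)
        # The final step only needs fit().
        self.steps[-1][1].fit(X, y)
        return self


class AddOne:
    """Hypothetical transformer: has both fit() and transform()."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [x + 1 for x in X]


class RememberInput:
    """Hypothetical final estimator: has fit() only."""

    def fit(self, X, y=None):
        self.seen_ = list(X)
        return self


final = RememberInput()
MiniPipeline([("add", AddOne()), ("final", final)]).fit([1, 2, 3])
print(final.seen_)  # [2, 3, 4]: the final step receives the transformed data

try:
    # An object without transform() cannot sit in the middle of the chain.
    MiniPipeline([("bad", RememberInput()), ("final", RememberInput())]).fit([1, 2, 3])
except TypeError as exc:
    error_message = str(exc)
    print(error_message)
```

The second call fails for the same reason your pipeline does: an intermediate step has no transform(), so there is nothing to hand to the next step.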
sel.transform(train_cv_x) is the result of a method call (a transformed array), not an estimator, so it doesn't meet this criterion.
In fact, based on what you're trying to do, it looks like you can leave this step out. Internally, ('sel', sel) already does this transformation; that's why it's included in the pipeline.
Secondly, ExtraTreesClassifier (the first step in your pipeline) doesn't have a transform() method either. You can verify that in the class docstring. Supervised learning models aren't made for transforming data; they're made for fitting on it and predicting based on that.
What type of classes are able to do transformations?
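One quick way to check whether a class can act as an intermediate step is to test it for a transform attribute. The sketch below assumes a recent scikit-learn release (very old releases shipped a deprecated transform() on tree ensembles). StandardScaler is not part of your question; it is included only as a familiar example of a transformer.

```python
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.preprocessing import StandardScaler

# StandardScaler is an illustrative addition, not taken from the question.
candidates = [ExtraTreesClassifier, SelectFromModel, StandardScaler]

# Record, per class, whether it exposes a transform() method.
has_transform = {cls.__name__: hasattr(cls, "transform") for cls in candidates}

for name, ok in has_transform.items():
    print(f"{name}: transform() available = {ok}")
```

Classes that report a transform() method (feature selectors, scalers and the like) can sit in the middle of a pipeline; a plain classifier belongs at the end.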
Without reading between the lines too much about what you're trying to do here, this would work for you:
- First split x and y using train_test_split. The test dataset produced by this is held out for final testing, and the train dataset will be further broken out into smaller train and validation sets within GridSearchCV's cross-validation.
- Build a pipeline that satisfies what your traceback is trying to tell you.
- Pass that pipeline to GridSearchCV, .fit() that grid search on X_train/y_train, then .score() it on X_test/y_test.
Roughly, that would look like this:
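    from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
    from sklearn.feature_selection import SelectFromModel
    from sklearn.model_selection import GridSearchCV, train_test_split
    from sklearn.pipeline import Pipeline

    # X and y are your full feature matrix and labels.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.33, random_state=444)

    # Feature selection is an intermediate step: SelectFromModel has transform().
    sel = SelectFromModel(ExtraTreesClassifier(n_estimators=10, random_state=444),
                          threshold='mean')
    # The classifier goes last, so it only needs fit().
    clf = RandomForestClassifier(n_estimators=5000, random_state=444)
    model = Pipeline([('sel', sel), ('clf', clf)])

    # 'auto' (an alias for 'sqrt' here) is rejected by recent scikit-learn releases.
    params = {'clf__max_features': ['sqrt', 'log2']}
    gs = GridSearchCV(model, params)
    gs.fit(X_train, y_train)

    # How well do your hyperparameter optimizations generalize
    # to unseen test data?
    gs.score(X_test, y_test)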
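If you want to try the whole flow before plugging in your own data, the following sketch runs it end to end on a synthetic dataset. The make_classification data, the smaller n_estimators=50 and cv=3 are assumptions chosen only to keep the run short; they are not recommendations for your problem.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the real X and y (assumption, for illustration only).
X, y = make_classification(
    n_samples=200, n_features=20, n_informative=5, random_state=444)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=444)

sel = SelectFromModel(
    ExtraTreesClassifier(n_estimators=10, random_state=444), threshold="mean")
# Fewer trees than in the answer above, purely to keep this demo fast.
clf = RandomForestClassifier(n_estimators=50, random_state=444)
model = Pipeline([("sel", sel), ("clf", clf)])

gs = GridSearchCV(model, {"clf__max_features": ["sqrt", "log2"]}, cv=3)
gs.fit(X_train, y_train)

# Which setting won the cross-validated search?
print(gs.best_params_)

# How many features did the selection step keep in the refitted pipeline?
n_selected = gs.best_estimator_.named_steps["sel"].get_support().sum()
print(n_selected, "of", X_train.shape[1], "features kept")

# Accuracy on the held-out test split.
test_accuracy = gs.score(X_test, y_test)
print(test_accuracy)
```

Because the selector lives inside the pipeline, it is refitted on each cross-validation training fold, so the validation folds never influence which features are kept.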
Two examples for further reading: