My goal is to use one model to select the most important variables and a second model to make predictions from those variables. In the example below I use two instances of RandomForestClassifier, but the second model could be any other classifier.
The RF has a transform method that takes a threshold argument, and I would like to grid search over different possible threshold values.
Here is a simplified code snippet:
# Transform object and classifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search in older versions
from sklearn.pipeline import Pipeline

rf_filter = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=42, oob_score=False)
clf = RandomForestClassifier(n_jobs=-1, random_state=42, oob_score=False)
pipe = Pipeline([("RFF", rf_filter), ("RF", clf)])

# Grid search parameters
rf_n_estimators = [10, 20]
rff_transform = ["median", "mean"]  # Search over the threshold parameter

estimator = GridSearchCV(pipe,
                         cv=3,
                         param_grid=dict(RF__n_estimators=rf_n_estimators,
                                         RFF__threshold=rff_transform))
estimator.fit(X_train, y_train)
The error is ValueError: Invalid parameter threshold for estimator RandomForestClassifier
I thought this would work because the docs say: "If None and if available, the object attribute threshold is used."
I tried setting the threshold attribute before the grid search (rf_filter.threshold = "median") and it worked; however, I couldn't figure out how to then grid search over it.
Is there a way to iterate over different arguments that would normally be expected to be provided within the transform method of a classifier?
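One workaround I've been considering (a sketch, not something confirmed by the docs quoted above): wrap the selection forest in SelectFromModel, which exposes threshold as a constructor parameter, so GridSearchCV can set it via set_params just like any other pipeline parameter. The dataset here is synthetic, just to make the snippet self-contained:

```python
# Sketch: expose the selection threshold as a grid-searchable parameter
# by wrapping the filter forest in SelectFromModel.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy data standing in for X_train, y_train
X_train, y_train = make_classification(n_samples=200, n_features=20, random_state=42)

pipe = Pipeline([
    ("RFF", SelectFromModel(RandomForestClassifier(n_estimators=50, random_state=42))),
    ("RF", RandomForestClassifier(random_state=42)),
])

estimator = GridSearchCV(
    pipe,
    cv=3,
    param_grid={
        "RFF__threshold": ["median", "mean"],  # now a real constructor parameter
        "RF__n_estimators": [10, 20],
    },
)
estimator.fit(X_train, y_train)
print(estimator.best_params_)
```

I'm not sure this is idiomatic, but it avoids relying on the classifier's own transform method entirely.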