There are various methods for automated feature selection in Scikit-learn, e.g.:
from sklearn.feature_selection import SelectKBest, f_regression
my_feature_selector = SelectKBest(score_func=f_regression, k=3)
my_feature_selector.fit_transform(X, y)
The selected features can then be retrieved with:
feature_idx = my_feature_selector.get_support(indices=True)
feature_names = X.columns[feature_idx]
(Note: in my case, X and y are Pandas DataFrames with named columns.)
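To make this concrete and reproducible, here is a minimal end-to-end sketch; the California housing data is my assumption, chosen because it matches the feature names used further down:

from sklearn.datasets import fetch_california_housing
from sklearn.feature_selection import SelectKBest, f_regression

# With as_frame=True, X is a DataFrame with named columns (y is a Series)
X, y = fetch_california_housing(return_X_y=True, as_frame=True)

my_feature_selector = SelectKBest(score_func=f_regression, k=3)
my_feature_selector.fit_transform(X, y)
feature_idx = my_feature_selector.get_support(indices=True)
feature_names = X.columns[feature_idx]
print(list(feature_names))  # the three selected column names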
The feature names are also stored as an attribute of a fitted model:
feature_names = my_model.feature_names_in_
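For instance, continuing the sketch above with LinearRegression as a stand-in estimator (the attribute is populated whenever an estimator is fitted on a DataFrame, in Scikit-learn 1.0+):

from sklearn.linear_model import LinearRegression

# Fit on the columns chosen by the selector above
my_model = LinearRegression().fit(X[feature_names], y)
print(my_model.feature_names_in_)  # the selected column names, as a NumPy array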
However, I want to build a pipeline with a manual (i.e. pre-specified) set of features.
Obviously, I could manually select the features from the full dataset every time I do training or prediction:
model1 = LinearRegression()  # stand-in estimator
model1_feature_names = ['MedInc', 'AveRooms', 'Latitude']
model1.fit(X[model1_feature_names], y)
y_pred1 = model1.predict(X[model1.feature_names_in_])
But I want a more convenient way to construct different models (or pipelines), each of which uses a potentially different, manually specified set of features. Ideally, I would specify the feature names (i.e. what ends up in feature_names_in_) as an initialization parameter, so that later I don't have to worry about transforming the data and can run each model (or pipeline) on the full dataset as follows:
model1.fit(X, y) # uses a pre-defined subset of features in X
model2.fit(X, y) # uses a different subset of features
y_pred1 = model1.predict(X)
y_pred2 = model2.predict(X)
Do I need to build a custom feature selector to do this? Surely there's an easier way.
I guess I was expecting to find something like a built-in FeatureSelector
class that I could use in a pipeline as follows:
from sklearn.pipeline import Pipeline
my_feature_selector1 = FeatureSelector(feature_names=['MedInc', 'AveRooms', 'Latitude'])
my_feature_selector1.fit_transform(X, y)  # no actual fitting happens; this would simply select the named columns
pipe1 = Pipeline([('feature_selector', my_feature_selector1), ('model', LinearRegression())])
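For concreteness, here is a minimal sketch of the kind of custom selector I have in mind (FeatureSelector is my hypothetical name, not an existing Scikit-learn class):

from sklearn.base import BaseEstimator, TransformerMixin

class FeatureSelector(BaseEstimator, TransformerMixin):
    """Pass through a fixed, pre-specified subset of named columns."""

    def __init__(self, feature_names=None):
        self.feature_names = feature_names

    def fit(self, X, y=None):
        # Nothing to learn: the feature set is fixed at construction time
        return self

    def transform(self, X):
        # Assumes X is a Pandas DataFrame with named columns
        return X[self.feature_names]

With something like this, pipe1.fit(X, y) and pipe1.predict(X) would work on the full data set, but I'd rather not roll my own if there's already a built-in way.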