How can I use a custom feature selection function in scikit-learn's `pipeline`

Let's say that I want to compare different dimensionality reduction approaches for a particular (supervised) dataset that consists of n > 2 features, via cross-validation and by using the Pipeline class.

For example, if I want to experiment with PCA vs LDA I could do something like:

from sklearn.model_selection import cross_val_score, KFold
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import StandardScaler
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.decomposition import PCA

clf_all = Pipeline(steps=[
    ('scaler', StandardScaler()),           
    ('classification', GaussianNB())   
    ])

clf_pca = Pipeline(steps=[
    ('scaler', StandardScaler()),    
    ('reduce_dim', PCA(n_components=2)),
    ('classification', GaussianNB())   
    ])

clf_lda = Pipeline(steps=[
    ('scaler', StandardScaler()), 
    ('reduce_dim', LDA(n_components=2)),
    ('classification', GaussianNB())   
    ])

# Constructing the k-fold cross-validation iterator (k=10)

cv = KFold(n_splits=10,      # number of folds the dataset is divided into
           shuffle=True,
           random_state=123)

scores = [
    cross_val_score(clf, X_train, y_train, cv=cv, scoring='accuracy')
            for clf in [clf_all, clf_pca, clf_lda]
    ]

But now, let's say that -- based on some "domain knowledge" -- I have the hypothesis that features 3 & 4 might be "good features" (the third and fourth columns of the array X_train), and I want to compare this manual selection with the other approaches.

How would I include such a manual feature selection in the pipeline?

For example

def select_3_and_4(X_train):
    return X_train[:,2:4]

clf_all = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('feature_select', select_3_and_4),           
    ('classification', GaussianNB())   
    ]) 

would obviously not work.

So I assume I have to create a feature selection class with a dummy fit method and a transform method that returns the two columns of the NumPy array? Or is there a better way?

Agile answered 11/8, 2014 at 19:11 Comment(3)
I know this is an old post, but for anyone who sees this, they should note that LDA is a classifier rather than a transformer, and so its use in this example is not appropriate.Kurus
@Kurus you are wrong, it's perfectly appropriate. LDA is both a classifier and a transformer: a fitted LDA model can also be used to reduce the dimensionality of the input by projecting it onto the most discriminative directions, using the transform method, which is exactly what the topic starter does.Coquina
Good catch @AnatolyAlekseev, I didn't realise SKLearn LDA implemented transform. Side note: this example might still be a bit redundant, as LDA defaults to PCA for dimensionality reductionKurus

If you want to use the Pipeline object, then yes, the clean way is to write a transformer object. The dirty way to do this is

select_3_and_4.transform = select_3_and_4.__call__
select_3_and_4.fit = lambda X, y=None: select_3_and_4

and use select_3_and_4 as you had it in your pipeline. You can evidently also write a class.

Otherwise, you could also just give X_train[:, 2:4] to your pipeline if you know that the other features are irrelevant.

Data-driven feature selection tools may be off-topic here, but they are always useful: check e.g. sklearn.feature_selection.SelectKBest with sklearn.feature_selection.f_classif or sklearn.feature_selection.f_regression and, in your case, k=2.
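
For illustration, a minimal sketch of how SelectKBest could slot into the question's setup (the pipeline variable and step names here are just placeholders):

from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Keep the 2 features with the highest ANOVA F-scores; inside
# cross_val_score the scores are computed on the training folds only.
clf_kbest = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('reduce_dim', SelectKBest(score_func=f_classif, k=2)),
    ('classification', GaussianNB())
    ])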

Planck answered 11/8, 2014 at 20:20 Comment(1)
Thank you. In the meantime I experimented with writing a class that has a transform method, and it seems to workAgile

I just want to post my solution for completeness; maybe it will be useful to someone:

class ColumnExtractor(object):

    def transform(self, X):
        cols = X[:, 2:4]  # columns 3 and 4 (indices 2 and 3) are extracted
        return cols

    def fit(self, X, y=None):
        return self

Then, it can be used in the Pipeline like so:

clf = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('reduce_dim', ColumnExtractor()),           
    ('classification', GaussianNB())   
    ])

EDIT: General solution

For a more general solution, if you want to select and stack multiple columns, you can use the following class:

import numpy as np

class ColumnExtractor(object):

    def __init__(self, cols):
        self.cols = cols

    def transform(self, X):
        col_list = []
        for c in self.cols:
            col_list.append(X[:, c:c+1])
        return np.concatenate(col_list, axis=1)

    def fit(self, X, y=None):
        return self

clf = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('dim_red', ColumnExtractor(cols=(1, 3))),   # selects the 2nd and 4th columns (indices 1 and 3)
    ('classification', GaussianNB())
    ])
Agile answered 11/8, 2014 at 23:50 Comment(3)
So, implementing fit and transform is enough to have a new feature transformation step that can be added into the pipeline?Unni
Yes, that's all you needAgile
How about if the extractor needs a parameter? How would you add set_params?Shaft

Adding to Sebastian Raschka's and eickenberg's answers: the requirements a transformer object must satisfy are specified in scikit-learn's documentation.

There are several more requirements than just having fit and transform if you want the estimator to be usable in parameter search (for example, in GridSearchCV, which clones its estimator), such as implementing set_params.
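
As a sketch of what the comments below recommend (deriving from BaseEstimator instead of writing set_params yourself), here is one possible rewrite of the ColumnExtractor from the accepted answer; the class and parameter names mirror that answer:

from sklearn.base import BaseEstimator, TransformerMixin

class ColumnExtractor(BaseEstimator, TransformerMixin):

    def __init__(self, cols):
        # Store the constructor argument under the same name so that the
        # inherited get_params()/set_params() can discover it.
        self.cols = cols

    def fit(self, X, y=None):
        return self  # stateless transformer, nothing to learn

    def transform(self, X):
        return X[:, self.cols]  # select the given column indices

With a pipeline step named 'reduce_dim', the selection then becomes tunable, e.g. param_grid={'reduce_dim__cols': [[0, 1], [2, 3]]} in GridSearchCV.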

Innerdirected answered 22/1, 2015 at 14:10 Comment(3)
the recommended way to implement set_params though is to inherit it from BaseEstimator, e.g. by defining your class with the statement class my_class(TransformerMixin, BaseEstimator). Don't go write your own set_params method, unless you're really sure you need to.Undersized
BaseEstimator is unknown! Can you edit the answer with your comment regarding set_params?Shaft
@user702846 I didn't get your intention. The documentation for BaseEstimator is available here - scikit-learn.org/stable/modules/generated/…Innerdirected

You can use the following custom transformer to select the columns specified:

# Custom transformer that extracts the columns passed as an argument to its constructor

from sklearn.base import BaseEstimator, TransformerMixin

class FeatureSelector(BaseEstimator, TransformerMixin):

    # Class constructor
    def __init__(self, feature_names):
        self._feature_names = feature_names

    # Nothing to fit, so just return self
    def fit(self, X, y=None):
        return self

    # Select the requested columns (assumes X is a pandas DataFrame)
    def transform(self, X, y=None):
        return X[self._feature_names]

Here, feature_names is the list of features you want to select. For more details, you can refer to this link: https://towardsdatascience.com/custom-transformers-and-ml-data-pipelines-with-python-20ea2a7adb65
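
Since transform indexes X by column name, this assumes X is a pandas DataFrame. A minimal usage sketch, with data and column names invented for illustration:

import pandas as pd
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline

# Toy data; the column names are made up for this example
X_train = pd.DataFrame({'f1': [0.1, 0.4, 0.3, 0.9],
                        'f2': [1.0, 0.2, 0.5, 0.1],
                        'f3': [0.7, 0.9, 0.8, 0.2],
                        'f4': [0.0, 0.3, 0.6, 0.4]})
y_train = [0, 1, 0, 1]

clf = Pipeline(steps=[
    ('feature_select', FeatureSelector(['f3', 'f4'])),
    ('classification', GaussianNB())
    ])
clf.fit(X_train, y_train)
print(clf.predict(X_train))  # predictions using only f3 and f4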

Sarcenet answered 7/12, 2020 at 8:12 Comment(4)
Thanks. This looks like an updated solution using the latest packages since the question is from 2014 but still very relevant today. Is it possible to add in the full end to end solution so that the answer is self contained? E.g. add the pipeline calls, imports, etc.Angers
Thanks for this. I noticed that the built-in feature selectors inherit from a _BaseFilter which inherits from SelectorMixin, BaseEstimator. Should this custom selector do the same? Not sure it matters.Armageddon
I noticed that the built-in estimators save feature names in an attribute called model.feature_names_in_. Should we use that name instead of _feature_names for consistency?Armageddon
Regarding my second comment above—I am mistaken. The feature_names_in_ attribute is an array of the features at the input to the feature selector (not the output). So we need an additional attribute to store the selected features as you propose. Although, for clarity, maybe this should be called feature_names_out_.Armageddon

I didn't find the accepted answer very clear, so here is my solution for others. Basically, the idea is to make a new class based on BaseEstimator and TransformerMixin.

The following is a feature selector based on the fraction of NAs within a column: a column is kept only if its NA fraction is below perc.

from sklearn.base import TransformerMixin, BaseEstimator

class NonNAselector(BaseEstimator, TransformerMixin):

    """Extract columns with less than x percentage NA to impute further
    in the line
    Class to use in the pipline
    -----
    attributes 
    fit : identify columns - in the training set
    transform : only use those columns
    """

    def __init__(self, perc=0.1):
        self.perc = perc
        self.columns_with_less_than_x_na_id = None

    def fit(self, X, y=None):
        self.columns_with_less_than_x_na_id = (X.isna().sum()/X.shape[0] < self.perc).index.tolist()
        return self

    def transform(self, X, y=None, **kwargs):
        return X[self.columns_with_less_than_x_na_id]

    def get_params(self, deep=False):
        return {"perc": self.perc}
Shaft answered 20/12, 2018 at 14:1 Comment(0)

Another way is to simply use the ColumnTransformer with an «empty» FunctionTransformer:

from sklearn.compose import ColumnTransformer
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# a FunctionTransformer with func=None yields the identity function / passthrough
empty_func = make_pipeline(FunctionTransformer(func=None))

clf_all = make_pipeline(StandardScaler(),
                        ColumnTransformer([("select", empty_func, [2, 3])]),  # column indices 2 and 3, i.e. features 3 & 4
                        GaussianNB(),
                        )

This works because the ColumnTransformer by default drops the remainder of columns that aren't selected.

EDIT:

As Bill suggested in the comments, it is better to use "passthrough" to select specific columns or "drop" to drop specific columns.

pipe = make_pipeline(StandardScaler(),
                     ColumnTransformer([("select", "passthrough", [2, 3])]), 
                     GaussianNB())
Deceptive answered 8/6, 2021 at 7:14 Comment(1)
I think you can actually specify 'passthrough' and 'drop' in place of an empty transformer now. See this answer.Armageddon
