Put customized functions in Sklearn pipeline

In my classification scheme, there are several steps including:

1. SMOTE (Synthetic Minority Over-sampling Technique)
2. Fisher criteria for feature selection
3. Standardization (Z-score normalisation)
4. SVC (Support Vector Classifier)

The main parameters to be tuned in the scheme above are percentile (2.) and hyperparameters for SVC (4.) and I want to go through grid search for tuning.

The current solution builds a "partial" pipeline covering only steps 3 and 4 of the scheme, clf = Pipeline([('normal',preprocessing.StandardScaler()),('svc',svm.SVC(class_weight='auto'))]), and breaks the scheme into two parts:

Tune the percentile of features to keep through the first grid search

skf = StratifiedKFold(y)
for train_ind, test_ind in skf:
    X_train, X_test, y_train, y_test = X[train_ind], X[test_ind], y[train_ind], y[test_ind]
    # SMOTE synthesizes the training data (we want to keep test data intact)
    X_train, y_train = SMOTE(X_train, y_train)
    for percentile in percentiles:
        # Fisher returns the indices of the selected features specified by the parameter 'percentile'
        selected_ind = Fisher(X_train, y_train, percentile) 
        X_train_selected, X_test_selected = X_train[:, selected_ind], X_test[:, selected_ind]
        model = clf.fit(X_train_selected, y_train)
        y_predict = model.predict(X_test_selected)
        f1 = f1_score(y_predict, y_test)

The F1 scores are stored and averaged over all fold partitions for each percentile, and the percentile with the best CV score is returned. The percentile loop is the inner loop so that the comparison is fair: within each fold partition, every percentile sees the same training data (including the synthesized data).

After determining the percentile, tune the hyperparameters by second grid search

skf = StratifiedKFold(y)
for train_ind, test_ind in skf:
    X_train, X_test, y_train, y_test = X[train_ind], X[test_ind], y[train_ind], y[test_ind]
    # SMOTE synthesizes the training data (we want to keep test data intact)
    X_train, y_train = SMOTE(X_train, y_train)
    for parameters in parameter_comb:
        # Select the features based on the tuned percentile
        selected_ind = Fisher(X_train, y_train, best_percentile) 
        X_train_selected, X_test_selected = X_train[:, selected_ind], X_test[:, selected_ind]
        clf.set_params(svc__C=parameters['C'], svc__gamma=parameters['gamma'])
        model = clf.fit(X_train_selected, y_train)
        y_predict = model.predict(X_test_selected)
        f1 = f1_score(y_predict, y_test)

This is done in a very similar way, except that we tune the hyperparameters of the SVC rather than the percentile of features to select.

My questions are:

In the current solution, I only involve 3. and 4. in the clf and do 1. and 2. kinda "manually" in two nested loops as described above. Is there any way to include all four steps in a pipeline and do the whole process at once?
If it is okay to keep the first nested loop, is it possible (and how) to simplify the second nested loop using a single pipeline
```
clf_all = Pipeline([('smote', SMOTE()),
                    ('fisher', Fisher(percentile=best_percentile)),
                    ('normal',preprocessing.StandardScaler()),
                    ('svc',svm.SVC(class_weight='auto'))]) 
```
and simply use GridSearchCV(clf_all, parameter_comb) for tuning?

Please note that both SMOTE and Fisher (ranking criteria) have to be done only for the training data in each fold partition.

Any comments would be much appreciated.

SMOTE and Fisher are shown below:

def Fscore(X, y, percentile=None):
    X_pos, X_neg = X[y==1], X[y==0]
    X_mean = X.mean(axis=0)
    X_pos_mean, X_neg_mean = X_pos.mean(axis=0), X_neg.mean(axis=0)
    deno = (1.0/(shape(X_pos)[0]-1))*X_pos.var(axis=0) + (1.0/(shape(X_neg)[0]-1))*X_neg.var(axis=0)
    num = (X_pos_mean - X_mean)**2 + (X_neg_mean - X_mean)**2
    F = num/deno
    sort_F = argsort(F)[::-1]
    n_feature = (float(percentile)/100)*shape(X)[1]
    ind_feature = sort_F[:int(ceil(n_feature))]
    return(ind_feature)

SMOTE is from https://github.com/blacklab/nyan/blob/master/shared_modules/smote.py; it returns the synthesized data. I modified it to return the original input data stacked with the synthesized data, along with the original and synthesized labels.

def smote(X, y):
    n_pos, n_neg = sum(y==1), sum(y==0)
    n_syn = (n_neg-n_pos)/float(n_pos) 
    X_pos = X[y==1]
    X_syn = SMOTE(X_pos, int(round(n_syn))*100, 5)
    y_syn = np.ones(shape(X_syn)[0])
    X, y = np.vstack([X, X_syn]), np.concatenate([y, y_syn])
    return(X, y)
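
For readers who do not have the linked module at hand, the interpolation idea behind SMOTE can be sketched in plain NumPy. This is only an illustrative stand-in, not the code from the linked repository: the names smote_sketch and oversample are hypothetical, and unlike the wrapper above it assumes you want to balance the two classes exactly.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic rows by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = X_min.shape[0]
    k = min(k, n - 1)  # cannot have more neighbours than other samples
    # Pairwise squared distances between minority samples
    d2 = ((X_min[:, None, :] - X_min[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d2, np.inf)  # a sample is not its own neighbour
    neighbours = np.argsort(d2, axis=1)[:, :k]
    # Pick a base sample, one of its k neighbours, and a random gap in [0, 1)
    base = rng.integers(0, n, size=n_new)
    nb = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[nb] - X_min[base])

def oversample(X, y, k=5, seed=0):
    """Stack the original data with synthetic positives until classes balance."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    X_pos = X[y == 1]
    n_new = max(int((y == 0).sum()) - X_pos.shape[0], 0)
    X_syn = smote_sketch(X_pos, n_new, k, seed)
    return np.vstack([X, X_syn]), np.concatenate([y, np.ones(n_new)])
```

Each synthetic row lies on the line segment between a minority sample and one of its nearest minority neighbours, which is why it must be generated from the training fold only.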

Doi answered 7/7, 2015 at 4:44 Comment(0)

I don't know where your SMOTE() and Fisher() functions are coming from, but the answer is yes, you can definitely do this. To do so, you will need to write a wrapper class around those functions. The easiest way to do this is to inherit from sklearn's BaseEstimator and TransformerMixin classes; see this for an example: http://scikit-learn.org/stable/auto_examples/hetero_feature_union.html

If this isn't making sense to you, post the details of at least one of your functions (the library it comes from or your code if you wrote it yourself) and we can go from there.

EDIT:

I apologize, I didn't look at your functions closely enough to realize that they transform your target in addition to your training data (i.e. both X and y). Pipeline does not support transformations to your target, so you will have to do them beforehand, as you originally were. For your reference, here is what it would look like to write a custom class for your Fisher process, which would work if the function itself did not need to affect your target variable.

>>> from sklearn.base import BaseEstimator, TransformerMixin
>>> from sklearn.preprocessing import StandardScaler
>>> from sklearn.svm import SVC
>>> from sklearn.pipeline import Pipeline
>>> from sklearn.grid_search import GridSearchCV
>>> from sklearn.datasets import load_iris
>>> 
>>> class Fisher(BaseEstimator, TransformerMixin):
...     def __init__(self,percentile=0.95):
...             self.percentile = percentile
...     def fit(self, X, y):
...             from numpy import shape, argsort, ceil
...             X_pos, X_neg = X[y==1], X[y==0]
...             X_mean = X.mean(axis=0)
...             X_pos_mean, X_neg_mean = X_pos.mean(axis=0), X_neg.mean(axis=0)
...             deno = (1.0/(shape(X_pos)[0]-1))*X_pos.var(axis=0) + (1.0/(shape(X_neg)[0]-1))*X_neg.var(axis=0)
...             num = (X_pos_mean - X_mean)**2 + (X_neg_mean - X_mean)**2
...             F = num/deno
...             sort_F = argsort(F)[::-1]
...             n_feature = (float(self.percentile)/100)*shape(X)[1]
...             self.ind_feature = sort_F[:ceil(n_feature)]
...             return self
...     def transform(self, x):
...             return x[self.ind_feature,:]
... 
>>> 
>>> data = load_iris()
>>> 
>>> pipeline = Pipeline([
...     ('fisher', Fisher()),
...     ('normal',StandardScaler()),
...     ('svm',SVC(class_weight='auto'))
... ])
>>> 
>>> grid = {
...     'fisher__percentile':[0.75,0.50],
...     'svm__C':[1,2]
... }
>>> 
>>> model = GridSearchCV(estimator = pipeline, param_grid=grid, cv=2)
>>> model.fit(data.data,data.target)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/dmcgarry/anaconda/lib/python2.7/site-packages/sklearn/grid_search.py", line 596, in fit
    return self._fit(X, y, ParameterGrid(self.param_grid))
  File "/Users/dmcgarry/anaconda/lib/python2.7/site-packages/sklearn/grid_search.py", line 378, in _fit
    for parameters in parameter_iterable
  File "/Users/dmcgarry/anaconda/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 653, in __call__
    self.dispatch(function, args, kwargs)
  File "/Users/dmcgarry/anaconda/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 400, in dispatch
    job = ImmediateApply(func, args, kwargs)
  File "/Users/dmcgarry/anaconda/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 138, in __init__
    self.results = func(*args, **kwargs)
  File "/Users/dmcgarry/anaconda/lib/python2.7/site-packages/sklearn/cross_validation.py", line 1239, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/Users/dmcgarry/anaconda/lib/python2.7/site-packages/sklearn/pipeline.py", line 130, in fit
    self.steps[-1][-1].fit(Xt, y, **fit_params)
  File "/Users/dmcgarry/anaconda/lib/python2.7/site-packages/sklearn/svm/base.py", line 149, in fit
    (X.shape[0], y.shape[0]))
ValueError: X and y have incompatible shapes.
X has 1 samples, but y has 75.

Sulfamerazine answered 8/7, 2015 at 17:36 Comment(6)

Thank you, I included both functions in the OP. – Doi 9/7, 2015 at 6:13

See the edit, sorry for jumping the gun but I don't think it's possible since your functions need to be applied to your target. – Sulfamerazine 9/7, 2015 at 17:27

Sorry for the late response. I am wondering what you meant by "Fisher process which would work if the function itself did not need to affect your target variable." Fisher score here takes targets (i.e., y) as input and makes transformed x as output, which seems to me that it doesn't transform y. – Doi 18/11, 2015 at 9:52

I don't really remember this, but it looks like I just carbon copied your code. Is the goal to select columns from X or sample rows? If it's the former then I believe there was a bug in your code and this should work once fixed, but if it's the latter then that does have an impact on y (because y then needs to be sampled too). – Sulfamerazine 18/11, 2015 at 22:10

Thanks for taking care. It is the former. Fisher score takes X and y as inputs and calculates the ratio of between- and within-variance for each feature (column) using the info. of labels, and the features are sorted based on the ratio. Finally the features are selected given a desired percentage of top features. – Doi 19/11, 2015 at 7:3

I found the bug in the transform method, it should be x[:,self.ind_feature] because we are filtering the features (columns) rather than samples :) – Doi 19/11, 2015 at 16:16
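
Putting the fix from the comments above into the class, a corrected version might look like the sketch below. A few things here are assumptions rather than part of the original answer: it uses sklearn.model_selection (the current home of GridSearchCV), class_weight='balanced' in place of the older 'auto', a synthetic two-class dataset from make_classification (the score only looks at labels 0 and 1), and percentile values on a 0 to 100 scale, since the code divides by 100.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


# Mixins are listed before BaseEstimator, as recent scikit-learn recommends
class Fisher(TransformerMixin, BaseEstimator):
    """Keep the top `percentile` percent of features ranked by Fisher score."""

    def __init__(self, percentile=50):
        self.percentile = percentile

    def fit(self, X, y):
        X, y = np.asarray(X), np.asarray(y)
        X_pos, X_neg = X[y == 1], X[y == 0]
        X_mean = X.mean(axis=0)
        num = (X_pos.mean(axis=0) - X_mean) ** 2 + (X_neg.mean(axis=0) - X_mean) ** 2
        deno = X_pos.var(axis=0) / (len(X_pos) - 1) + X_neg.var(axis=0) / (len(X_neg) - 1)
        scores = num / deno
        n_keep = int(np.ceil(self.percentile / 100.0 * X.shape[1]))
        # Indices of the highest-scoring features, learned from training data only
        self.ind_feature_ = np.argsort(scores)[::-1][:n_keep]
        return self

    def transform(self, X):
        # Select columns (features), not rows (samples)
        return np.asarray(X)[:, self.ind_feature_]


X, y = make_classification(n_samples=200, n_features=10, n_informative=4, random_state=0)

pipeline = Pipeline([
    ('fisher', Fisher()),
    ('normal', StandardScaler()),
    ('svm', SVC(class_weight='balanced')),
])
grid = {'fisher__percentile': [25, 50, 75], 'svm__C': [1, 2]}

model = GridSearchCV(estimator=pipeline, param_grid=grid, cv=3)
model.fit(X, y)
```

Because the feature ranking is computed inside fit, GridSearchCV recomputes it on the training part of every fold, so the held-out part never influences which features are kept.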

scikit-learn added FunctionTransformer to its preprocessing module in version 0.17. It can be used in a similar manner to David's implementation of the Fisher class in the answer above, but with less flexibility. If the function's input and output are set up properly, the transformer provides the fit/transform/fit_transform methods for the function, allowing it to be used in a scikit-learn pipeline.

For example, if the input to a pipeline is a series, the transformer would be as follows:


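from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

def trans_func(input_series):
    # Placeholder: apply your own transformation here and return the result
    output_series = input_series
    return output_series

transformer = FunctionTransformer(trans_func)

# tf_1k and clf_1k are assumed to be defined elsewhere
sk_pipe = Pipeline([("trans", transformer), ("vect", tf_1k), ("clf", clf_1k)])
sk_pipe.fit(train.desc, train.tag)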

where vect is a tf_idf transformer, clf is a classifier and train is the training dataset. "train.desc" is the series text input to the pipeline.
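
For a version that runs without the external objects, here is a minimal sketch on the iris data. The function keep_petal_columns and the step names are made up for illustration. Note that FunctionTransformer is stateless, so it suits fixed transformations like this one, but it cannot learn a feature ranking from y the way the Fisher class does.

```python
from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler
from sklearn.svm import SVC

def keep_petal_columns(X):
    # Stateless step: keep only the last two columns (petal length and width)
    return X[:, 2:]

X, y = load_iris(return_X_y=True)

pipe = Pipeline([
    ("select", FunctionTransformer(keep_petal_columns)),
    ("normal", StandardScaler()),
    ("svc", SVC()),
])
pipe.fit(X, y)
```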

Unmentionable answered 14/1, 2020 at 23:45 Comment(1)

This is a much cleaner answer than the accepted one. Thanks! – Coreycorf 5/7, 2021 at 20:27

You actually can put all of these functions into a single pipeline!

In the accepted answer, @David wrote that your functions

transform your target in addition to your training data (i.e. both X and y). Pipeline does not support transformations to your target so you will have do them prior as you originally were.

It is true that sklearn's pipeline does not support this. However, imblearn's pipeline does. The imblearn pipeline works like sklearn's, but it also accepts samplers, which are applied only while fitting and are skipped at prediction time. These samplers are designed to change both the data X and the labels y. This matters because you often want SMOTE in your pipeline but applied only to the training data, not the test data. With the imblearn pipeline, SMOTE transforms just X_train and y_train and leaves X_test and y_test untouched.

So you can create an imblearn pipeline that has a smote sampler, pre-processing step, and svc.

For more details check out this stack overflow post here and machine learning mastery article here.

Brunel answered 15/12, 2021 at 2:11 Comment(0)
