Putting together sklearn pipeline+nested cross-validation for KNN regression

sklearn.pipeline.Pipeline sklearn.preprocessing.Normalizer sklearn.model_selection.GridSearchCV sklearn.model_selection.cross_val_score sklearn.feature_selection.selectKBest OR sklearn.feature_selection.SelectFromModel

import numpy as np from sklearn.pipeline import Pipeline from sklearn.preprocessing import Normalizer from sklearn.feature_selection import SelectKBest, f_classif from sklearn.neighbors import KNeighborsRegressor from sklearn.model_selection import cross_val_score, GridSearchCV # build regression pipeline pipeline = Pipeline([('normalize', Normalizer()), ('kbest', SelectKBest(f_classif)), ('regressor', KNeighborsRegressor())]) # try knn__n_neighbors from 1 to 20, and feature count from 1 to len(features) parameters = {'kbest__k': list(range(1, X.shape[1]+1)), 'regressor__n_neighbors': list(range(1,21))} # outer cross-validation on model, inner cross-validation on hyperparameters scores = cross_val_score(GridSearchCV(pipeline, parameters, scoring="neg_mean_squared_error", cv=10), X, y, cv=10, scoring="neg_mean_squared_error", verbose=2) rmses = np.abs(scores)**(1/2) avg_rmse = np.mean(rmses) print(avg_rmse)

Your code seems okay.

For the scoring="neg_mean_squared_error" for both cross_val_score and GridSearchCV, I would do the same to make sure things run fine but the only way to test this is to remove the one of the two and see if the results change.

SelectKBest is a good approach but you can also use SelectFromModel or even other methods that you can find here

Finally, in order to get the best parameters and the features scores I modified a bit your code as follows:

import ...


pipeline = Pipeline([('normalize', Normalizer()),
                     ('kbest', SelectKBest(f_classif)),
                     ('regressor', KNeighborsRegressor())])

# try knn__n_neighbors from 1 to 20, and feature count from 1 to len(features)
parameters = {'kbest__k':  list(range(1, X.shape[1]+1)),
              'regressor__n_neighbors': list(range(1,21))}

# changes here

grid = GridSearchCV(pipeline, parameters, cv=10, scoring="neg_mean_squared_error")

grid.fit(X, y)

# get the best parameters and the best estimator
print("the best estimator is \n {} ".format(grid.best_estimator_))
print("the best parameters are \n {}".format(grid.best_params_))

# get the features scores rounded in 2 decimals
pip_steps = grid.best_estimator_.named_steps['kbest']

features_scores = ['%.2f' % elem for elem in pip_steps.scores_ ]
print("the features scores are \n {}".format(features_scores))

feature_scores_pvalues = ['%.3f' % elem for elem in pip_steps.pvalues_]
print("the feature_pvalues is \n {} ".format(feature_scores_pvalues))

# create a tuple of feature names, scores and pvalues, name it "features_selected_tuple"

featurelist = ['age', 'weight']

features_selected_tuple=[(featurelist[i], features_scores[i], 
feature_scores_pvalues[i]) for i in pip_steps.get_support(indices=True)]

# Sort the tuple by score, in reverse order

features_selected_tuple = sorted(features_selected_tuple, key=lambda 
feature: float(feature[1]) , reverse=True)

# Print
print 'Selected Features, Scores, P-Values'
print features_selected_tuple

Results using my data:

the best estimator is
Pipeline(steps=[('normalize', Normalizer(copy=True, norm='l2')), ('kbest', SelectKBest(k=2, score_func=<function f_classif at 0x0000000004ABC898>)), ('regressor', KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
         metric_params=None, n_jobs=1, n_neighbors=18, p=2,
         weights='uniform'))])

the best parameters are
{'kbest__k': 2, 'regressor__n_neighbors': 18}

the features scores are
['8.98', '8.80']

the feature_pvalues is
['0.000', '0.000']

Selected Features, Scores, P-Values
[('correlation', '8.98', '0.000'), ('gene', '8.80', '0.000')]

Recommended topics

Hot tags