Putting together an sklearn pipeline + nested cross-validation for KNN regression

I'm trying to figure out how to build a workflow for sklearn.neighbors.KNeighborsRegressor that:

  • normalizes the features
  • performs feature selection (the best subset of my 20 numeric features, with no fixed subset size)
  • cross-validates the hyperparameter K in the range 1 to 20
  • cross-validates the model itself
  • uses RMSE as the error metric

There are so many different options in scikit-learn that I'm a bit overwhelmed trying to decide which classes I need.

Besides sklearn.neighbors.KNeighborsRegressor, I think I need:

sklearn.pipeline.Pipeline  
sklearn.preprocessing.Normalizer
sklearn.model_selection.GridSearchCV
sklearn.model_selection.cross_val_score

sklearn.feature_selection.SelectKBest
OR
sklearn.feature_selection.SelectFromModel

Would someone please show me what defining this pipeline/workflow might look like? I think it should be something like this:

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Normalizer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import cross_val_score, GridSearchCV

# build regression pipeline
pipeline = Pipeline([('normalize', Normalizer()),
                     ('kbest', SelectKBest(f_classif)),
                     ('regressor', KNeighborsRegressor())])

# search kbest__k from 1 to the number of features, and regressor__n_neighbors
# from 1 to 20 (X and y are assumed to be already defined)
parameters = {'kbest__k':  list(range(1, X.shape[1]+1)),
              'regressor__n_neighbors': list(range(1,21))}

# outer cross-validation on model, inner cross-validation on hyperparameters
scores = cross_val_score(GridSearchCV(pipeline, parameters, scoring="neg_mean_squared_error", cv=10), 
                         X, y, cv=10, scoring="neg_mean_squared_error", verbose=2)

# scores are negated MSE values, so flip the sign before taking the square root
rmses = np.sqrt(-scores)
avg_rmse = np.mean(rmses)
print(avg_rmse)

It runs without errors, but I have a few concerns:

  • Did I perform the nested cross-validation properly so that my RMSE is unbiased?
  • If I want the final model to be selected according to the best RMSE, am I supposed to use scoring="neg_mean_squared_error" for both cross_val_score and GridSearchCV?
  • Is SelectKBest with f_classif the best option for selecting features for the KNeighborsRegressor model?
  • How can I see:
    • which subset of features was selected as best
    • which K was selected as best

Any help is greatly appreciated!

Saks answered 17/7, 2017 at 17:53

Comments:
Your code looks fine, and the approach seems correct to me. Do you get any error or unexpected result? – Sternlight
Hey, thanks for your comment. I updated my post with more information on my concerns. – Saks

Your code seems okay.

As for using scoring="neg_mean_squared_error" for both cross_val_score and GridSearchCV: I would do the same, to make sure things are consistent, but the only real way to test this is to remove one of the two and see whether the results change.
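
As an aside, scikit-learn 0.22 and later also ship a built-in "neg_root_mean_squared_error" scorer, which makes the manual square-root step unnecessary. A minimal sketch, reusing the pipeline, parameters, X and y from the question:

import numpy as np
from sklearn.model_selection import GridSearchCV, cross_val_score

# the same RMSE-based scorer drives both the inner (model selection)
# and outer (error estimate) loops
inner = GridSearchCV(pipeline, parameters, scoring="neg_root_mean_squared_error", cv=10)
outer_scores = cross_val_score(inner, X, y, cv=10, scoring="neg_root_mean_squared_error")

# the scorer returns negated RMSEs, so flip the sign before averaging
print(np.mean(-outer_scores))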

SelectKBest is a good approach, but you can also use SelectFromModel or the other methods listed in scikit-learn's feature selection documentation.
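
One caveat worth flagging: f_classif is an ANOVA F-test intended for a categorical target, so for a regression target like yours, f_regression is usually the more appropriate univariate score. Below is a rough sketch of both alternatives; the Lasso estimator and its alpha=0.1 are only assumptions for illustration:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Normalizer
from sklearn.feature_selection import SelectKBest, SelectFromModel, f_regression
from sklearn.linear_model import Lasso
from sklearn.neighbors import KNeighborsRegressor

# univariate selection with a regression-appropriate score function
univariate_pipe = Pipeline([('normalize', Normalizer()),
                            ('kbest', SelectKBest(f_regression)),
                            ('regressor', KNeighborsRegressor())])

# model-based selection: keep features whose Lasso coefficients are non-zero
model_based_pipe = Pipeline([('normalize', Normalizer()),
                             ('select', SelectFromModel(Lasso(alpha=0.1))),
                             ('regressor', KNeighborsRegressor())])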

Finally, in order to get the best parameters and the feature scores, I modified your code a bit, as follows:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Normalizer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import GridSearchCV


pipeline = Pipeline([('normalize', Normalizer()),
                     ('kbest', SelectKBest(f_classif)),
                     ('regressor', KNeighborsRegressor())])

# search kbest__k from 1 to the number of features, and regressor__n_neighbors from 1 to 20
parameters = {'kbest__k':  list(range(1, X.shape[1]+1)),
              'regressor__n_neighbors': list(range(1,21))}

# changes here

grid = GridSearchCV(pipeline, parameters, cv=10, scoring="neg_mean_squared_error")

grid.fit(X, y)

# get the best parameters and the best estimator
print("the best estimator is \n {} ".format(grid.best_estimator_))
print("the best parameters are \n {}".format(grid.best_params_))

# pull the fitted SelectKBest step out of the best pipeline
pip_steps = grid.best_estimator_.named_steps['kbest']

# per-feature scores, rounded to 2 decimals
features_scores = ['%.2f' % elem for elem in pip_steps.scores_]
print("the feature scores are \n {}".format(features_scores))

feature_scores_pvalues = ['%.3f' % elem for elem in pip_steps.pvalues_]
print("the feature p-values are \n {}".format(feature_scores_pvalues))

# build a list of (feature name, score, p-value) tuples for the selected features,
# named "features_selected_tuple"

featurelist = ['age', 'weight']

features_selected_tuple = [(featurelist[i], features_scores[i], feature_scores_pvalues[i])
                           for i in pip_steps.get_support(indices=True)]

# sort the tuples by score, in descending order
features_selected_tuple = sorted(features_selected_tuple,
                                 key=lambda feature: float(feature[1]), reverse=True)

# print the selected features with their scores and p-values
print('Selected Features, Scores, P-Values')
print(features_selected_tuple)

Results using my data:

the best estimator is
Pipeline(steps=[('normalize', Normalizer(copy=True, norm='l2')), ('kbest', SelectKBest(k=2, score_func=<function f_classif at 0x0000000004ABC898>)), ('regressor', KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
         metric_params=None, n_jobs=1, n_neighbors=18, p=2,
         weights='uniform'))])

the best parameters are
{'kbest__k': 2, 'regressor__n_neighbors': 18}

the feature scores are
['8.98', '8.80']

the feature p-values are
['0.000', '0.000']

Selected Features, Scores, P-Values
[('correlation', '8.98', '0.000'), ('gene', '8.80', '0.000')]
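
One note on the "nested" part: grid.fit(X, y) above only runs the inner, hyperparameter-selection loop, so any score read off grid itself is optimistically biased. To keep the unbiased RMSE estimate, the outer loop from the question can be wrapped around the grid search with cross_val_score. A minimal sketch, reusing grid, X and y from above:

import numpy as np
from sklearn.model_selection import cross_val_score

# outer loop: each fold re-runs the full grid search on its own training split
outer_scores = cross_val_score(grid, X, y, cv=10, scoring="neg_mean_squared_error")
print(np.mean(np.sqrt(-outer_scores)))  # cross-validated RMSE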
Sternlight answered 17/7, 2017 at 19:9

Comments:
Thanks! I see that it shows the number of features used for kbest__k, but is there a way to see which columns were selected specifically? Does SelectKBest just try the first column, then the first and second, and so on, or does it try every combination of features in the selected range? – Saks
@Jake I edited my post and added the code for the feature p-values and scores. Note that SelectKBest scores each feature individually and keeps the k highest-scoring ones; it does not search over feature combinations. – Sternlight
@Jake Second update of my answer: now you can get the selected features. – Sternlight
Thanks, really appreciate it! – Saks
@Jake Glad that I could help. – Sternlight
Sorry for this late comment, but could you explain where he/she performed the "nested" cross-validation step? I don't see it in the code. Thanks. – Archy
Why didn't you use cross_val_score()? – Dominik