I'm trying to figure out how to build a workflow for sklearn.neighbors.KNeighborsRegressor that includes:
- normalizes the features
- performs feature selection (the best subset of my 20 numeric features, with no fixed number of features to keep)
- cross-validates the hyperparameter K in the range 1 to 20
- cross-validates the model
- uses RMSE as the error metric
There are so many different options in scikit-learn that I'm a bit overwhelmed trying to decide which classes I need.
Besides sklearn.neighbors.KNeighborsRegressor, I think I need:
sklearn.pipeline.Pipeline
sklearn.preprocessing.Normalizer
sklearn.model_selection.GridSearchCV
sklearn.model_selection.cross_val_score
sklearn.feature_selection.SelectKBest
OR
sklearn.feature_selection.SelectFromModel
Would someone please show me what defining this pipeline/workflow might look like? I think it should be something like this:
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Normalizer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import cross_val_score, GridSearchCV
# X is the feature matrix and y is the target vector (defined elsewhere)

# build the regression pipeline
pipeline = Pipeline([('normalize', Normalizer()),
                     ('kbest', SelectKBest(f_classif)),
                     ('regressor', KNeighborsRegressor())])

# try regressor__n_neighbors from 1 to 20, and kbest__k from 1 to the number of features
parameters = {'kbest__k': list(range(1, X.shape[1] + 1)),
              'regressor__n_neighbors': list(range(1, 21))}

# outer cross-validation on the model, inner cross-validation on the hyperparameters
scores = cross_val_score(GridSearchCV(pipeline, parameters, scoring="neg_mean_squared_error", cv=10),
                         X, y, cv=10, scoring="neg_mean_squared_error", verbose=2)

# scores are negative MSE, so take the absolute value before the square root
rmses = np.sqrt(np.abs(scores))
avg_rmse = np.mean(rmses)
print(avg_rmse)
It doesn't seem to error out, but a few of my concerns are:
- Did I perform the nested cross-validation properly so that my RMSE is unbiased?
- If I want the final model to be selected according to the best RMSE, am I supposed to use scoring="neg_mean_squared_error" for both cross_val_score and GridSearchCV?
- Is SelectKBest with f_classif the best option for selecting features for the KNeighborsRegressor model?
- How can I see:
  - which subset of features was selected as best
  - which K was selected as best
  (my rough attempt at inspecting this is in the snippet below)
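For those last two sub-questions, I was guessing I could fit a standalone GridSearchCV on the full data and then read off best_params_ and the SelectKBest support mask, something like this — but I'm not sure whether this is the right way to do it, or whether it defeats the purpose of the nested cross-validation:

# fit the grid search on its own (outside the outer CV) just to inspect what it picks
grid = GridSearchCV(pipeline, parameters, scoring="neg_mean_squared_error", cv=10)
grid.fit(X, y)

# best K and best number of features found by the inner CV
print(grid.best_params_)  # e.g. {'kbest__k': ..., 'regressor__n_neighbors': ...}

# indices of the columns kept by SelectKBest in the best pipeline
kbest = grid.best_estimator_.named_steps['kbest']
print(kbest.get_support(indices=True))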
Any help is greatly appreciated!