Let's say that I want to compare different dimensionality reduction approaches for a particular (supervised) dataset that consists of n > 2 features via cross-validation and by using the Pipeline class.
For example, if I want to experiment with PCA vs LDA I could do something like:
from sklearn.model_selection import cross_val_score, KFold
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import StandardScaler
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.decomposition import PCA
clf_all = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('classification', GaussianNB())
])

clf_pca = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('reduce_dim', PCA(n_components=2)),
    ('classification', GaussianNB())
])

clf_lda = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('reduce_dim', LDA(n_components=2)),
    ('classification', GaussianNB())
])
# Constructing the k-fold cross-validation iterator (k=10)
cv = KFold(n_splits=10,  # number of folds the dataset is divided into
           shuffle=True,
           random_state=123)
scores = [
    cross_val_score(clf, X_train, y_train, cv=cv, scoring='accuracy')
    for clf in [clf_all, clf_pca, clf_lda]
]
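(For context, I then compare the approaches via their mean fold accuracies, roughly along these lines; the labels are just mine, for illustration, and this assumes the scores list from above:)

# Summarize the 10 fold accuracies per approach as mean +/- std
for label, s in zip(['all features', 'PCA', 'LDA'], scores):
    print('%s: %.3f +/- %.3f' % (label, s.mean(), s.std()))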
But now, let's say that -- based on some "domain knowledge" -- I have the hypothesis that features 3 & 4 might be "good features" (the third and fourth columns of the array X_train), and I want to compare them with the other approaches. How would I include such a manual feature selection in the pipeline?
For example,

def select_3_and_4(X_train):
    return X_train[:, 2:4]
clf_all = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('feature_select', select_3_and_4),
    ('classification', GaussianNB())
])
would obviously not work, since Pipeline expects each intermediate step to be an estimator object that implements both fit and transform.
So I assume I have to create a feature selection class with a dummy fit method and a transform method that returns the two columns of the NumPy array? Or is there a better way?
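To make the question concrete, here is a minimal sketch of the kind of class I have in mind (the name ColumnSelector and the hard-coded column indices are mine, purely for illustration):

from sklearn.base import BaseEstimator, TransformerMixin

class ColumnSelector(BaseEstimator, TransformerMixin):
    """Selects a fixed set of columns from a NumPy array."""
    def __init__(self, cols):
        self.cols = cols

    def fit(self, X, y=None):
        # Nothing to learn; present only to satisfy the Pipeline API.
        return self

    def transform(self, X):
        return X[:, self.cols]

clf_manual = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('feature_select', ColumnSelector(cols=[2, 3])),
    ('classification', GaussianNB())
])

Deriving from BaseEstimator and TransformerMixin is just my guess at the idiomatic base classes here.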