The easiest way to get feature names after running SelectKBest in scikit-learn
I'm trying to conduct a supervised machine-learning experiment using the SelectKBest class from scikit-learn, but I'm not sure how to create a new dataframe after finding the best features:

Let's assume I would like to select the 5 best features:

from sklearn.feature_selection import SelectKBest, f_classif

select_k_best_classifier = SelectKBest(score_func=f_classif, k=5).fit_transform(features_dataframe, targeted_class)

Now, if I add the line:

import pandas as pd

dataframe = pd.DataFrame(select_k_best_classifier)

I receive a new dataframe without feature names (only column indices 0 to 4), but I want to create a dataframe with the newly selected features, like this:

dataframe = pd.DataFrame(fit_transformed_features, columns=features_names)

My question is how to create the features_names list?

I know that I should use:

 select_k_best_classifier.get_support()

which returns a boolean array in which the True entries mark the columns that should be kept from the original dataframe.

How should I combine this boolean array with the array of all feature names, which I can get via feature_names = list(features_dataframe.columns.values)?

Fifine answered 3/10, 2016 at 19:35 Comment(0)

You can do the following:

mask = select_k_best_classifier.get_support()  # list of booleans
new_features = []  # the list of your K best features

for selected, feature in zip(mask, feature_names):
    if selected:
        new_features.append(feature)

Then build the new dataframe with those feature names:

dataframe = pd.DataFrame(fit_transformed_features, columns=new_features)
Nichrome answered 4/10, 2016 at 8:46 Comment(1)
note that .get_support() must be called on the fitted SelectKBest instance (an sklearn.feature_selection.SelectKBest object), not on the NumPy array returned by SelectKBest(score_func=f_classif, k=5).fit_transform(X, Y)Fornax
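As a minimal runnable sketch of the pattern this comment describes (the data and column names f0..f7 are invented for illustration): keep a reference to the fitted selector object, and call get_support() on it rather than on the array that fit_transform returns:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic example data; the column names f0..f7 are made up for this sketch
X_arr, y = make_classification(n_samples=100, n_features=8, random_state=0)
features_dataframe = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(8)])

selector = SelectKBest(score_func=f_classif, k=5)            # the estimator object
transformed = selector.fit_transform(features_dataframe, y)  # a NumPy array

mask = selector.get_support()                   # valid: called on the fitted estimator
new_features = list(features_dataframe.columns[mask])        # the 5 surviving names
dataframe = pd.DataFrame(transformed, columns=new_features)
```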

This doesn't require loops.

# Create and fit selector
selector = SelectKBest(f_classif, k=5)
selector.fit(features_df, target)
# Get columns to keep and create new dataframe with those only
cols_idxs = selector.get_support(indices=True)
features_df_new = features_df.iloc[:,cols_idxs]
Ambary answered 3/5, 2017 at 16:14 Comment(1)
a little correction: use .iloc for positional indexing, i.e. features_df_new = features_df.iloc[:, cols_idxs]Dramamine

For me this code works fine and is more 'pythonic':

mask = select_k_best_classifier.get_support()
new_features = features_dataframe.columns[mask]
Loam answered 12/9, 2017 at 10:35 Comment(0)

The following code will help you find the top K features along with their F-scores. Let X be the pandas dataframe whose columns are all the features, and y the list of class labels.

import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif
#Suppose we select the 5 features with the highest F-scores
selector = SelectKBest(f_classif, k = 5)
#New dataframe with the selected features for later use in the classifier.
#fit() alone works too, if you only want the feature names and their scores
X_new = selector.fit_transform(X, y)
names = X.columns.values[selector.get_support()]
scores = selector.scores_[selector.get_support()]
names_scores = list(zip(names, scores))
ns_df = pd.DataFrame(data = names_scores, columns=['Feat_names', 'F_Scores'])
#Sort the dataframe for better visualization
ns_df_sorted = ns_df.sort_values(['F_Scores', 'Feat_names'], ascending = [False, True])
print(ns_df_sorted)
Viper answered 29/7, 2017 at 7:3 Comment(0)

Select the best 10 features according to chi2:

from sklearn.feature_selection import SelectKBest, chi2

KBest = SelectKBest(chi2, k=10).fit(X, y) 

Get the selected feature indices with get_support():

f = KBest.get_support(indices=True)  # indices of the most important features

Create a new dataframe called X_new:

X_new = X[X.columns[f]]  # final features
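Note that chi2 requires non-negative feature values (e.g. term counts), so a runnable sketch of the steps above needs count-like data; the data and column names w0..w14 here are invented for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

# Invented count data: 100 samples, 15 non-negative integer features
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.integers(0, 10, size=(100, 15)),
                 columns=[f"w{i}" for i in range(15)])
y = rng.integers(0, 2, size=100)

KBest = SelectKBest(chi2, k=10).fit(X, y)
f = KBest.get_support(indices=True)   # integer indices of the kept columns
X_new = X[X.columns[f]]               # dataframe with the 10 selected features
```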
Ambrogino answered 24/11, 2020 at 15:38 Comment(0)

As of scikit-learn 1.0, transformers have the get_feature_names_out method, which means you can write

dataframe = pd.DataFrame(fit_transformed_features, columns=transformer.get_feature_names_out())
Badly answered 11/3, 2022 at 16:34 Comment(0)

There is another alternative method which, however, is not as fast as the solutions above.

# Use the selector to retrieve the best features
X_new = select_k_best_classifier.fit_transform(train[feature_cols], train['is_attributed'])

# Get back the kept features as a DataFrame, with dropped columns as all 0s
selected_features = pd.DataFrame(select_k_best_classifier.inverse_transform(X_new),
                                 index=train.index,
                                 columns=feature_cols)
selected_columns = selected_features.columns[selected_features.var() != 0]
Sanitarium answered 2/4, 2020 at 15:41 Comment(0)
# Fit the SelectKBest instance
select_k_best_classifier = SelectKBest(score_func=f_classif, k=5).fit(features_dataframe, targeted_class)

# Extract the required features
new_features = select_k_best_classifier.get_feature_names_out(features_names)
Noleta answered 30/1, 2022 at 2:26 Comment(0)

Suppose you want to choose the 10 best features:

import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

selector = SelectKBest(score_func=chi2, k=10)
selector.fit_transform(X, y)
features_names = selector.get_feature_names_out()
print(features_names)
Parisparish answered 7/1, 2023 at 20:14 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.