Retain feature names after Scikit Feature Selection
Running scikit-learn's VarianceThreshold on a data set removes a couple of features. I feel like I'm doing something simple yet stupid, but I'd like to retain the names of the remaining features. The following code:

from sklearn.feature_selection import VarianceThreshold
import pandas as pd

def VarianceThreshold_selector(data):
    selector = VarianceThreshold(.5)  # drop features with variance below 0.5
    selector.fit(data)
    return pd.DataFrame(selector.transform(data))
x = VarianceThreshold_selector(data)
print(x)

changes the following data (this is just a small subset of the rows):

Survived    Pclass  Sex Age SibSp   Parch   Nonsense
0             3      1  22   1        0        0
1             1      2  38   1        0        0
1             3      2  26   0        0        0

into this (again just a small subset of the rows)

     0         1      2     3
0    3      22.0      1     0
1    1      38.0      1     0
2    3      26.0      0     0

Using the get_support method, I know that these are Pclass, Age, SibSp, and Parch, so I'd rather this return something more like:

     Pclass         Age      SibSp     Parch
0        3          22.0         1         0
1        1          38.0         1         0
2        3          26.0         0         0

Is there an easy way to do this? I'm very new with Scikit Learn, so I'm probably just doing something silly.

Kr answered 2/10, 2016 at 0:56 Comment(1)
Scikit-learn itself doesn't support pandas data types with named columns and the like, so any time you use something like the .transform method of a scikit-learn object, you're going to lose all that information. If you can track it separately (i.e., retrieve the column names as you describe), you can pass it back to specify the new column names after recreating a new DataFrame.Klatt
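A minimal sketch of that round trip, using a toy frame along the lines of the question's data: transform to a plain array, then reattach the names recovered from get_support.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

df = pd.DataFrame({
    "Survived": [0, 1, 1],
    "Pclass":   [3, 1, 3],
    "Age":      [22, 38, 26],
})

selector = VarianceThreshold(0.5)
transformed = selector.fit_transform(df)  # plain NumPy array; column names are lost here

# get_support() returns a boolean mask over the original columns
kept = df.columns[selector.get_support()]

# Rebuild a DataFrame with the surviving names and the original index
result = pd.DataFrame(transformed, columns=kept, index=df.index)
```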
Would something like this help? If you pass it a pandas DataFrame, it uses get_support, as you mentioned, to index into the column list and pull out only the column headers that met the variance threshold.

>>> df
   Survived  Pclass  Sex  Age  SibSp  Parch  Nonsense
0         0       3    1   22      1      0         0
1         1       1    2   38      1      0         0
2         1       3    2   26      0      0         0

>>> from sklearn.feature_selection import VarianceThreshold
>>> def variance_threshold_selector(data, threshold=0.5):
...     selector = VarianceThreshold(threshold)
...     selector.fit(data)
...     return data[data.columns[selector.get_support(indices=True)]]

>>> variance_threshold_selector(df, 0.5)
   Pclass  Age
0       3   22
1       1   38
2       3   26
>>> variance_threshold_selector(df, 0.9)
   Age
0   22
1   38
2   26
>>> variance_threshold_selector(df, 0.1)
   Survived  Pclass  Sex  Age  SibSp
0         0       3    1   22      1
1         1       1    2   38      1
2         1       3    2   26      0
Tephrite answered 2/10, 2016 at 2:30 Comment(2)
could you please edit your answer? selector.get_support(indices=True) returns an array of indices. Thus, this line: labels = [columns[x] for x in selector.get_support(indices=True) if x] has a latent bug where column 0 will be skippedMedusa
That looks correct! The columns variable is no longer used, but is irrelevantMedusa

I came here looking for a way to get transform() or fit_transform() to return a data frame, but I suspect it's not supported.

However, you can subset the data a bit more cleanly like this:

data_transformed = data.loc[:, selector.get_support()]
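In context, with a small frame along the lines of the question's data, the one-liner looks like this (boolean-mask indexing keeps the original dtypes and column names):

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

data = pd.DataFrame({
    "Pclass": [3, 1, 3],
    "Sex":    [1, 2, 2],
    "Age":    [22.0, 38.0, 26.0],
})

selector = VarianceThreshold(0.5)
selector.fit(data)

# get_support() is a boolean mask, so .loc keeps the matching named columns
data_transformed = data.loc[:, selector.get_support()]
```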
Formalize answered 8/12, 2016 at 13:56 Comment(1)
This should be higher up, elegant and simple solution to the question.Choleric

There are probably better ways to do this, but for those interested, here's how I did it:

from sklearn.feature_selection import VarianceThreshold
import pandas as pd

def VarianceThreshold_selector(data):

    # Select model
    selector = VarianceThreshold(0)  # Defaults to 0.0, i.e. only remove features with the same value in all samples

    # Fit the model
    selector.fit(data)
    features = selector.get_support(indices=True)  # array of integer indices of the retained features
    features = [column for column in data.columns[features]]  # names of the retained features

    # Format and return
    selector = pd.DataFrame(selector.transform(data))
    selector.columns = features
    return selector
Kr answered 2/10, 2016 at 2:28 Comment(3)
We had basically the same idea with the exception of transform vs using fit_transform. Glad you figured it out.Tephrite
I'm a Python noob, but would it also be correct to do features = data.columns.values[selector.get_support(indices = True)]? I had trouble getting your approach to work with my data.Emrich
Add columns to parse the columns : features = [column for column in df_train.columns[features]]Tongs

As I had some problems with the function by Jarad, I combined it with the solution by pteehan, which I found more reliable. I also added NA replacement by default, since VarianceThreshold cannot handle NA values.

from sklearn.feature_selection import VarianceThreshold

def variance_threshold_select(df, thresh=0.0, na_replacement=-999):
    df1 = df.copy(deep=True)  # make a deep copy of the dataframe
    selector = VarianceThreshold(thresh)
    selector.fit(df1.fillna(na_replacement))  # fill NA values, as VarianceThreshold cannot deal with them
    df2 = df.loc[:, selector.get_support(indices=False)]  # keep only the columns whose variance exceeds the threshold
    return df2
Gilead answered 24/4, 2017 at 12:5 Comment(0)

How about this as code?

import statistics

low_var_cols = []
for col in df.columns:
    if statistics.variance(df[col]) <= 0.1:
        low_var_cols.append(col)

then drop the columns from the dataframe?
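The dropping step could look like the sketch below, on a toy frame along the lines of the question's data. One caveat worth noting: statistics.variance computes the sample variance (ddof=1), whereas VarianceThreshold uses the population variance (ddof=0), so the two can disagree near the threshold.

```python
import statistics
import pandas as pd

df = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Nonsense": [0, 0, 0, 0],
    "Age":      [22, 38, 26, 35],
})

# Collect columns whose sample variance is at or below the cutoff
low_var_cols = [col for col in df.columns
                if statistics.variance(df[col].tolist()) <= 0.1]

# Drop them from the frame; the surviving column names are preserved
df_reduced = df.drop(columns=low_var_cols)
```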

Mendicant answered 28/3, 2021 at 17:55 Comment(0)

You can use Pandas for thresholding too

data_new = data.loc[:, data.std(axis=0) > 0.75]
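A caveat on that one-liner: thresholding the standard deviation at 0.75 corresponds to a variance threshold of about 0.56, and pandas defaults to the sample statistic (ddof=1) while VarianceThreshold uses the population variance (ddof=0). To match VarianceThreshold(0.5) exactly, a pure-pandas sketch on a toy frame would be:

```python
import pandas as pd

data = pd.DataFrame({
    "Pclass": [3, 1, 3],
    "Sex":    [1, 2, 2],
    "Age":    [22.0, 38.0, 26.0],
})

# DataFrame.var defaults to ddof=1; pass ddof=0 to mirror VarianceThreshold
data_new = data.loc[:, data.var(axis=0, ddof=0) > 0.5]
```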
Liriodendron answered 27/3, 2020 at 15:4 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.