Removing features with low variance using scikit-learn

scikit-learn provides various methods for removing descriptors; a basic method for this purpose is given in the tutorial below:

http://scikit-learn.org/stable/modules/feature_selection.html

However, the tutorial does not provide any way to keep a list of the features that were either removed or kept.

The code below has been taken from the tutorial.

    from sklearn.feature_selection import VarianceThreshold
    X = [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 1, 1]]
    sel = VarianceThreshold(threshold=(.8 * (1 - .8)))
    sel.fit_transform(X)
    array([[0, 1],
           [1, 0],
           [0, 0],
           [1, 1],
           [1, 0],
           [1, 1]])

The example code above keeps only two descriptors, with shape (6, 2), but in my case I have a huge data frame with a shape of (51 rows, 9000 columns). After finding a suitable model, I want to keep track of the useful and useless features, because I can save computation time on the test data set by calculating only the useful features.

For example, when you perform machine learning modeling with WEKA 6.0, it provides remarkable flexibility over feature selection, and after removing the useless features you get a list of the discarded features along with the useful ones.

Thanks.

Equally answered 27/3, 2015 at 10:52 Comment(3)
Sklearn works differently than WEKA. In this case, instead of giving you a list of the best features, sklearn directly returns a new array with the best features. Do you really need the list? I guess the list could be computed with a workaround, but is it really needed?Nationalism
@iluengo As per my understanding (I am not very experienced in ML, but an enthusiastic learner), the training and test sets should have the same number of features with the same indexing; otherwise, in WEKA's case, it raises an error. If the test set is derived internally from a data split, I will always have the same features and the same indexing, but if we use an external test set, or an unknown data set on which predictions are to be made, without knowing the feature names, how could we prepare the unknown data?Equally
Yep, you got that right. I was thinking only of the training set, ahahNationalism

Then, what you can do, if I'm not wrong, is:

In the case of VarianceThreshold, you can call the method fit instead of fit_transform. This will fit the data, and the resulting variances will be stored in vt.variances_ (assuming vt is your object).

With a threshold, you can extract the same features that fit_transform would keep:

    X[:, vt.variances_ > threshold]

Or get the indices (with numpy imported as np) as:

    idx = np.where(vt.variances_ > threshold)[0]

Or as a mask:

    mask = vt.variances_ > threshold

PS: the default threshold is 0.
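
Putting those pieces together, a minimal runnable sketch (reusing the toy X from the question) could look like this:

    import numpy as np
    from sklearn.feature_selection import VarianceThreshold

    X = np.array([[0, 0, 1], [0, 1, 0], [1, 0, 0],
                  [0, 1, 1], [0, 1, 0], [0, 1, 1]])

    threshold = .8 * (1 - .8)
    vt = VarianceThreshold(threshold=threshold)
    vt.fit(X)                                     # fit only, no transform

    print(vt.variances_)                          # per-feature variances
    mask = vt.variances_ > threshold              # boolean mask of kept features
    idx = np.where(vt.variances_ > threshold)[0]  # integer indices of kept features
    X_kept = X[:, mask]                           # same columns fit_transform keeps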

EDIT:

A more straightforward way to do this is to use the get_support method of the VarianceThreshold class. From the documentation:

    get_support([indices])  Get a mask, or integer index, of the features selected

You should call this method after fit or fit_transform.
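
For example, continuing with a fitted vt as above:

    mask = vt.get_support()             # boolean mask of kept features
    idx = vt.get_support(indices=True)  # integer indices of kept features

The same mask (or vt.transform) can then be reused on a test set so that it keeps exactly the same columns as the training set.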

Nationalism answered 27/3, 2015 at 13:10 Comment(2)
After fitting, the filtered data frame can be obtained using: df.loc[:, sel.get_support()] where df is a pandas data frame and sel is a VarianceThreshold.Altercation
@arun: I think your solution is actually the best. Thanks.Cydnus

import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Just make a convenience function; this one wraps the VarianceThreshold
# transformer but you can pass it a pandas dataframe and get one in return

def get_low_variance_columns(dframe=None, skip_columns=None,
                             thresh=0.0, autoremove=False):
    """
    Wrapper for sklearn VarianceThreshold for use on pandas dataframes.
    """
    print("Finding low-variance features.")
    try:
        # get list of all the original df columns
        all_columns = dframe.columns

        # remove `skip_columns`
        remaining_columns = all_columns.drop(skip_columns)

        # get length of new index
        max_index = len(remaining_columns) - 1

        # get indices for `skip_columns`
        skipped_idx = [all_columns.get_loc(column)
                       for column
                       in skip_columns]

        # adjust insert location by the number of columns removed
        # (for non-zero insertion locations) to keep relative
        # locations intact
        for idx, item in enumerate(skipped_idx):
            if item > max_index:
                diff = item - max_index
                skipped_idx[idx] -= diff
            if item == max_index:
                diff = item - len(skip_columns)
                skipped_idx[idx] -= diff
            if idx == 0:
                skipped_idx[idx] = item

        # get values of `skip_columns`
        skipped_values = dframe.iloc[:, skipped_idx].values

        # get dataframe values
        X = dframe.loc[:, remaining_columns].values

        # instantiate VarianceThreshold object
        vt = VarianceThreshold(threshold=thresh)

        # fit vt to data
        vt.fit(X)

        # get the indices of the features that are being kept
        feature_indices = vt.get_support(indices=True)

        # remove low-variance columns from index
        feature_names = [remaining_columns[idx]
                         for idx in feature_indices]

        # get the columns to be removed
        removed_features = list(np.setdiff1d(remaining_columns,
                                             feature_names))
        print("Found {0} low-variance columns."
              .format(len(removed_features)))

        # remove the columns
        if autoremove:
            print("Removing low-variance features.")
            # remove the low-variance columns
            X_removed = vt.transform(X)

            print("Reassembling the dataframe (with low-variance "
                  "features removed).")
            # re-assemble the dataframe
            dframe = pd.DataFrame(data=X_removed,
                                  columns=feature_names)

            # add back the `skip_columns`
            for idx, index in enumerate(skipped_idx):
                dframe.insert(loc=index,
                              column=skip_columns[idx],
                              value=skipped_values[:, idx])
            print("Succesfully removed low-variance columns.")

        # do not remove columns
        else:
            print("No changes have been made to the dataframe.")

    except Exception as e:
        print(e)
        print("Could not remove low-variance features. Something "
              "went wrong.")

    return dframe, removed_features
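
A hypothetical usage sketch (the dataframe and column names below are made up for illustration):

    import pandas as pd

    # toy data: `id` should be skipped, `const` has zero variance
    df = pd.DataFrame({"id": [1, 2, 3, 4],
                       "const": [7, 7, 7, 7],
                       "x": [0.1, 0.9, 0.4, 0.6]})

    cleaned, removed = get_low_variance_columns(dframe=df,
                                                skip_columns=["id"],
                                                thresh=0.0,
                                                autoremove=True)
    print(removed)  # ['const']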
Prettypretty answered 18/1, 2016 at 8:55 Comment(3)
Very helpful method. I also found it useful to set the initial value of skip_columns to an empty list [] instead of None, because None will throw an exception if I am not going to skip any columns.Slate
@Slate correct, but then you could just use the standard sklearn.feature_selection.VarianceThreshold with the underlying numpy array instead of the pandas.DataFrame. :)Prettypretty
@JasonWolosonovich When I'm trying the above method, I'm getting "UnboundLocalError: local variable 'removed_features' referenced before assignment". Any fix?Alceste

This worked for me. If you want to see exactly which columns remain after thresholding, you can use this method:

    from sklearn.feature_selection import VarianceThreshold

    # `data` is assumed to be a pandas DataFrame
    threshold_n = 0.95
    sel = VarianceThreshold(threshold=(threshold_n * (1 - threshold_n)))
    sel_var = sel.fit_transform(data)
    data[data.columns[sel.get_support(indices=True)]]
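
For instance, with a hypothetical boolean dataframe named data, the last line keeps only columns b and c, since the constant column a falls below the 0.95 * (1 - 0.95) variance threshold:

    import pandas as pd

    # made-up example data
    data = pd.DataFrame({"a": [0, 0, 0, 0],
                         "b": [0, 1, 0, 1],
                         "c": [1, 1, 1, 0]})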
Cordelier answered 5/9, 2019 at 8:47 Comment(0)

When testing features, I wrote this simple function that tells me which variables remain in the data frame after VarianceThreshold is applied.

from sklearn.feature_selection import VarianceThreshold
from itertools import compress

def fs_variance(df, threshold:float=0.1):
    """
    Return a list of selected variables based on the threshold.
    """

    # The list of columns in the data frame
    features = list(df.columns)
    
    # Initialize and fit the method
    vt = VarianceThreshold(threshold = threshold)
    _ = vt.fit(df)
    
    # Get the column names which pass the threshold
    feat_select = list(compress(features, vt.get_support()))
    
    return feat_select

which returns a list of the selected column names, for example: ['col_2', 'col_14', 'col_17'].
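
A hypothetical call (with a made-up two-column dataframe) might look like this:

    import pandas as pd

    # made-up data: `col_1` is constant; `col_2` has variance 0.125 > 0.1
    df = pd.DataFrame({"col_1": [1, 1, 1, 1],
                       "col_2": [0.2, 0.9, 0.1, 0.8]})

    fs_variance(df, threshold=0.1)  # ['col_2']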

Gordongordy answered 19/4, 2021 at 16:3 Comment(0)
