After I run scikit-learn's VarianceThreshold on a set of data, it removes a couple of features. I feel I'm doing something simple yet stupid, but I'd like to retain the names of the remaining features. The following code:
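import pandas as pd
from sklearn.feature_selection import VarianceThreshold

def VarianceThreshold_selector(data):
    selector = VarianceThreshold(.5)
    selector.fit(data)
    # transform() returns a plain NumPy array, so the column names are lost here
    selector = (pd.DataFrame(selector.transform(data)))
    return selector

x = VarianceThreshold_selector(data)
print(x)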
changes the following data (this is just a small subset of the rows):
Survived  Pclass  Sex  Age  SibSp  Parch  Nonsense
       0       3    1   22      1      0         0
       1       1    2   38      1      0         0
       1       3    2   26      0      0         0
into this (again just a small subset of the rows):
   0     1  2  3
0  3  22.0  1  0
1  1  38.0  1  0
2  3  26.0  0  0
Using the get_support method, I know that these are Pclass, Age, SibSp, and Parch, so I'd rather this return something more like:
   Pclass   Age  SibSp  Parch
0       3  22.0      1      0
1       1  38.0      1      0
2       3  26.0      0      0
Is there an easy way to do this? I'm very new to scikit-learn, so I'm probably just doing something silly.
scikit-learn does not natively work with pandas data types with named columns and the like, so any time you use something like the .transform method of a scikit object, you're going to lose all that information. If you can track it separately (i.e., retrieve the column names as you describe), you can pass it back in to specify the new column names after recreating a new DataFrame. – Klatt
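
A minimal sketch of that idea: instead of rebuilding the DataFrame from the array returned by transform, use the boolean mask from get_support to select the surviving columns from the original DataFrame, which keeps their names. The function name variance_threshold_selector, its threshold parameter, and the three-row toy frame (copied from the sample rows in the question) are illustrative assumptions, not part of the original code.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold


def variance_threshold_selector(data, threshold=0.5):
    """Return only the columns of `data` whose variance exceeds `threshold`.

    Hypothetical helper: it selects columns from the original DataFrame,
    so the column names (and dtypes) are preserved.
    """
    selector = VarianceThreshold(threshold)
    selector.fit(data)
    # get_support() returns a boolean mask with one entry per input column
    mask = selector.get_support()
    return data.loc[:, mask]


# Toy data: only the three sample rows shown in the question
data = pd.DataFrame({
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 3],
    "Sex": [1, 2, 2],
    "Age": [22, 38, 26],
    "SibSp": [1, 1, 0],
    "Parch": [0, 0, 0],
    "Nonsense": [0, 0, 0],
})

selected = variance_threshold_selector(data)
print(selected)
```

With only these three rows, fewer columns clear the 0.5 threshold than in the full data set (here just Pclass and Age), but the surviving columns keep their original names. If you do need the transformed array, the same mask can supply the names, e.g. pd.DataFrame(selector.transform(data), columns=data.columns[mask]).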