Feature selection using scikit-learn
I'm new to machine learning. I'm preparing my data for classification using a scikit-learn SVM. To select the best features I have used the following method:

SelectKBest(chi2, k=10).fit_transform(A1, A2)

Since my dataset contains negative values, I get the following error:

ValueError                                Traceback (most recent call last)

/media/5804B87404B856AA/TFM_UC3M/test2_v.py in <module>()
----> 1 SelectKBest(chi2, k=10).fit_transform(A1, A2)

/usr/local/lib/python2.6/dist-packages/sklearn/base.pyc in fit_transform(self, X, y, **fit_params)
    427         else:
    428             # fit method of arity 2 (supervised transformation)

--> 429             return self.fit(X, y, **fit_params).transform(X)
    430 
    431 

/usr/local/lib/python2.6/dist-packages/sklearn/feature_selection/univariate_selection.pyc in fit(self, X, y)
    300         self._check_params(X, y)
    301 
--> 302         self.scores_, self.pvalues_ = self.score_func(X, y)
    303         self.scores_ = np.asarray(self.scores_)
    304         self.pvalues_ = np.asarray(self.pvalues_)

/usr/local/lib/python2.6/dist-packages/sklearn/feature_selection/univariate_selection.pyc in chi2(X, y)
    190     X = atleast2d_or_csr(X)
    191     if np.any((X.data if issparse(X) else X) < 0):
--> 192         raise ValueError("Input X must be non-negative.")
    193 
    194     Y = LabelBinarizer().fit_transform(y)

ValueError: Input X must be non-negative.

Can someone tell me how I can transform my data?

Consolation answered 11/9, 2014 at 15:53 Comment(3)
You could normalise the values to between 0 and 1 or take absolute values, perhaps. – Hyetology
If your data is not non-negative, maybe chi2 is not a good method. You can use f_score. What is the nature of your data? – Outguard
Thank you EdChum and Andreas. My data consists of the min, max, mean, median and FFT of an accelerometer signal. – Consolation
The error message Input X must be non-negative says it all: Pearson's chi-squared test (goodness of fit) does not apply to negative values. That is logical, because the chi-squared test operates on frequency distributions, and a frequency can't be a negative number. Consequently, sklearn.feature_selection.chi2 asserts that the input is non-negative.

You say that your features are "min, max, mean, median and FFT of accelerometer signal". In many cases it is quite safe to simply shift each feature so that all its values are positive, or even to normalize it to the [0, 1] interval, as suggested by EdChum.
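A minimal sketch of both options follows (assuming X is your feature matrix as a NumPy array and y your class labels; the variable names are placeholders):

import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import MinMaxScaler

# Option 1: shift each feature by its own minimum so that all values are >= 0
X_shifted = X - np.min(X, axis=0)

# Option 2: rescale each feature to the [0, 1] interval
X_scaled = MinMaxScaler().fit_transform(X)

# chi2 accepts either transformed version
X_new = SelectKBest(chi2, k=10).fit_transform(X_scaled, y)
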

If a data transformation is for some reason not possible (e.g. the negative values carry important information), you should pick another statistic to score your features, for example sklearn.feature_selection.f_classif (ANOVA F-value) or sklearn.feature_selection.mutual_info_classif (mutual information), both of which accept negative inputs.

Since the whole point of this procedure is to prepare the features for another method, it's not a big deal which one you pick; the end result is usually the same or very close.
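
For instance, a minimal sketch of scoring the untransformed (possibly negative) features with f_classif instead of chi2 (again, X and y stand in for your data):

from sklearn.feature_selection import SelectKBest, f_classif

# f_classif (ANOVA F-value) does not require non-negative input,
# so the original feature values can be used directly
selector = SelectKBest(f_classif, k=10)
X_new = selector.fit_transform(X, y)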

Palish answered 6/10, 2017 at 14:37 Comment(4)
Just use sklearn.preprocessing.MinMaxScaler().fit_transform(YOUR_TRAINING_FEATURES_HERE) with the default values to scale your training features to be from 0 to 1. – Ptolemaeus
"it's not a big deal to pick anyone" – just wanted to check that I'm reading you correctly here: do you mean that it's not a big deal to choose any of f_classif, mutual_info_classif, or SelectKBest? – Woodpecker
@Ptolemaeus I am using that right now, but I have the same error: ...scaler = MinMaxScaler(); df1[self.num_features] = scaler.fit_transform(df1[self.num_features]); return df1 – Afrikaans
@Palish I am encountering a similar error, but I filtered my dataframe to include only columns with positive values and I still get the same error. Can you help me please? #71338663 – Monopteros
As others have mentioned, to get around the error you can scale the data to be between 0 and 1, select features from the scaled data, and use them to train your model.

import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import MinMaxScaler
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(random_state=0)
topk = 5

# scale the data to be between 0 and 1
sc = MinMaxScaler()
X_sc = sc.fit_transform(X)

# select from the scaled data
skb = SelectKBest(chi2, k=topk)
X_sc_selected = skb.fit_transform(X_sc, y)

# build model using (X_sc_selected, y)
lr = LogisticRegression(random_state=0)
lr.fit(X_sc_selected, y)

lr.score(X_sc_selected, y)  # 0.87

If the original data is important to keep (you want to retain the negative values), you can also select features using the top-k scores from SelectKBest, i.e. instead of calling transform on the data, slice its columns.

# fit feature selector with the scaled data
skb = SelectKBest(chi2, k=topk)
skb.fit(X_sc, y)

# column index of top-k features
cols = np.sort(skb.scores_.argsort()[-topk:])
# index the top-k features from X
X_selected = X[:, cols]

# build model using (X_selected, y)
lr = LogisticRegression(random_state=0)
lr.fit(X_selected, y)

lr.score(X_selected, y)  # 0.92

Note that skb.transform() amounts to the same column indexing. For example, (X_sc[:, cols] == X_sc_selected).all() returns True.
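An equivalent way to obtain those column indices, using the skb fitted above, is SelectKBest's get_support method:

# indices of the selected columns, in ascending order
cols = skb.get_support(indices=True)
X_selected = X[:, cols]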

Circumferential answered 18/5, 2023 at 0:1 Comment(0)
