Using chi2 test for feature selection with continuous features (Scikit Learn)

I am trying to predict a binary (categorical) target from many continuous features, and would like to narrow my feature space before heading into model fitting. I noticed that the SelectKBest class from SKLearn's Feature Selection package has the following example on the Iris dataset (which also predicts a categorical target from continuous features):

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
iris = load_iris()
X, y = iris.data, iris.target
X.shape
(150, 4)
X_new = SelectKBest(chi2, k=2).fit_transform(X, y)
X_new.shape
(150, 2)

The example uses the chi2 test to determine which features should be used in the model. However, it is my understanding that the chi2 test is strictly meant for situations where we have categorical features predicting a categorical target. I did not think the chi2 test could be used for scenarios like this. Is my understanding wrong? Can the chi2 test be used to test whether a categorical variable is dependent on a continuous variable?

Silicious answered 15/4, 2018 at 22:44 Comment(0)

The SelectKBest function with the chi2 test only works with categorical data. In fact, the result of the test will only have real meaning if the features contain just 1's and 0's.

If you inspect the implementation of chi2 a little, you will see that the code only applies a sum over the samples of each feature, which means that the function expects just binary values. The parameters that the chi2 function receives also indicate the following:

def chi2(X, y):
...

X : {array-like, sparse matrix}, shape = (n_samples, n_features_in)
    Sample vectors.
y : array-like, shape = (n_samples,)
    Target vector (class labels).

This means that the function expects to receive the feature vectors with all of their samples. But later, when the expected values are calculated, you will see:

feature_count = X.sum(axis=0).reshape(1, -1)
class_prob = Y.mean(axis=0).reshape(1, -1)
expected = np.dot(class_prob.T, feature_count)

And these lines of code only make sense if the X and Y vectors contain only 1's and 0's.
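To make this concrete, here is a rough sketch (assuming dense NumPy arrays; this is not sklearn's exact code, and chi2_sketch is just an illustrative name) of what chi2 effectively computes, including the per-class observed sums that pair with the expected values quoted above:

import numpy as np
from scipy.stats import chi2 as chi2_dist
from sklearn.preprocessing import LabelBinarizer

def chi2_sketch(X, y):
    # Binarize the target: one 0/1 column per class (add the complement for a binary y).
    Y = LabelBinarizer().fit_transform(y)
    if Y.shape[1] == 1:
        Y = np.hstack([1 - Y, Y])
    # "Observed" table: per-class sum of each feature -- a true count only if X is 0/1.
    observed = Y.T @ X                                # shape (n_classes, n_features)
    # The lines quoted above: expected = outer product of class probabilities and feature sums.
    feature_count = X.sum(axis=0).reshape(1, -1)
    class_prob = Y.mean(axis=0).reshape(1, -1)
    expected = class_prob.T @ feature_count
    # Per-feature chi-square statistic with (n_classes - 1) degrees of freedom.
    stat = ((observed - expected) ** 2 / expected).sum(axis=0)
    pval = chi2_dist.sf(stat, df=Y.shape[1] - 1)
    return stat, pval

With 0/1 features, observed and feature_count are genuine counts and the statistic keeps its usual interpretation; with arbitrary continuous values they are just sums.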

Ascribe answered 21/3, 2019 at 2:47 Comment(1)
But this is indeed confusing from the official doc - scikit-learn.org/stable/modules/…Tussock

I agree with @lalfab; however, it's not clear to me why sklearn provides an example of using chi2 on the iris dataset, which has all continuous variables. https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html

>>> from sklearn.datasets import load_digits
>>> from sklearn.feature_selection import SelectKBest, chi2
>>> X, y = load_digits(return_X_y=True)
>>> X.shape
(1797, 64)
>>> X_new = SelectKBest(chi2, k=20).fit_transform(X, y)
>>> X_new.shape
(1797, 20)
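
For what it's worth, the implementation itself only rejects negative values, so it runs on purely continuous data such as iris without complaint. A small sketch of that behaviour (the variable names are just illustrative):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import chi2

X, y = load_iris(return_X_y=True)
scores, pvalues = chi2(X, y)      # runs fine: all iris measurements are non-negative
print(scores, pvalues)

X_centered = X - X.mean(axis=0)   # centering introduces negative values
try:
    chi2(X_centered, y)
except ValueError as err:
    print(err)                    # chi2 refuses negative input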
Baroda answered 5/5, 2019 at 22:31 Comment(0)

My understanding of this is that when using chi2 for feature selection, the dependent variable has to be categorical, but the independent variables can be either categorical or continuous, as long as they are non-negative. What the algorithm tries to do is first build a contingency table in matrix form that reveals the multivariate frequency distribution of the variables, and then find the dependence structure underlying the variables from this contingency table. Chi2 is one way to measure that dependency.

From the Wikipedia article on contingency tables (https://en.wikipedia.org/wiki/Contingency_table, retrieved 2020-07-04):

Standard contents of a contingency table

  • Multiple columns (historically, they were designed to use up all the white space of a printed page). Where each row refers to a specific sub-group in the population (in this case men or women), the columns are sometimes referred to as banner points or cuts (and the rows are sometimes referred to as stubs).
  • Significance tests. Typically, either column comparisons, which test for differences between columns and display these results using letters, or, cell comparisons, which use color or arrows to identify a cell in a table that stands out in some way.
  • Nets or netts which are sub-totals.
  • One or more of: percentages, row percentages, column percentages, indexes or averages.
  • Unweighted sample sizes (counts).

Based on this, pure binary features can easily be summed up as counts, which is how the chi2 test is usually conducted. But as long as the features are non-negative, one can always accumulate them into the contingency table in a "meaningful" way. In the sklearn implementation, the features are summed up as feature_count = X.sum(axis=0) and then combined with class_prob (the per-class means of the binarized target) to form the expected values.
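
As an illustration of the case where those sums genuinely are counts, here is a small sketch (with a made-up categorical feature; the names are just for illustration) that one-hot encodes it before calling chi2:

import numpy as np
from sklearn.feature_selection import chi2
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)
# Hypothetical categorical feature and a target that depends on it.
color = rng.choice(["red", "green", "blue"], size=200).reshape(-1, 1)
y = (color.ravel() == "red").astype(int)

# One-hot encoding yields 0/1 columns, so the X.sum(axis=0) inside chi2 is a
# real per-category count (chi2 also accepts the sparse matrix returned here).
X_onehot = OneHotEncoder().fit_transform(color)
scores, pvalues = chi2(X_onehot, y)
print(scores)
print(pvalues)   # the "red" column should come out as highly significant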

Powerful answered 4/7, 2020 at 13:30 Comment(1)
It sounds like you're saying the continuous variable is binned to create a category which then has a frequency count like you'd calculate for a histogram. If this is the case, couldn't one calculate a Chi^2 test statistic and p-value by binning two continuous variables? Of course you'd be losing information by binning and would be better off using Pearson's correlations, but is there any way to quantify what's being lost by the binning process?Verisimilar

In my understanding, you cannot use chi-square (chi2) for continuous variables. The chi2 calculation requires building a contingency table, where you count the occurrences of each category of the variables of interest. As the cells in that RC table correspond to particular categories, I cannot see how such a table could be built from continuous variables without significant preprocessing.

So, the iris example which you quote, in my view, is an example of incorrect usage.

But there are more problems with the existing implementation of chi2 feature reduction in Scikit-learn. First, as @lalfab wrote, the implementation requires binary features, but the documentation is not clear about this. This led to a common perception in the community that SelectKBest could be used for categorical features, while in fact it cannot. Second, the Scikit-learn implementation fails to enforce the chi2 condition (80% of the cells in the RC table need to have an expected count >= 5), which leads to incorrect results if some categorical features have many possible values. All in all, in my view this method should be used neither for continuous nor for categorical features (except binary ones). I wrote more about this below:

Here is the Scikit-learn bug report #21455, and here are the article and the alternative implementation:
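
As a side note, the textbook chi-square test of one categorical feature against a categorical target is run on a proper R x C contingency table, for example with scipy.stats.chi2_contingency. A small sketch with made-up data:

import numpy as np
from scipy.stats import chi2_contingency

# Made-up categorical feature and target.
feature = np.array(["a", "a", "b", "b", "c", "c", "a", "b", "c", "a"])
target = np.array([0, 1, 0, 0, 1, 1, 0, 0, 1, 1])

# Build the R x C contingency table of co-occurrence counts.
categories, classes = np.unique(feature), np.unique(target)
table = np.array([[np.sum((feature == c) & (target == k)) for k in classes]
                  for c in categories])

stat, p, dof, expected = chi2_contingency(table)
print(stat, p, dof)
print(expected)   # rule of thumb: most expected cell counts should be >= 5;
                  # with this tiny sample they all fall below 5, so the test would be
                  # considered unreliable -- a check sklearn's chi2 never performs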

Polychrome answered 29/11, 2021 at 17:18 Comment(1)
Good catch @Data Man. Another problem with sklearn's chi2 implementation is that it can output sparse data, as pointed out here. I'm gonna check your references.Jemina
