Find important features for classification
I'm trying to classify some EEG data using a logistic regression model (this seems to give the best classification of my data). The data come from a multichannel EEG setup, so in essence I have a 63 x 116 x 50 matrix (that is, channels x time points x number of trials; there are two trial types of 50). I have reshaped this into one long feature vector per trial.
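For concreteness, a minimal sketch of that reshaping step (all names and the random stand-in data are illustrative, not the actual recording):

```python
import numpy as np

# Hypothetical stand-in for the EEG data described above:
# 63 channels x 116 time points x 50 trials.
n_channels, n_times, n_trials = 63, 116, 50
X = np.random.randn(n_channels, n_times, n_trials)
# Illustrative binary trial labels; the actual split between the
# two trial types may differ.
y = np.array([0] * 25 + [1] * 25)

# Put trials on the first axis, then flatten each trial into a
# 63 * 116 = 7308-dimensional feature vector for the classifier.
X_flat = X.transpose(2, 0, 1).reshape(n_trials, n_channels * n_times)
print(X_flat.shape)  # (50, 7308)
```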

What I would like to do, after the classification, is to see which features were the most useful in classifying the trials. How can I do that, and is it possible to test the significance of these features? E.g. to say that the classification was driven mainly by N features, and these are features x to z. I could then say, for instance, that channel 10 at time points 90-95 was significant or important for the classification.

So is this possible or am I asking the wrong question?

Any comments or paper references are much appreciated.

Monocular answered 3/4, 2013 at 19:26

Scikit-learn includes quite a few methods for feature ranking, among them:

- univariate feature selection (e.g. SelectKBest with an ANOVA F-test score)
- recursive feature elimination (RFE, RFECV)
- L1-based feature selection with sparse linear models
- tree-based feature importances
- randomized sparse models / stability selection (Randomized Logistic Regression)

(see more at http://scikit-learn.org/stable/modules/feature_selection.html)
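As a minimal sketch of the univariate option (names follow the reshaping sketch in the question; the data and the choice of k are illustrative):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

# Stand-in data in the flattened (trials x features) layout.
n_channels, n_times, n_trials = 63, 116, 50
X_flat = np.random.randn(n_trials, n_channels * n_times)
y = np.array([0] * 25 + [1] * 25)

# Rank all features with a univariate F-test and keep the 100
# highest-scoring ones (k = 100 is an arbitrary choice here).
selector = SelectKBest(score_func=f_classif, k=100)
X_sel = selector.fit_transform(X_flat, y)

# Map the selected flat indices back to (channel, time point),
# which answers "channel 10 at time point 90" style questions.
idx = selector.get_support(indices=True)
channels, times = np.unravel_index(idx, (n_channels, n_times))
```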

Among those, I definitely recommend giving Randomized Logistic Regression a shot. In my experience, it consistently outperforms the other methods and is very stable. The underlying paper is Meinshausen & Bühlmann, "Stability Selection": http://arxiv.org/pdf/0809.2932v2.pdf
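A minimal sketch of how that looks (note the deprecation mentioned in the comments below: this class only exists in scikit-learn versions before 0.21; data and parameter values are illustrative):

```python
import numpy as np
from sklearn.linear_model import RandomizedLogisticRegression

# Stand-in data in the flattened (trials x features) layout.
n_channels, n_times, n_trials = 63, 116, 50
X_flat = np.random.randn(n_trials, n_channels * n_times)
y = np.array([0] * 25 + [1] * 25)

# Stability selection: fit an L1-penalized logistic regression on
# many random subsamples and record how often each feature survives.
rlr = RandomizedLogisticRegression(n_resampling=200,
                                   selection_threshold=0.25)
rlr.fit(X_flat, y)

# rlr.scores_ holds per-feature selection frequencies; features that
# survive in a large fraction of resamples are the stable ones.
stable = np.argsort(rlr.scores_)[::-1][:20]
channels, times = np.unravel_index(stable, (n_channels, n_times))
```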

Edit: I have written a series of blog posts on different feature selection methods and their pros and cons, which are probably useful for answering this question in more detail:

Language answered 3/4, 2013 at 22:05. Comments (7):
The non-randomized L1-penalized models are also nice (i.e. L1-penalized logistic regression and LinearSVC). I don't have much experience with the randomized versions yet. - Uncompromising
Seconding @AndreasMueller's suggestion: an L1-penalty SVM is a surprisingly good feature selection algorithm for some tasks (tasks that look nothing like EEG reading, so YMMV). The document classification example does this; see L1LinearSVC there. A minimal L1 sketch follows these comments. - Guizot
In my experience, the case where the non-randomized methods can fail is when you have strongly multicollinear features: some features can be among the top ones on one subset of the data while being regularized out on another subset. - Language
You're right. I just think it is worth a shot. It won't do worse than univariate ;) - Uncompromising
@snarly The document classification example has been moved to scikit-learn.org/stable/auto_examples/text/… - Liles
RandomizedLogisticRegression is being deprecated :( "The class RandomizedLogisticRegression is deprecated in 0.19 and will be removed in 0.21." - Pilarpilaster
I think this post could be improved/updated by also considering the eli5 library. Here is a post with examples in a similar discussion; it mentions both eli5 and treeinterpreter, as in this answer. - Pence
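Following the L1-penalty suggestions in the comments above, a minimal sketch of the non-randomized sparse approach with a plain L1-penalized logistic regression (works in current scikit-learn; data and parameter values are illustrative, not tuned):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-in data in the flattened (trials x features) layout.
n_channels, n_times, n_trials = 63, 116, 50
X_flat = np.random.randn(n_trials, n_channels * n_times)
y = np.array([0] * 25 + [1] * 25)

# The L1 penalty drives most coefficients to exactly zero; the
# surviving features are the selected ones. C controls sparsity.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
clf.fit(X_flat, y)

# Map the nonzero coefficients back to (channel, time point).
nonzero = np.flatnonzero(clf.coef_[0])
channels, times = np.unravel_index(nonzero, (n_channels, n_times))
print(len(nonzero), "features selected")
```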
