How to find the importance of the features for a logistic regression model?
I have a binary prediction model trained with the logistic regression algorithm. I want to know which features (predictors) are more important for the decision between the positive and negative classes. I know there is a coef_ attribute in the scikit-learn package, but I don't know whether it is sufficient as a measure of importance. Another question is how to evaluate the coef_ values in terms of importance for the negative and positive classes. I have also read about standardized regression coefficients, but I don't know what they are.

Let's say there are features like the size of a tumor, the weight of a tumor, etc., used to decide whether a test case is malignant or not. I want to know which of the features are more important for the malignant / not-malignant prediction.

Dimorphous answered 2/12, 2015 at 20:11 Comment(2)
Can you perhaps include an example to make things more concrete? – Echinoid
Let's say there are features like the size of a tumor, the weight of a tumor, etc., used to decide whether a test case is malignant or not. I want to know which of the features are more important for the malignant / not-malignant prediction. Does that make sense? – Dimorphous
One of the simplest ways to get a feeling for the "influence" of a given parameter in a linear classification model (logistic regression being one of them) is to consider the magnitude of its coefficient times the standard deviation of the corresponding parameter in the data.

Consider this example:

import numpy as np    
from sklearn.linear_model import LogisticRegression

x1 = np.random.randn(100)
x2 = 4*np.random.randn(100)
x3 = 0.5*np.random.randn(100)
y = (3 + x1 + x2 + x3 + 0.2*np.random.randn(100)) > 0
X = np.column_stack([x1, x2, x3])

m = LogisticRegression()
m.fit(X, y)

# The estimated coefficients will all be around 1:
print(m.coef_)

# Those values, however, will show that the second parameter
# is more influential
print(np.std(X, 0)*m.coef_)

An alternative way to get a similar result is to examine the coefficients of the model fit on standardized parameters:

m.fit(X / np.std(X, 0), y)
print(m.coef_)
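The same standardization can also be done with StandardScaler from sklearn.preprocessing. A minimal sketch (the data below is regenerated along the lines of the example above, with a fixed seed, so the exact numbers are only illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3)) * [1, 4, 0.5]  # feature stds: 1, 4, 0.5
y = (3 + X.sum(axis=1) + 0.2 * rng.standard_normal(100)) > 0

# StandardScaler centers each column and scales it to unit variance;
# centering only shifts the intercept, so the coefficients remain
# directly comparable across features.
m = LogisticRegression().fit(StandardScaler().fit_transform(X), y)
print(m.coef_)  # the second coefficient should be the largest in magnitude
```

In a larger workflow the same effect is obtained with make_pipeline(StandardScaler(), LogisticRegression()), which keeps the scaling and the model together.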

Note that this is the most basic approach; a number of other techniques for finding feature importance or parameter influence exist (p-values, bootstrap scores, various "discriminative indices", etc.).
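To illustrate the bootstrap idea mentioned above, one simple (and by no means the only) approach is to refit the model on resampled versions of the data and look at how much each coefficient varies. A sketch, using the same synthetic setup as before:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils import resample

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3)) * [1, 4, 0.5]
y = (3 + X.sum(axis=1) + 0.2 * rng.standard_normal(100)) > 0

# Refit on bootstrap resamples of (X, y) and collect the coefficients.
coefs = np.array([
    LogisticRegression().fit(*resample(X, y, random_state=seed)).coef_.ravel()
    for seed in range(200)
])

# Mean and standard deviation of each coefficient over the resamples:
print(coefs.mean(axis=0))
print(coefs.std(axis=0))
```

A coefficient whose bootstrap distribution stays well away from zero relative to its spread is a more trustworthy signal than a single point estimate.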

I am pretty sure you would get more interesting answers at https://stats.stackexchange.com/.

Partridge answered 2/12, 2015 at 20:52 Comment(6)
Thank you for the explanation. One more thing: what does a negative value of m.coef_ mean? Does it mean it is more discriminative for the decision of the negative class? Same question for positive values, too. – Dimorphous
A negative coefficient means that a higher value of the corresponding feature pushes the classification more towards the negative class. – Partridge
Note that this approach may be misleading: the coefficient size does not always reflect feature importance. As a counterexample, think of this: x1 = np.random.randn(100), x2 = x1 + 0.00001*np.random.randn(100), x3 = np.random.randn(100), y = 100*x1 - 100*x2 + x3. (A more correct approach is to turn some features on and off and compare predictive powers.) – Colly
@PeterFranek Let us see how your counterexample works out in practice: pastebin.com/NXPxtPwc Note how the resulting model is "smart" enough to estimate smaller coefficients for the correlated features, thus correctly concluding that it is the third feature which is the more important one. Try coming up with a working counterexample ;) – Partridge
And, more generally, note that "how to understand the importance of features in an (already fitted) model of type X" and "how to understand the most influential features in the data in general" are different questions. Depending on your fitting process you may end up with different models for the same data: some features may be deemed more important by one model, and others by another. The features important "within a model" are only important "in the data in general" when your model was estimated in a somewhat "valid" way in the first place. – Partridge
In particular, if the most important feature in your data has a nonlinear dependency on the output, most linear models will not discover this, no matter how you tease them. Hence, it is good to remember the difference between modeling and model interpretation. – Partridge
Since version 0.22, scikit-learn provides a sklearn.inspection module which implements permutation_importance. It can be used to find the most important features: a higher value indicates higher "importance", i.e. that the corresponding feature contributes a larger fraction of whatever metric was used to evaluate the model (the default for LogisticRegression is accuracy).

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.inspection import permutation_importance

# initialize sample (using the same setup as in KT.'s)
X = np.random.standard_normal((100,3)) * [1, 4, 0.5]
y = (3 + X.sum(axis=1) + 0.2*np.random.standard_normal(100)) > 0

# fit a model
model = LogisticRegression().fit(X, y)
# compute importances
model_fi = permutation_importance(model, X, y)
model_fi['importances_mean']                    # array([0.07 , 0.352, 0.02 ])

So in the example above, the most important feature is the second feature, followed by the first and the third. This is the same ordinal ranking as the one suggested in KT.'s post.

One nice thing about permutation_importance is that both training and test datasets may be passed to it to identify which features might cause the model to overfit.
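A sketch of that idea (the train/test split and seeds here are hypothetical, chosen only to make the example reproducible):

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 3)) * [1, 4, 0.5]
y = (3 + X.sum(axis=1) + 0.2 * rng.standard_normal(500)) > 0

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = LogisticRegression().fit(X_tr, y_tr)

# Importances on the training set vs. the held-out set; a feature that
# looks important only on the training data hints at overfitting.
imp_tr = permutation_importance(model, X_tr, y_tr, random_state=0)
imp_te = permutation_importance(model, X_te, y_te, random_state=0)
print(imp_tr['importances_mean'])
print(imp_te['importances_mean'])
```

For a well-generalizing model the two arrays should be broadly similar; a feature with high training importance and near-zero test importance is the one to be suspicious of.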


You can read more about it in the documentation, where you can also find an outline of the algorithm.

Robustious answered 6/3, 2023 at 23:2 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.