Interpreting logistic regression feature coefficient values in sklearn
I have fit a logistic regression model to my data. Imagine I have four features: 1) which condition the participant received, 2) whether the participant had any prior knowledge/background about the phenomenon tested (binary response in a post-experimental questionnaire), 3) time spent on the experimental task, and 4) participant age. I am trying to predict whether participants ultimately chose option A or option B. My logistic regression outputs the following feature coefficients with clf.coef_:

[[-0.68120795 -0.19073737 -2.50511774  0.14956844]]

If option A is my positive class, does this output mean that feature 3 is the most important feature for binary classification and has a negative relationship with participants choosing option A (note: I have not normalized/re-scaled my data)? I want to ensure that my understanding of the coefficients, and the information I can extract from them, is correct so I don't make any generalizations or false assumptions in my analysis.
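For context, here is a minimal sketch of the kind of setup described above, with synthetic stand-in data for the four features (the feature values and the 0/1 coding of the A/B choice are hypothetical). One detail worth noting: sklearn's coef_ always describes the log-odds of clf.classes_[1], so "option A is my positive class" only holds if A is encoded as the larger label.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 200
X = np.column_stack([
    rng.integers(0, 2, n),        # feature 1: condition (binary)
    rng.integers(0, 2, n),        # feature 2: prior knowledge (binary)
    rng.uniform(30, 600, n),      # feature 3: time on task (seconds)
    rng.integers(18, 66, n),      # feature 4: age (years)
])
y = rng.integers(0, 2, n)         # 1 = chose option A (hypothetical coding)

clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.classes_)   # coef_ refers to the log-odds of classes_[1]
print(clf.coef_)      # shape (1, 4): one coefficient per feature
```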

Thanks for your help!

Alit answered 24/6, 2018 at 1:7 Comment(4)
Your understanding seems correct. To be sure, you could submit a sample to the classifier and get the result, then multiply each value in the sample by the respective coefficient and check that the two give the same result.Leicester
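The check suggested in this comment can be done directly: the model's raw score for a sample is the dot product of the features with coef_ plus intercept_, and predict_proba is just the logistic function of that score. A small sketch on synthetic data (the dataset here is illustrative, not the OP's):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=100, n_features=4, random_state=0)
clf = LogisticRegression().fit(X, y)

# Manual log-odds per sample: x · coef + intercept
manual = X @ clf.coef_[0] + clf.intercept_[0]
assert np.allclose(manual, clf.decision_function(X))

# predict_proba applies the logistic function to the same score
probs = 1 / (1 + np.exp(-manual))
assert np.allclose(probs, clf.predict_proba(X)[:, 1])
```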
No, it is not correct. Since the values are not normalized, if typical values of feature 1 are an order of magnitude higher than those of feature 3, feature 1 will contribute more to the classification and thus appear more important. Even with normalization, direct interpretation of the coefficients is kind of sketchy; a much better approach would be to use statistical tests.Messina
Hi @Messina, good to know. I totally understand the scaling/normalization comment, and I performed min-max scaling on my data so that it is on a 0-1 scale. However, could you expand on the statistical-tests part of your comment? What tests are you referring to? Furthermore, what does the existing .coef_ attribute convey, if not feature importance/effect size?Alit
I'm not a statistician, so it'll take some time to write a reasonable response; I'll try to answer by tomorrow. Also, I messed up the explanation of normalization above; I will fix it in the answer.Messina
You are on the right track. If all features are on a very similar scale, a coefficient with a larger absolute value (positive or negative) means a larger effect, all other things being equal.

However, if your data isn't normalized, Messina is correct that the magnitudes of the coefficients don't mean anything without context. For instance, you could get different coefficients simply by changing a feature's units of measure to be larger or smaller.

I can't tell whether you've included a non-zero intercept here, but keep in mind that logistic regression coefficients are log-odds ratios: you need to exponentiate them to get odds ratios, and transform further to get probabilities, for something more directly interpretable.

Check out this page for a good explanation: https://stats.idre.ucla.edu/other/mult-pkg/faq/general/faq-how-do-i-interpret-odds-ratios-in-logistic-regression/
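Concretely, exponentiating a coefficient gives the multiplicative change in the odds of the positive class per one-unit increase in that feature, holding the others fixed. Using the coefficients from the question:

```python
import numpy as np

# Coefficients from the question (condition, prior knowledge, time, age)
coefs = np.array([-0.68120795, -0.19073737, -2.50511774, 0.14956844])
odds_ratios = np.exp(coefs)
print(odds_ratios)
# e.g. exp(0.1496) is about 1.16: each extra unit of feature 4 multiplies
# the odds of the positive class by about 1.16, other features held fixed.
```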

Lonnylonslesaunier answered 4/9, 2018 at 19:7 Comment(1)
Hi, so how can you transform these odds ratios into probabilities using sklearn?Violaviolable
Logistic regression returns information in log odds. So you must first convert log odds to odds using np.exp and then take odds/(1 + odds).

To convert to probabilities, use a list comprehension and do the following:

[np.exp(x) / (1 + np.exp(x)) for x in clf.coef_[0]]  # coef_ has shape (1, n_features)

This page had an explanation in R for converting log odds that I referenced: https://sebastiansauer.github.io/convert_logit2prob/
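As a side note, the same transformation is available vectorized via scipy.special.expit (scipy's numerically stable logistic function, exp(x) / (1 + exp(x))), which avoids the Python-level loop:

```python
import numpy as np
from scipy.special import expit

# Coefficients from the question, standing in for clf.coef_[0]
coefs = np.array([-0.68120795, -0.19073737, -2.50511774, 0.14956844])
print(expit(coefs))  # same values as the list comprehension above
```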

Flotilla answered 3/4, 2021 at 4:0 Comment(0)
