Logistic Regression: How to find the top three features that have the highest weights?

Asked 23/4, 2017 at 21:20 Answered 13/8, 2022 at 16:48

Solved python machine-learning scikit-learn logistic-regression feature-selection

I am working on the UCI breast cancer dataset and trying to find the top 3 features that have the highest weights. I was able to find the weights of all features using logmodel.coef_, but how can I get the feature names? Below are my code, output, and dataset (which is imported from scikit-learn).

from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target, random_state=42)

logmodel = LogisticRegression(C=1.0).fit(X_train, y_train)
logmodel.coef_[0]

The code above outputs an array of weights. Using these weights, how can I get the associated feature names?

Output:
    array([  1.90876683e+00,   9.98788148e-02,  -7.65567571e-02,
             1.30875965e-03,  -1.36948317e-01,  -3.86693503e-01,
            -5.71948682e-01,  -2.83323656e-01,  -2.23813863e-01,
            -3.50526844e-02,   3.04455316e-03,   1.25223693e+00,
             9.49523571e-02,  -9.63789785e-02,  -1.32044174e-02,
            -2.43125981e-02,  -5.86034313e-02,  -3.35199227e-02,
            -4.10795998e-02,   1.53205924e-03,   1.24707244e+00,
            -3.19709151e-01,  -9.61881472e-02,  -2.66335879e-02,
            -2.44041661e-01,  -1.24420873e+00,  -1.58319440e+00,
            -5.78354663e-01,  -6.80060645e-01,  -1.30760323e-01])

Thanks. I would really appreciate any help on this.

Lowminded answered 23/4, 2017 at 21:20 Comment(0)

This will do the job:

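import numpy as np

# Weights of the single fitted class (coef_ has shape (1, n_features) here)
coefs = logmodel.coef_[0]
# Indices of the three largest weights, in no particular order
top_three = np.argpartition(coefs, -3)[-3:]
print(cancer.feature_names[top_three])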

This prints

['worst radius' 'texture error' 'mean radius']

Note that these features are the top three, but they are not necessarily sorted among themselves. If you want them to be sorted, you can do:

import numpy as np

coefs = logmodel.coef_[0]
top_three = np.argpartition(coefs, -3)[-3:]
# Order the three indices by weight, smallest to largest
top_three_sorted = top_three[np.argsort(coefs[top_three])]
print(cancer.feature_names[top_three_sorted])
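
To see how the two calls relate, here is a minimal sketch on a toy array. The coefficient values are hypothetical and chosen only for illustration; they are not taken from the fitted model above.

```python
import numpy as np

# Toy coefficients (hypothetical values, not from the fitted model).
coefs = np.array([0.5, -2.0, 1.9, 0.1, 1.25, -0.3])

# Indices of the three largest values; their order within the three is unspecified.
top_three = np.argpartition(coefs, -3)[-3:]

# A full sort yields the same set of indices, ordered from smallest to largest value.
top_three_sorted = np.argsort(coefs)[-3:]

print(sorted(top_three.tolist()))  # compare as a set, since the order may vary
print(top_three_sorted.tolist())
```

Both approaches select the same three indices; only argsort guarantees their relative order.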

Holbrooke answered 23/4, 2017 at 21:27 Comment(6)

Thanks a lot. Can you please tell me what np.argpartition(coefs, -3) does? – Lowminded 23/4, 2017 at 22:18

The function np.argpartition(coefs, -k) returns an array of indices whose first n-k entries point to the smallest n-k elements of coefs and whose last k entries point to the largest k elements, where n = len(coefs). Because it does not perform a full sort, it is more efficient than sorting the whole array (note that passing -3 to the function is the same as passing len(coefs)-3). If you don't need the efficiency, you could also replace that line with top_three = np.argsort(coefs)[-3:] – Holbrooke 23/4, 2017 at 22:25

I was wondering whether I need to sort at all. top_three = np.argpartition(coefs, -3)[-3:] gives me the three most heavily weighted features, right? So why do I need the sorting step top_three_sorted=top_three[np.argsort(coefs[top_three])]? Will that not change the result? – Lowminded 23/4, 2017 at 22:44

Based on your question, you don't need the extra sorting, and top_three = np.argpartition(coefs, -3)[-3:] will do the job. When I first read the question, I wasn't sure whether you just needed the top three or wanted them ordered like [third largest, second largest, largest], so I answered both scenarios – Holbrooke 23/4, 2017 at 22:46

Thank you for helping on this – Lowminded 23/4, 2017 at 23:3

Note that this will print them in ascending order (least important to most important), which is not so intuitive, at least to me. So I simply reversed the list top_three_sorted. – Aloe 21/5, 2021 at 15:35
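
For the descending order described in the comment above, one possible sketch is to reverse the argsort result before slicing. The feature names and coefficients here are hypothetical placeholders, not the breast cancer data.

```python
import numpy as np

# Hypothetical feature names and coefficients, for illustration only.
feature_names = np.array(["a", "b", "c", "d", "e"])
coefs = np.array([0.2, 1.5, -3.0, 0.9, 1.1])

# argsort is ascending, so reverse it and keep the first three indices.
top_three_desc = np.argsort(coefs)[::-1][:3]

# Print the features from largest to third-largest weight.
for name, weight in zip(feature_names[top_three_desc], coefs[top_three_desc]):
    print(name, weight)
```

This ranks by signed value, as the answer does, so the large negative weight on "c" is not selected; ranking by magnitude would sort np.abs(coefs) instead.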

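# Assumes `data` (apparently a pandas DataFrame), `difference`, and `wei`
# are already defined; the answer does not show them.
n = len(data.columns) - 1

# Percentage change of each weight
percentage_change = []
for i in range(n):
    cp = (difference[i] / wei[i]) * 100
    percentage_change.append(cp)

columns = list(data.columns.values)

# Indices of the three largest percentage changes, in ascending order
indices = sorted(range(len(percentage_change)), key=lambda i: percentage_change[i])[-3:]

print("the top 3 features which have higher % change in weights ")
for j in indices:
    print(columns[j])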

Lumbar answered 13/8, 2022 at 16:48 Comment(1)

Welcome to Stack Overflow. Code is a lot more helpful when it is accompanied by an explanation. Stack Overflow is about learning, not providing snippets to blindly copy and paste. Please edit your answer and explain how it answers the specific question being asked. See How to Answer. – Essive 16/8, 2022 at 20:45
