Scikit-Learn Linear Regression: how to get each coefficient's respective feature?
D

8

32

I'm trying to perform feature selection by evaluating my regression's coefficient outputs and selecting the features with the highest-magnitude coefficients. The problem is, I don't know how to get the respective features, as only coefficients are returned from the coef_ attribute. The documentation says:

Estimated coefficients for the linear regression problem. If multiple targets are passed during the fit (y 2D), this is a 2D array of shape (n_targets, n_features), while if only one target is passed, this is a 1D array of length n_features.

I am passing A and B into regression.fit(A, B), where A is a 2-D array holding the tf-idf value of each feature in each document. Example format:

         "feature1"   "feature2"
"Doc1"    .44          .22
"Doc2"    .11          .6
"Doc3"    .22          .2

B holds my target values for the data, which are just numbers from 1 to 100 associated with each document:

"Doc1"    50
"Doc2"    11
"Doc3"    99

Using regression.coef_, I get a list of coefficients, but not their corresponding features! How can I get the features? I'm guessing I need to modify the structure of my B targets, but I don't know how.

Dustproof answered 15/11, 2014 at 23:14 Comment(0)
V
34

What I found to work was:

Here X is the DataFrame of your independent variables and logistic is the fitted model:

coefficients = pd.concat([pd.DataFrame(X.columns),pd.DataFrame(np.transpose(logistic.coef_))], axis = 1)

The assumption you stated, that the order of regression.coef_ is the same as the column order of the TRAIN set, holds true in my experience. (It works with the underlying data and also checks out against the correlations between X and y.)
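
A minimal sketch of checking that ordering yourself, assuming X is a pandas DataFrame. The toy data, the column names and the weights 2, -5 and 0 are made up for illustration; because y is built from known weights, each fitted coefficient should land next to the column it belongs to.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy data (illustrative only): y is built from known per-column weights.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(50, 3)),
                 columns=["feature1", "feature2", "feature3"])
y = 2.0 * X["feature1"] - 5.0 * X["feature2"] + 0.0 * X["feature3"]

regression = LinearRegression().fit(X, y)

# coef_ holds one entry per column of X, in column order,
# so the column names can be attached directly as the index.
coefficients = pd.Series(regression.coef_, index=X.columns, name="coefficient")
print(coefficients)
```

If the recovered values sit next to the right names (about 2 for feature1, about -5 for feature2, about 0 for feature3), the positional mapping holds for your setup as well.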

Venireman answered 29/4, 2017 at 19:41 Comment(1)
I think you can just do pd.DataFrame(zip(X.columns, logistic.coef_)) – Gramineous
S
15

You can do that by creating a data frame:

cdf = pd.DataFrame(regression.coef_, X.columns, columns=['Coefficients'])
print(cdf)
Schmitz answered 3/1, 2019 at 17:24 Comment(1)
regression.coef_ is not always 1-D (it is a 2-D array when y is 2-D), so in that case do this instead: cdf = pd.concat([pd.DataFrame(X.columns),pd.DataFrame(np.transpose(regression.coef_))], axis = 1) – Bratislava
N
10
coefficients = pd.DataFrame({"Feature":X.columns,"Coefficients":np.transpose(logistic.coef_)})
Nonunionism answered 9/6, 2017 at 9:8 Comment(2)
This does not work for me. Exception: Data must be 1-dimensional – Kinghood
@Kinghood try coefficients = pd.DataFrame({"Feature":X.columns,"Coefficients":np.transpose(logistic.coef_[0, :])}) – Aleksandr
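
A short sketch of where the "Data must be 1-dimensional" error most likely comes from, assuming a binary LogisticRegression: there coef_ has shape (1, n_features), so its transpose is still 2-D and cannot be used as a single DataFrame column. The data and column names below are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy binary classification data (illustrative only).
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(40, 2)), columns=["feature1", "feature2"])
y = (X["feature1"] > 0).astype(int)

logistic = LogisticRegression().fit(X, y)
print(logistic.coef_.shape)  # (1, 2): one row, one column per feature

# Flatten the single row to 1-D before using it as a DataFrame column.
coefficients = pd.DataFrame({
    "Feature": X.columns,
    "Coefficients": logistic.coef_.ravel(),
})
print(coefficients)
```

Indexing with logistic.coef_[0] gives the same 1-D result as ravel() in the binary case.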
Y
9

I suppose you are working on a feature selection task. regression.coef_ already returns the coefficients in the same order as the features, i.e. regression.coef_[0] corresponds to "feature1" and regression.coef_[1] corresponds to "feature2". This should be what you want.

I would also recommend the tree models from sklearn, which can be used for feature selection as well. To be specific, check out here.
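
A rough sketch of the tree-based alternative, under the assumption that the target is numeric (as in the question), so ExtraTreesRegressor is used here; the specific estimator, data and column names are illustrative choices, not something taken from the linked page. Like coef_, feature_importances_ has one entry per input column, in column order.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesRegressor

# Toy data (illustrative only): only feature1 actually drives the target.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 3)),
                 columns=["feature1", "feature2", "feature3"])
y = 10.0 * X["feature1"] + rng.normal(scale=0.1, size=200)

model = ExtraTreesRegressor(n_estimators=50, random_state=0).fit(X, y)

# Attach the column names and rank the features by importance.
importances = pd.Series(model.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

Unlike regression coefficients, importances are non-negative, so they rank features but do not show the direction of an effect.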

Yl answered 15/11, 2014 at 23:31 Comment(4)
This is true as long as regression.coef_ returns coefficient values in the same order. Thanks. – Dustproof
The ExtraTreesClassifier is actually very interesting, but it seems there is no way to retrieve the actual features which it picked after the model has been fit? – Dustproof
@Dustproof Yes, but I always select features by clf.feature_importances_ to retrieve the importance ranking of the features. Intuitively it is just like the coefficients of the linear model, isn't it? – Yl
Well, if you use a feature extraction method like CountVectorizer(), it has a method get_feature_names(). Then you can map get_feature_names() to .coef_ (I think they are in order, I'm not sure). However, you cannot do this with the tree. – Dustproof
D
6

Coefficients and features in zip

print(list(zip(X_train.columns.tolist(),logreg.coef_[0])))

Coefficients and features in DataFrame

pd.DataFrame({"Feature":X_train.columns.tolist(),"Coefficients":logreg.coef_[0]})


Dactylo answered 25/4, 2020 at 13:22 Comment(0)
D
3

This is the easiest and most intuitive way:

pd.DataFrame(logisticRegr.coef_, columns=x_train.columns)

or the same but transposing index and columns

pd.DataFrame(logisticRegr.coef_, columns=x_train.columns).T
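
A small sketch of the same idea for a multiclass model, where coef_ has one row per class. The iris dataset and max_iter=1000 are assumptions chosen only to make the example self-contained; labelling the rows with classes_ makes it clear which row belongs to which class.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Bundled example dataset (assumption: stands in for your own x_train/y_train).
iris = load_iris(as_frame=True)
x_train, y_train = iris.data, iris.target

logisticRegr = LogisticRegression(max_iter=1000).fit(x_train, y_train)

# One row per class, one column per feature.
table = pd.DataFrame(logisticRegr.coef_,
                     columns=x_train.columns,
                     index=logisticRegr.classes_)
print(table)
```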
Dishevel answered 29/12, 2021 at 13:49 Comment(0)
A
1

Suppose your training data X is the DataFrame 'df_X'; then you can zip the columns and coefficients into a dictionary and feed it into a pandas DataFrame to get the mapping:

pd.DataFrame(dict(zip(df_X.columns,model.coef_[0])),index=[0]).T
Ashurbanipal answered 20/9, 2018 at 3:13 Comment(0)
D
0

Try putting them in a Series with the data's column names as the index:

coeffs = pd.Series(model.coef_[0], index=X.columns.values)
coeffs.sort_values(ascending = False)
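
Since the question asks for the highest-magnitude coefficients, a possible variation is to sort by absolute value instead, so that large negative coefficients are not pushed to the bottom. The values below are hypothetical stand-ins for model.coef_[0] and X.columns.

```python
import pandas as pd

# Hypothetical coefficients standing in for model.coef_[0] and X.columns.
coeffs = pd.Series([0.5, -3.0, 1.2],
                   index=["feature1", "feature2", "feature3"])

# Order by absolute value, but keep the original signs in the result.
order = coeffs.abs().sort_values(ascending=False).index
by_magnitude = coeffs.reindex(order)
print(by_magnitude)

# Names of the two features with the largest-magnitude coefficients.
top_features = by_magnitude.index[:2].tolist()
print(top_features)
```

Keep in mind that coefficient magnitudes are only comparable across features when the features are on similar scales.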
Dearth answered 18/8, 2020 at 12:16 Comment(0)
