D

8

32

I'm trying to perform feature selection by evaluating my regression's coefficient outputs and selecting the features with the highest-magnitude coefficients. The problem is, I don't know how to get the respective features, as only coefficients are returned from the coef_ attribute. The documentation says:

Estimated coefficients for the linear regression problem. If multiple targets are passed during the fit (y 2D), this is a 2D array of shape (n_targets, n_features), while if only one target is passed, this is a 1D array of length n_features.

I am calling regression.fit(A, B), where A is a 2-D array holding the TF-IDF value of each feature in each document. Example format:

         "feature1"   "feature2"
"Doc1"    .44          .22
"Doc2"    .11          .6
"Doc3"    .22          .2

B are my target values for the data, which are just numbers 1-100 associated with each document:

"Doc1"    50
"Doc2"    11
"Doc3"    99

Using regression.coef_, I get a list of coefficients, but not their corresponding features! How can I get the features? I'm guessing I need to modify the structure of my B targets, but I don't know how.

Dustproof answered 15/11, 2014 at 23:14 Comment(0)

V

34

What I found to work was:

X = your independent variables

coefficients = pd.concat([pd.DataFrame(X.columns),pd.DataFrame(np.transpose(logistic.coef_))], axis = 1)

The assumption you stated, that the order of regression.coef_ is the same as the column order in the training set, holds true in my experience. (It works with the underlying data and also checks out against the correlations between X and y.)
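One way to check the ordering claim yourself is to fit on synthetic data where the true coefficients are known. This is only a sketch: the column names, the coefficient values, and the noise-free target are all made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic data with a known, distinct true coefficient per column
# (names and values are illustrative assumptions, not from the question).
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 3)),
                 columns=["feature1", "feature2", "feature3"])
true_coefs = {"feature1": 5.0, "feature2": -2.0, "feature3": 0.5}
y = sum(X[name] * value for name, value in true_coefs.items())

regression = LinearRegression().fit(X, y)

# Pair each column name with the coefficient at the same position.
coefficients = pd.Series(regression.coef_, index=X.columns)
print(coefficients)
```

If the recovered values line up with the known ones column by column, the positional pairing is doing what you expect.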

Venireman answered 29/4, 2017 at 19:41 Comment(1)

I think you can just do pd.DataFrame(zip(X.columns, logistic.coef_)) – Gramineous 6/9, 2017 at 14:26

S

15

You can do that by creating a data frame:

cdf = pd.DataFrame(regression.coef_, X.columns, columns=['Coefficients'])
print(cdf)

Schmitz answered 3/1, 2019 at 17:24 Comment(1)

regression.coef_ can come back as a 2-D array (for example when y is 2-D), so in that case use cdf = pd.concat([pd.DataFrame(X.columns),pd.DataFrame(np.transpose(regression.coef_))], axis = 1) – Bratislava 4/11, 2021 at 2:58

N

10

coefficients = pd.DataFrame({"Feature":X.columns,"Coefficients":np.transpose(logistic.coef_)})

Nonunionism answered 9/6, 2017 at 9:8 Comment(2)

This does not work for me. Exception: Data must be 1-dimensional – Kinghood 11/1, 2018 at 10:33

@Kinghood try coefficients = pd.DataFrame({"Feature":X.columns,"Coefficients":logistic.coef_[0]}) – Aleksandr 5/4, 2018 at 4:27
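For anyone hitting the same "Data must be 1-dimensional" error: for a binary problem, LogisticRegression stores coef_ with shape (1, n_features), and transposing that still leaves a 2-D array. A small sketch with a hand-written array standing in for coef_ (the values and names are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a binary LogisticRegression's coef_: shape (1, n_features).
coef_2d = np.array([[0.8, -1.5, 0.1]])
feature_names = ["feature1", "feature2", "feature3"]  # stand-in for X.columns

# np.transpose gives shape (3, 1), which is still 2-D, so a dict-of-columns
# DataFrame rejects it. ravel() flattens it to shape (3,).
print(np.transpose(coef_2d).shape)

coefficients = pd.DataFrame({"Feature": feature_names,
                             "Coefficients": coef_2d.ravel()})
print(coefficients)
```

Indexing with coef_[0] does the same job when there is only one row.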

Y

9

I suppose you are working on some feature selection task. regression.coef_ does return the coefficients in the same order as the features, i.e. regression.coef_[0] corresponds to "feature1" and regression.coef_[1] corresponds to "feature2". This should be what you want.

In turn, I recommend a tree-based model from sklearn, which can also be used for feature selection. For specifics, check out here.

Yl answered 15/11, 2014 at 23:31 Comment(4)

This is true as long as regression.coef_ returns coefficient values in the same order. Thanks. – Dustproof 16/11, 2014 at 0:55

The ExtraTreesClassifier is actually very interesting, but it seems there is no way to retrieve the actual features which it picked after the model has been fit? – Dustproof 16/11, 2014 at 1:17

@Dustproof Yes, but I always select features using clf.feature_importances_, which gives the importance ranking of the features. Intuitively it is just like the coefficients of a linear model, isn't it? – Yl 16/11, 2014 at 1:41

Well, if you use a feature extraction step such as CountVectorizer(), it has a method get_feature_names(). Then you can map get_feature_names() to .coef_ (I think they are in order, but I'm not sure). However, you cannot do this with the tree. – Dustproof 16/11, 2014 at 1:56
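For what it's worth, feature_importances_ on a fitted tree ensemble can be paired with column names positionally, the same way as coef_. A rough sketch on synthetic data; the column names, the label rule, and the max_features=None setting are illustrative assumptions:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic data where the label depends only on feature2 (made up for illustration).
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(300, 3)),
                 columns=["feature1", "feature2", "feature3"])
y = (X["feature2"] > 0).astype(int)

# max_features=None lets every split consider all three columns (illustrative choice).
clf = ExtraTreesClassifier(n_estimators=50, max_features=None,
                           random_state=0).fit(X, y)

# One importance per input column, in the same order as the training columns.
importances = pd.Series(clf.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
```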

D

6

Coefficients and features in zip

print(list(zip(X_train.columns.tolist(),logreg.coef_[0])))

Coefficients and features in DataFrame

pd.DataFrame({"Feature":X_train.columns.tolist(),"Coefficients":logreg.coef_[0]})
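Since the question asks for the highest-magnitude coefficients, the zipped pairs can be ranked by absolute value. A minimal sketch in plain Python; the names and numbers below are hypothetical stand-ins for X_train.columns.tolist() and logreg.coef_[0]:

```python
# Hypothetical stand-ins for X_train.columns.tolist() and logreg.coef_[0].
feature_names = ["feature1", "feature2", "feature3"]
coefs = [0.8, -1.5, 0.1]

# Sort (feature, coefficient) pairs by absolute coefficient, largest first.
ranked = sorted(zip(feature_names, coefs),
                key=lambda pair: abs(pair[1]), reverse=True)
print(ranked)
```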

Dactylo answered 25/4, 2020 at 13:22 Comment(0)

D

3

This is the easiest and most intuitive way:

pd.DataFrame(logisticRegr.coef_, columns=x_train.columns)

or the same but transposing index and columns

pd.DataFrame(logisticRegr.coef_, columns=x_train.columns).T

Dishevel answered 29/12, 2021 at 13:49 Comment(0)

A

1

Suppose your training data X is stored in 'df_X'. You can then zip the columns and coefficients into a dictionary and feed it into a pandas DataFrame to get the mapping:

pd.DataFrame(dict(zip(df_X.columns,model.coef_[0])),index=[0]).T

Ashurbanipal answered 20/9, 2018 at 3:13 Comment(0)

D

0

Try putting them in a Series with the data's column names as the index:

coeffs = pd.Series(model.coef_[0], index=X.columns.values)
coeffs.sort_values(ascending = False)
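Note that sort_values(ascending=False) orders by signed value, so a large negative coefficient ends up at the bottom. To actually pick the top features by magnitude and subset the training data, something like the following might work. It is only a sketch: the frame, the coefficient row, and k are made-up stand-ins for X, model.coef_[0], and your own cutoff.

```python
import numpy as np
import pandas as pd

# Hypothetical training frame and coefficient row standing in for X and model.coef_[0].
X = pd.DataFrame(np.arange(12.0).reshape(4, 3),
                 columns=["feature1", "feature2", "feature3"])
coef_row = np.array([0.8, -1.5, 0.1])

coeffs = pd.Series(coef_row, index=X.columns.values)

# nlargest on the absolute values keeps strongly negative coefficients too.
k = 2  # illustrative cutoff
top_features = coeffs.abs().nlargest(k).index
X_selected = X[top_features]
print(X_selected.columns.tolist())
```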

Dearth answered 18/8, 2020 at 12:16 Comment(0)
