Feature Importance with XGBClassifier

Hopefully I'm reading this wrong, but in the XGBoost library documentation there is a note about extracting the feature importances using the feature_importances_ attribute, much like sklearn's random forest.

However, for some reason, I keep getting this error: AttributeError: 'XGBClassifier' object has no attribute 'feature_importances_'

My code snippet is below:

from sklearn import datasets
import xgboost as xg
iris = datasets.load_iris()
X = iris.data
Y = iris.target
X = X[Y < 2] # cutting the feature rows to match the rows kept in Y
Y = Y[Y < 2] # arbitrarily removing class 2 so the labels are 0 and 1
xgb = xg.XGBClassifier()
fit = xgb.fit(X, Y)
fit.feature_importances_

It seems that you can compute feature importance from the Booster object by calling its get_fscore method. The only reason I'm using XGBClassifier over Booster is that it can be wrapped in a sklearn pipeline. Any thoughts on extracting feature importances? Is anyone else experiencing this?
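For context, this is roughly the pipeline use case I have in mind (illustrative only; the step names are arbitrary, and the get_booster() call assumes a newer xgboost, older builds expose booster() instead):

from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from xgboost import XGBClassifier

X, y = load_iris(return_X_y=True)
X, y = X[y < 2], y[y < 2]                 # keep classes 0 and 1, as above

pipe = Pipeline([('scale', StandardScaler()),
                 ('clf', XGBClassifier())])
pipe.fit(X, y)

booster = pipe.named_steps['clf'].get_booster()   # older builds expose .booster() instead
print(booster.get_fscore())                       # e.g. {'f2': 12, 'f3': 9, ...}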

Hedwig answered 5/7, 2016 at 21:0 Comment(8)
I can't reproduce the problem with your snippet. What version of XGBoost do you have?Stillbirth
From my pip freeze, I have xgboost==0.4a30Hedwig
Does this help? kaggle.com/mmueller/…Kirkuk
I have seen this before. The problem, however, is that the get_fscore method is bound to the Booster object rather than XGBClassifier, from my understanding. See the doc hereHedwig
I have 0.4 and your snippet works with no problem.Stillbirth
Hrm this is odd. The current version is 0.4a30 right? It appears so looking at their repoHedwig
@MinhMai using feature_importances_ via booster(), are you able to get the column names accurately? In my case, it throws a KeyError saying that certain features are not present in the data.Glori
You can plot XGBClassifier feature importance with names directly: xgboosting.com/…Methanol

As the comments indicate, I suspect your issue is a versioning one. However, if you can't or don't want to update, the following function should work for you.

def get_xgb_imp(xgb, feat_names):
    from numpy import array
    # the Booster reports importances keyed 'f0', 'f1', ...; map them back to the given names
    imp_vals = xgb.booster().get_fscore()
    imp_dict = {feat_names[i]: float(imp_vals.get('f' + str(i), 0.)) for i in range(len(feat_names))}
    # normalise so the importances sum to 1
    total = array(imp_dict.values()).sum()
    return {k: v / total for k, v in imp_dict.items()}


>>> import numpy as np
>>> from xgboost import XGBClassifier
>>> 
>>> feat_names = ['var1','var2','var3','var4','var5']
>>> np.random.seed(1)
>>> X = np.random.rand(100,5)
>>> y = np.random.rand(100).round()
>>> xgb = XGBClassifier(n_estimators=10)
>>> xgb = xgb.fit(X,y)
>>> 
>>> get_xgb_imp(xgb,feat_names)
{'var5': 0.0, 'var4': 0.20408163265306123, 'var1': 0.34693877551020408, 'var3': 0.22448979591836735, 'var2': 0.22448979591836735}
Cyrilcyrill answered 6/7, 2016 at 15:22 Comment(6)
Interesting approach! However, would it matter if I tune my parameters for XGBClassifier? How would I ensure that they match the parameters for the Booster?Hedwig
you're referencing the booster() object within your XGBClassifier() object, so it will match: xgb.booster()Cyrilcyrill
I noticed something strange; is that supposed to happen? Shouldn't the values returned from xgb.booster().get_fscore() contain values for all the columns the model was trained on? I find 2 columns missing from imp_vals: they are present in the training columns but do not appear as keys in imp_colsFeudist
I had to use xgb.get_booster().get_fscore(). Otherwise I was getting TypeError: 'str' object is not callable. I am using xgboost 0.6.Elston
I pickled my XGB object and am unable to call get_booster(): File "/usr/local/lib/python3.5/dist-packages/xgboost/sklearn.py", line 193, in get_booster raise XGBoostError('need to call fit or load_model beforehand') Tourcoing
As of today, calling get_xgb_imp(xgb_model, columns_names) fails with TypeError: 'NoneType' object is not callable, raised at the line imp_vals = xgb.booster().get_fscore()Arcadia

For xgboost, if you use xgb.fit(), then you can use the following method to get feature importance.

import pandas as pd

xgb_model = xgb.fit(x, y)
xgb_fea_imp = pd.DataFrame(list(xgb_model.get_booster().get_fscore().items()),
                           columns=['feature', 'importance']).sort_values('importance', ascending=False)
print(xgb_fea_imp)
xgb_fea_imp.to_csv('xgb_fea_imp.csv')

from xgboost import plot_importance
plot_importance(xgb_model)
Palanquin answered 18/6, 2018 at 4:36 Comment(0)

I found out the answer. It appears that version 0.4a30 does not have the feature_importances_ attribute. Therefore, if you install the xgboost package using pip install xgboost you will be unable to do feature extraction from the XGBClassifier object; you can refer to @David's answer if you want a workaround.

However, what I did was build it from source by cloning the repo and running . ./build.sh, which installs version 0.4, where the feature_importances_ attribute works.
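If you are not sure which build actually ended up installed, a quick sanity check (just an illustrative snippet, not part of the workaround above):

# print the installed xgboost version; very old builds may not expose __version__,
# hence the getattr fallback
import xgboost
print(getattr(xgboost, '__version__', 'no __version__ attribute'))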

Hope this helps others!

Hedwig answered 9/7, 2016 at 0:32 Comment(0)

Get Feature Importance as a sorted data frame

import pandas as pd
import numpy as np

def get_xgb_imp(xgb, feat_names):
    imp_vals = xgb.booster().get_fscore()
    # build a two-column frame: feature name, importance
    feats_imp = pd.DataFrame(imp_vals, index=np.arange(2)).T
    feats_imp.iloc[:, 0] = feats_imp.index
    feats_imp.columns = ['feature', 'importance']
    feats_imp.sort_values('importance', inplace=True, ascending=False)
    feats_imp.reset_index(drop=True, inplace=True)
    return feats_imp

feature_importance_df = get_xgb_imp(xgb, feat_names)
Puritan answered 23/4, 2018 at 13:53 Comment(0)

For those having the same problem as Luís Bianchin, "TypeError: 'str' object is not callable", I found a solution (that works for me at least) here.

In short, I found modifying David's code from

imp_vals = xgb.booster().get_fscore()

to

imp_vals = xgb.get_fscore()

worked for me.
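Put together, the modified helper would look roughly like this (the same structure as David's function with only that one line changed; the list() around the dict values is an extra tweak, not from the original answer, so the numpy sum also works on Python 3):

from numpy import array

def get_xgb_imp(xgb, feat_names):
    # call get_fscore() on the fitted estimator directly instead of xgb.booster()
    imp_vals = xgb.get_fscore()
    imp_dict = {feat_names[i]: float(imp_vals.get('f' + str(i), 0.)) for i in range(len(feat_names))}
    total = array(list(imp_dict.values())).sum()  # list() so the dict view sums correctly on Python 3
    return {k: v / total for k, v in imp_dict.items()}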

For more detail I would recommend visiting the link above.

Big thanks to David and ianozsvald

Serpigo answered 6/5, 2019 at 20:46 Comment(0)

You can also use the built-in plot_importance function:

from xgboost import XGBClassifier, plot_importance
fit = XGBClassifier().fit(X,Y)
plot_importance(fit)

(feature importance plot produced by plot_importance)

Gripsack answered 12/8, 2020 at 7:15 Comment(0)

An alternative to the built-in feature importance is SHAP-based importance.

I really like the shap package because it provides additional plots (a short sketch of how to produce them follows the list below). Examples:

Importance Plot

Summary Plot

Dependence Plot
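A minimal sketch of producing plots like these with the shap package (the dataset, model settings, and feature name are illustrative, not taken from this answer):

import shap
import xgboost
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
model = xgboost.XGBClassifier().fit(X, y)

explainer = shap.TreeExplainer(model)          # tree-model explainer for the fitted booster
shap_values = explainer.shap_values(X)

shap.summary_plot(shap_values, X, plot_type="bar")   # importance plot
shap.summary_plot(shap_values, X)                    # summary (beeswarm) plot
shap.dependence_plot("mean radius", shap_values, X)  # dependence plot for one feature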

You can read about alternative ways to compute feature importance in Xgboost in this blog post of mine.

Piselli answered 17/8, 2020 at 12:8 Comment(0)

An update of the accepted answer since it no longer works:

def get_xgb_imp(xgb_model, feat_names):
    imp_vals = xgb_model.get_fscore()
    imp_dict = {feat: float(imp_vals.get(feat, 0.)) for feat in feat_names}
    total = sum(list(imp_dict.values()))
    return {k: round(v/total, 5) for k,v in imp_dict.items()}
Sitnik answered 18/10, 2019 at 13:42 Comment(0)

It seems like the API keeps changing. For xgboost version 1.0.2, just changing imp_vals = xgb.booster().get_fscore() to imp_vals = xgb.get_booster().get_fscore() in @David's answer does the trick. The updated code is:

from numpy import array

def get_xgb_imp(xgb, feat_names):
    imp_vals = xgb.get_booster().get_fscore()
    imp_dict = {feat_names[i]: float(imp_vals.get('f' + str(i), 0.)) for i in range(len(feat_names))}
    total = array(list(imp_dict.values())).sum()  # list() so the dict view sums correctly on Python 3
    return {k: v / total for k, v in imp_dict.items()}
Thespian answered 19/3, 2020 at 13:7 Comment(0)

I used the following code to get the feature importances. I also used DictVectorizer() in the pipeline for one-hot encoding. If you use

from sklearn.feature_extraction import DictVectorizer
import xgboost as xgb

v = DictVectorizer()
X_to_dict = X.to_dict("records")          # DataFrame rows as dicts for one-hot encoding
X_transformed = v.fit_transform(X_to_dict)
feature_names = v.get_feature_names()
best_model.get_booster().feature_names = feature_names  # attach names so the plot is labelled
xgb.plot_importance(best_model.get_booster())

You can obtain the f_score plot. But I wanted to plot the feature importance against the feature names, so I modified it further:

import matplotlib.pyplot as plt

f, ax = plt.subplots(figsize=(10, 30))
plt.barh(feature_names, best_model.feature_importances_)
plt.xticks(rotation=90)
plt.show()

Liatris answered 2/9, 2022 at 14:5 Comment(0)
