What is the difference between xgboost.plot_importance() and model.feature_importances_ in XGBClassifier?
What is the difference between xgboost.plot_importance() and model.feature_importances_ in XGBClassifier?

Here I make some dummy data:

import numpy as np
import pandas as pd
# generate some random data for demonstration purpose, use your original dataset here
X = np.random.rand(1000,100)     # 1000 x 100 data
y = np.random.rand(1000).round() # 0, 1 labels
a = pd.DataFrame(X)
a.columns = ['param'+str(i+1) for i in range(len(a.columns))]
b = pd.DataFrame(y)

import xgboost as xgb
from xgboost import XGBClassifier
from xgboost import plot_importance
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split

model = XGBClassifier()
model.fit(a,b)

# Feature importance
model.feature_importances_

fi = pd.DataFrame({'Feature-names':a.columns,'Importances':model.feature_importances_})
fi.sort_values(by='Importances',ascending=False)


plt.bar(range(len(model.feature_importances_)),model.feature_importances_)
plt.show()

plt.rcParams.update({'figure.figsize':(20.0,180.0)})
plt.rcParams.update({'font.size':20.0})
plt.barh(a.columns,model.feature_importances_)

sorted_idx = model.feature_importances_.argsort()
plt.barh(a.columns[sorted_idx],model.feature_importances_[sorted_idx])
plt.xlabel('XGBoost Classifier Feature Importance')

#plot_importance
xgb.plot_importance(model, ax=plt.gca())
plt.show()

If you look at the graphs, feature_importances_ and plot_importance do not give the same result. I tried to read the documentation, but I do not understand it in layman's terms. Does anyone understand why plot_importance does not give results equal to feature_importances_?

(figures: feature_sorted, bar_sorted, plot_importance)

If I do this:

fi['Importances'].sum()

I get 1.0, which means the feature importances are fractions of a total (they sum to one).

If I want to do dimensionality reduction, which features should I use: the ones ranked by feature_importances_ or by plot_importance?

Gigi answered 11/8, 2022 at 9:4 Comment(0)
The scores shown by plot_importance are the raw scores; they are not normalized by the total.

Using your example:

import numpy as np
import pandas as pd
import xgboost as xgb
from xgboost import XGBClassifier
from xgboost import plot_importance
from matplotlib import pyplot as plt

np.random.seed(99)
X = np.random.rand(1000,100)     # 1000 x 100 data
y = np.random.rand(1000).round() # 0, 1 labels
a = pd.DataFrame(X)
a.columns = ['param'+str(i+1) for i in range(len(a.columns))]
b = pd.DataFrame(y)

model = XGBClassifier(importance_type = "weight")
model.fit(a,b)

xgb.plot_importance(model,max_num_features=10,importance_type = "weight")

This is the plot of the top 10 most important features:


To get the scores shown on the plot:

df = pd.DataFrame(model.get_booster().get_score(importance_type="weight"),
                  index=["raw_importance"]).T
df[:10]

         raw_importance
param98              35
param57              30
param17              30
param20              29
param14              28
param45              27
param22              27
param59              27
param13              26
param30              26

To get back the scores under model.feature_importances_, you need to divide the raw importance scores by their sum:

         raw_importance  normalized
param98              35    0.018747
param57              30    0.016069
param17              30    0.016069
param20              29    0.015533
param14              28    0.014997
param45              27    0.014462
param22              27    0.014462
param59              27    0.014462
param13              26    0.013926
param30              26    0.013926
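The normalization step above can be sketched in plain pandas. The `raw` dict below is a hypothetical stand-in for what `model.get_booster().get_score(importance_type="weight")` returns (in a real model you would normalize over all features, not just the top few shown here):

```python
import pandas as pd

# hypothetical raw "weight" scores, standing in for
# model.get_booster().get_score(importance_type="weight")
raw = {"param98": 35, "param57": 30, "param17": 30, "param20": 29}

df = pd.DataFrame(raw, index=["raw_importance"]).T
# divide each raw count by the total so the scores sum to one,
# which is what model.feature_importances_ reports
df["normalized"] = df["raw_importance"] / df["raw_importance"].sum()
print(df)
```

The ranking is unchanged by the division; only the scale differs.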

You will see it's the same as what you have under the model:

pd.DataFrame(model.feature_importances_, columns=['score'], index=a.columns)\
  .sort_values('score', ascending=False)[:10]

            score
param98  0.018747
param57  0.016069
param17  0.016069
param20  0.015533
param14  0.014997
param45  0.014462
param59  0.014462
param22  0.014462
param12  0.013926
param13  0.013926

So to answer your question: to rank the features, you can just use model.feature_importances_.
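For the dimensionality-reduction part of the question, a minimal sketch of selecting the top-k features by importance, assuming `importances` is the array you get from `model.feature_importances_` (a made-up example vector is used here):

```python
import numpy as np

# hypothetical importance vector, standing in for model.feature_importances_
importances = np.array([0.1, 0.4, 0.05, 0.25, 0.2])

k = 3  # keep the k highest-scoring features
# argsort is ascending, so reverse it and take the first k indices
top_idx = np.argsort(importances)[::-1][:k]
print(top_idx)  # → [1 3 4]
```

Because normalization does not change the ordering, selecting by feature_importances_ or by the raw plot_importance scores gives the same top-k set (for the same importance_type).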

Lavonnelaw answered 15/9, 2022 at 8:53 Comment(1)
In other words, plot_importance plots the raw values, while feature_importances_ provides fractions of the total (i.e. they sum to one). – Secern
