Plot feature importance with xgboost
When I plot the feature importance, I get this messy plot. I have more than 7,000 variables. I understand the built-in function only selects the most important ones, but the final graph is still unreadable. This is the complete code:

import numpy as np
import pandas as pd
from xgboost import XGBClassifier, plot_importance
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt

df = pd.read_csv('ricerice.csv')
array = df.values
X = array[:, 0:7803]
Y = array[:, 7804]

seed = 0
test_size = 0.30
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=test_size, random_state=seed)

model = XGBClassifier()
model.fit(X, Y)

fig1 = plt.gcf()
plot_importance(model)
plt.draw()
fig1.savefig('xgboost.png', figsize=(50, 40), dpi=1000)

Despite the size of the figure, the graph is illegible.

[xgboost feature importance plot]

Hodman answered 18/8, 2018 at 5:22 Comment(1)
The best solution is to focus on the top n most important features, for example the top 10: xgboosting.com/xgboost-plot-top-10-most-important-features – Brockington
There are a couple of points:

  1. To fit the model, you want to use the training dataset (X_train, y_train), not the entire dataset (X, Y).
  2. You can use the max_num_features parameter of the plot_importance() function to display only the top max_num_features features (e.g. the top 10).

With the above modifications to your code, and some randomly generated data, the code and output are as follows:

import numpy as np
from xgboost import XGBClassifier, plot_importance
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

# generate some random data for demonstration purposes; use your original dataset here
X = np.random.rand(1000, 100)    # 1000 x 100 data
y = np.random.rand(1000).round() # 0, 1 labels

seed = 0
test_size = 0.30
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=seed)

model = XGBClassifier()
model.fit(X_train, y_train)

plot_importance(model, max_num_features=10) # top 10 most important features
plt.show()


Fencible answered 18/8, 2018 at 6:37 Comment(3)
How do I get what f39 is? – Lindner
Use model.get_booster().get_score(importance_type='weight') to get the importance of all features. – Fencible
If you're using make_pipeline to instantiate your model, you can assign feature names with xg_boost.get_booster().feature_names = list(ml_pipeline[0].get_feature_names_out()) – Langsyne
You need to sort your feature importances in descending order first:

sorted_idx = trained_mdl.feature_importances_.argsort()[::-1]

Then plot them with the column names from your dataframe:

from matplotlib import pyplot as plt

n_top_features = 10
sorted_idx = trained_mdl.feature_importances_.argsort()[::-1]
plt.barh(X_test.columns[sorted_idx][:n_top_features],
         trained_mdl.feature_importances_[sorted_idx][:n_top_features])
Tradelast answered 4/11, 2021 at 13:7 Comment(0)
You can obtain feature importances from an XGBoost model with the feature_importances_ attribute. In your case, it will be:

model.feature_importances_

This attribute is an array with the gain importance of each feature. You can then plot it:

from matplotlib import pyplot as plt
plt.barh(feature_names, model.feature_importances_)

(feature_names is a list of feature names)

You can sort the array and select the number of features you want (for example, the top 10):

sorted_idx = model.feature_importances_.argsort()[::-1]  # descending, most important first
plt.barh(np.array(feature_names)[sorted_idx][:10], model.feature_importances_[sorted_idx][:10])
plt.xlabel("Xgboost Feature Importance")

There are two more methods to get feature importance:

  • you can use permutation_importance from scikit-learn (available since version 0.22)
  • you can use SHAP values

You can read more in this blog post of mine.

Averill answered 17/8, 2020 at 11:51 Comment(1)
Note that you need to sort in descending order for this to work correctly. – Tradelast

© 2022 - 2024 — McMap. All rights reserved.