How to restore the original feature names in XGBoost feature importance plot (after preprocessing removed them)?

Preprocessing the training data (such as centering or scaling) before training an XGBoost model can lead to a loss of feature names. Most answers on SO suggest training the model in such a way that feature names aren't lost (such as using pd.get_dummies on DataFrame columns).

I have trained an XGBoost model on preprocessed data (centered and scaled using MinMaxScaler), so I am in exactly that situation: the feature names are lost.

For instance:

    from sklearn.preprocessing import MinMaxScaler
    from xgboost import XGBClassifier

    # X is the training data (a pandas DataFrame), Y the labels
    scaler = MinMaxScaler(feature_range=(0, 1))
    X = scaler.fit_transform(X)
    my_model_name = XGBClassifier()
    my_model_name.fit(X, Y)

where X and Y are the training data and labels, respectively. fit_transform returns a 2D NumPy array, discarding the feature names of the original pandas DataFrame.

Thus, when I call plot_importance(my_model_name), I get the feature importance plot, but the features are labelled f0, f1, f2, etc. rather than with the actual feature names from the original data set. Is there a way to map the original feature names onto the generated feature importance plot, so that the real names appear in the graph? Any help in this regard is highly appreciated.

Anesthetist answered 28/2, 2019 at 20:33 Comment(1)
Maybe this can help: #44512136 – Ephraimite

You can get the feature names with:

model.get_booster().feature_names
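
A minimal sketch with made-up toy data, assuming the model was fitted on a pandas DataFrame (with a NumPy array this returns None, as the comments below note):

    import pandas as pd
    from xgboost import XGBClassifier

    # The point is only that X is a DataFrame, so the names travel with it
    X = pd.DataFrame({"age": [21, 35, 48, 52], "income": [30, 60, 80, 90]})
    Y = [0, 0, 1, 1]

    model = XGBClassifier(n_estimators=2, max_depth=2)
    model.fit(X, Y)

    print(model.get_booster().feature_names)  # ['age', 'income']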

Indulgent answered 10/2, 2020 at 17:36 Comment(6)
As you can see in my answer (and even in the question), this is not the correct answer, since you lose the original feature names when you pass a NumPy array into the fit method. – Bounty
That is why you should pass a DataFrame and not a NumPy array. – Indulgent
I do not agree. Yes, in most cases it's probably the best way to go. But in other cases (e.g. in my current project), where you have a complicated data preparation process and work with NumPy arrays (for various reasons, e.g. performance, ...), it's much easier to pass the array. – Bounty
And regarding your answer, you might add your note about using a DataFrame instead of a NumPy array, because as it stands it does not answer the question: the user is using a NumPy array, so model.get_booster().feature_names does not work for them. – Bounty
This does not work if the model has been saved and then loaded using save_model and load_model. – Fredrickafredrickson
FWIW, in certain cases passing a DataFrame is not an option, and then model.get_booster().feature_names returns None. Combining @Bounty's reply, I managed to set the feature names before save_model, and then they were easily available after load_model. To clarify: model.get_booster().feature_names = orig_feature_names worked. – Finder
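
To make that last comment concrete, a short sketch of the save/load round trip it describes (toy data and hypothetical feature names; the key line sets feature_names before saving):

    import numpy as np
    from xgboost import XGBClassifier

    X = np.random.rand(20, 3)                        # NumPy input, so names are lost
    Y = np.random.randint(0, 2, 20)
    orig_feature_names = ["age", "income", "score"]  # hypothetical original names

    model = XGBClassifier(n_estimators=2, max_depth=2)
    model.fit(X, Y)

    model.get_booster().feature_names = orig_feature_names  # set before saving
    model.save_model("model.json")

    restored = XGBClassifier()
    restored.load_model("model.json")
    print(restored.get_booster().feature_names)  # names survive the round trip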

You are right that when you pass a NumPy array to the fit method of XGBoost, you lose the feature names. In that case, calling model.get_booster().feature_names is not useful, because the returned names are of the form [f0, f1, ..., fn], and these same names are shown in the output of the plot_importance method as well.

But there are several ways to achieve what you want, provided you stored your original feature names somewhere, e.g. orig_feature_names = ['f1_name', 'f2_name', ..., 'fn_name'], or directly orig_feature_names = X.columns if X was a pandas DataFrame.

Then you should be able to:

  • change the stored feature names (model.get_booster().feature_names = orig_feature_names) and then use the plot_importance method, which should pick up the updated names and show them on the plot (see the sketch after this list)
  • or, since this method returns a matplotlib Axes, you can modify the labels using plot_importance(model).set_yticklabels(orig_feature_names) (but you have to get the order of the labels right yourself, since the plot sorts features by importance)
  • or you can take model.feature_importances_ and combine it with your original feature names yourself (i.e. plot it yourself)
  • similarly, you can use the model.get_booster().get_score() method and combine it with your feature names
  • or you can try the Learning API with an xgboost DMatrix and specify your feature names when creating the dataset (after scaling) with train_data = xgb.DMatrix(X, label=Y, feature_names=orig_feature_names) (but I do not have much experience with this way of training, since I usually use the Scikit-Learn API)
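
A minimal sketch of the first and third options, with made-up toy data and hypothetical feature names (the other options follow the same pattern):

    import numpy as np
    import matplotlib.pyplot as plt
    from xgboost import XGBClassifier, plot_importance

    X = np.random.rand(50, 3)                        # scaled NumPy array, names lost
    Y = np.random.randint(0, 2, 50)
    orig_feature_names = ["age", "income", "score"]  # hypothetical original names

    model = XGBClassifier(n_estimators=5, max_depth=2)
    model.fit(X, Y)

    # Option 1: overwrite the stored names, then plot as usual
    model.get_booster().feature_names = orig_feature_names
    plot_importance(model)
    plt.show()

    # Option 3: combine feature_importances_ with the names yourself
    for name, importance in zip(orig_feature_names, model.feature_importances_):
        print(name, importance)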

EDIT:

Thanks to @Noob Programmer (see the comments below), there might be some "inconsistencies" depending on which feature importance method you use. These are the most important ones:

  • xgboost.plot_importance uses "weight" as the default importance type (see plot_importance)
  • model.get_booster().get_score() also uses "weight" as the default (see get_score)
  • model.feature_importances_ depends on the importance_type parameter (model.importance_type), and it seems the result is normalized to sum to 1 (see this comment)

For more info on this topic, look at How to get feature importance.
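
A quick way to see those defaults side by side (same kind of toy setup as above; with a NumPy input the get_score keys come back as f0, f1, ...):

    import numpy as np
    from xgboost import XGBClassifier

    X = np.random.rand(50, 3)
    Y = np.random.randint(0, 2, 50)

    model = XGBClassifier(n_estimators=5, max_depth=2, importance_type="gain")
    model.fit(X, Y)

    print(model.get_booster().get_score())                        # default: "weight"
    print(model.get_booster().get_score(importance_type="gain"))  # raw gain values
    print(model.feature_importances_)  # gain, normalized to sum to 1 (per the note above)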

Bounty answered 1/2, 2021 at 10:51 Comment(2)
model.feature_importances_ and plot_importance(model, importance_type="gain") don't give the same features, so that 3rd point is not legit. Are the numbers after f, like "f1001", indices of the features in the dataframe? – Hostess
@NoobProgrammer: Thanks for the comment, see the updated answer. The result should be the same; the difference is the normalization. Feel free to update the answer if you think it's not clear enough. Regarding the numbers, yes, those should be indices of the features in the dataframe (or NumPy array, or whatever the input data was). That's why you can use model.get_booster().feature_names = orig_feature_names. Or you could parse those indices and use them directly on the resulting dict, for example. – Bounty

I tried the above answers, and they didn't work when loading the model after training. The code that worked for me is:

model.feature_names

It returns a list of the feature names.
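
For what it's worth, feature_names is an attribute of the Booster from the Learning API, so presumably the model here was trained or loaded that way; a sketch under that assumption, with toy data:

    import numpy as np
    import xgboost as xgb

    X = np.random.rand(20, 3)
    Y = np.random.randint(0, 2, 20)

    dtrain = xgb.DMatrix(X, label=Y, feature_names=["age", "income", "score"])
    booster = xgb.train({"max_depth": 2}, dtrain, num_boost_round=2)
    booster.save_model("model.json")

    restored = xgb.Booster()
    restored.load_model("model.json")
    print(restored.feature_names)  # ['age', 'income', 'score']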

Estienne answered 4/3, 2022 at 8:59 Comment(0)

I think it is best to turn the NumPy array back into a pandas DataFrame. E.g.:

import pandas as pd
import matplotlib.pyplot as plt
import xgboost as xgb
from sklearn.preprocessing import MinMaxScaler
from xgboost import XGBClassifier

X_df = pd.read_csv("train.csv")
orig_feature_names = list(X_df.columns)

Y = label  # the label vector, assumed to be defined elsewhere

# Scale, then wrap the resulting NumPy array back into a DataFrame
scaler = MinMaxScaler(feature_range=(0, 1))
X_scaled_np = scaler.fit_transform(X_df)
X_scaled_df = pd.DataFrame(X_scaled_np, columns=orig_feature_names)

my_model_name = XGBClassifier(max_depth=2, n_estimators=2)
my_model_name.fit(X_scaled_df, Y)

xgb.plot_importance(my_model_name)
plt.show()

This will show the original names.

Apostasy answered 23/5, 2022 at 10:28 Comment(0)
