Retrieve list of training feature names from classifier
Asked Answered
Is there a way to retrieve the list of feature names used for training a classifier, once it has been trained with the fit method? I would like to get this information before applying the model to unseen data. The data used for training is a pandas DataFrame and, in my case, the classifier is a RandomForestClassifier.

Doolittle answered 8/11, 2016 at 11:6 Comment(1)
I am having the same issue. This is a big issue for me because the sequence of features can be permuted in my preprocessing. When I preprocess my prediction data to predict with the model, I no longer have a way to know what the sequence of features I trained with was. I only keep the pkl. I can of course also store a list, but this seems like a poor method.Plumy
A
4

Based on the documentation and previous experience, there is no way to retrieve the list of features that were considered in at least one of the splits.

Is your concern that you do not want to use all your features for prediction, just the ones actually used for training? In that case I suggest listing the feature_importances_ after fitting and eliminating the features that do not seem relevant. Then train a new model with only the relevant features and use those features for prediction as well.
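
A minimal sketch of that workflow, assuming a pandas DataFrame with named columns; the toy data and the importance threshold of 0.05 are only illustrative, not part of the answer:

import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# Toy data: a DataFrame with named columns, standing in for the real training set.
data = load_iris()
X_train = pd.DataFrame(data.data, columns=data.feature_names)
y_train = data.target

clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Pair each column name with its importance and keep the ones above the threshold.
importances = pd.Series(clf.feature_importances_, index=X_train.columns)
relevant = importances[importances > 0.05].index.tolist()

# Retrain on the relevant features only, and reuse the same list at prediction time.
clf_small = RandomForestClassifier(random_state=0).fit(X_train[relevant], y_train)
predictions = clf_small.predict(X_train[relevant])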

Alvord answered 10/11, 2016 at 9:14 Comment(1)
One concern is that another programmer in the future might present a data set that the prediction object cannot handle. Assuming the original code is not available, it is a basic expectation that the model object should be able to spit out all parameters relevant to the computation. In short, the object should be self-explaining.Wantage
P
34

I have a solution which works, but it is not very elegant. This is an old post with no existing solutions, so I suppose there aren't any.

Create and fit your model. For example

from sklearn.ensemble import GradientBoostingRegressor

model = GradientBoostingRegressor(**params)
model.fit(X_train, y_train)

Then you can add a 'feature_names' attribute, since you know the column names at training time:

model.feature_names = list(X_train.columns.values)  # store the training column order on the model

I typically then dump the model to a binary file to pass it around, but you can skip this step:

import joblib

joblib.dump(model, filename)
loaded_model = joblib.load(filename)

Then you can get the feature names back from the model to use them when you predict

f_names = loaded_model.feature_names
loaded_model.predict(X_pred[f_names])  # select/reorder prediction columns to match training
Plumy answered 21/12, 2018 at 0:56 Comment(2)
On the contrary, I find this solution elegant; all other solutions I found need two pages of code to extract a list of names. One thing though, shouldn't it be X_train.columns without the .values?Primp
I've taken this approach as well :). It's worth noting that this has some precedent with models like XGBoost, which write labels to the attribute "feature_names". And @FlyingTurtle you're correct: list(df.columns.values) returns the same result as list(df.columns); OP might have just felt more comfortable converting from index --> np.array --> list, but you can go directly from index --> list.Bartholemy
K
0

7 years 7 months late, but hey, there's an update. So you've trained the model and saved it using joblib or pickle.

X moments later

import joblib

model = joblib.load(filename)
# model = pickle.load(open(filename, 'rb'))
model.feature_names_in_

returns an ndarray of shape (n_features,), where n_features is the number of features seen during fit. It is defined only when X has feature names that are all strings.

This was added in version 1.0 of scikit-learn. For more details, and other useful methods and attributes like get_params and feature_importances_ as mentioned by Adam Jermann, refer to this link.
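
A small sketch of how this attribute can be used to line up unseen data before predicting; X_new is a hypothetical DataFrame that may have extra or reordered columns, and filename is the file from the snippet above:

import joblib

loaded = joblib.load(filename)
cols = list(loaded.feature_names_in_)        # column names seen during fit
predictions = loaded.predict(X_new[cols])    # select/reorder columns to match training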

Kidwell answered 25/6 at 12:2 Comment(0)
S
-3

You can extract feature names from a trained XGBoost model as follows:

model.get_booster().feature_names
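
For context, a minimal sketch using the scikit-learn wrapper; the toy DataFrame and parameters are only illustrative, but fitting on a DataFrame with named columns is what makes the booster record feature_names:

import pandas as pd
from xgboost import XGBClassifier

# Tiny toy DataFrame with named columns (hypothetical data).
X = pd.DataFrame({"age": [25, 32, 47, 51], "income": [40, 60, 80, 90]})
y = [0, 0, 1, 1]

model = XGBClassifier(n_estimators=10).fit(X, y)
print(model.get_booster().feature_names)   # ['age', 'income']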
Significant answered 1/10, 2020 at 1:23 Comment(0)
W
-4

You don't need to know which features were selected for training. Just make sure that, at prediction time, you give the fitted classifier the same features you used during the learning phase.

The Random Forest Classifier will only use the features on which it makes its splits. Those will be the same as those learnt during the first phase. Others won't be considered.

If the shape of your test data is not the same as the training data, it will throw an error, even if the test data contains all the features used for the splits of your decision trees.

What's more, since Random Forests make a random selection of features for each of their decision trees (called estimators in sklearn), all the features are likely to be used at least once.


However, if you want to know the features used, you can just call the attributes n_features_ and feature_importances_ on your classifier once fitted.

You can look here to see how you can retrieve the names of the most important features you used.
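
A short sketch of pairing feature_importances_ with the column names of the training DataFrame and printing them from most to least important; clf stands for the fitted RandomForestClassifier and X_train for the training DataFrame, both hypothetical here:

import numpy as np

order = np.argsort(clf.feature_importances_)[::-1]   # indices, most important first
for idx in order:
    print(X_train.columns[idx], clf.feature_importances_[idx])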

Westwardly answered 8/11, 2016 at 11:15 Comment(9)
the solution you suggest returns only the number of features and not their names. I know that at prediction stage, there is no need to provide the names but just the same features. However, in my case, I don't know beforehand which features were selected for training (and the column order, if that matters).Doolittle
@Doolittle : Why do you want to know the features selected for training ?Westwardly
I am creating a function which takes a classifier as an argument. In order not to hardcode lists of variables that I may not know a priori, and to avoid typos or typing endless lists of variables, it would be nice to have this stored in the classifier itself.Doolittle
The link to sklearn you are referring to does not show how to get the names, but only the indices of the already chosen training variables. Thanks anyway!Doolittle
Thank you, I got the point, however this doesn't help getting the feature names from the classifier. If I perform data processing in a different way than another person, the order and indices of columns/features in the dataframe may be different. Unless I missed your point... That's why I really want the feature name from the classifier itself and not from the training or application data.Doolittle
@Doolittle - it's been a few years, but I wonder if you ever found a satisfactory solution. I am completely with you; if a prediction crashes, the object itself should have enough information. Sometimes I think teenagers devised this part of the code, because they're used to having a great memory. "Oh, sure, I know where that code is!" It comes down to sklearn relying on the position of columns in a file, and this is simply not good practice.Wantage
@mmf precisely how does one do that without access to the original data set?Wantage
@MichaelTuchman I usually store the list of feature names, which I found moderately satisfactory. If I had more time and a clearer use case, I'd probably create my own custom classifier class with some additional information, MTC.Doolittle
@Westwardly My original concern was only that I need to pass the features in the same order in the training and testing phases. However, the preprocessing may shuffle this order and give a hard time to somebody taking over this part of my code.Doolittle
