ValueError: feature_names mismatch: in xgboost in the predict() function
A

13

40

I have trained an XGBRegressor model. When I use this trained model to predict on a new input, the predict() function throws a feature_names mismatch error, although the input feature vector has the same structure as the training data.

Also, to build the feature vector in the same structure as the training data, I am doing a lot of inefficient processing, such as adding new empty columns (if the data does not exist) and then rearranging the columns so that they match the training structure. Is there a better and cleaner way of formatting the input so that it matches the training structure?

Avis answered 20/2, 2017 at 7:43 Comment(0)
P
40

This error occurs when the order of the columns at scoring time differs from the order used when the model was built.

I used the following steps to overcome it.

First, load the pickled model:

model = pickle.load(open("saved_model_file", "rb"))

Extract the column names in the order in which they were used for training:

cols_when_model_builds = model.get_booster().feature_names

Reorder the pandas DataFrame to match:

pd_dataframe = pd_dataframe[cols_when_model_builds]
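The question also asks for a cleaner way to add missing columns. A minimal sketch, assuming the feature names come from model.get_booster().feature_names as above: pandas' reindex can reorder, add and drop columns in one call. The helper name, the sample column names and the choice of 0 as the fill value are illustrative assumptions; 0 suits absent one-hot columns, while NaN may be more appropriate elsewhere, since XGBoost treats NaN as missing by default.

```python
import pandas as pd


def align_to_training_columns(df, feature_names, fill_value=0):
    """Return a copy of df with exactly feature_names as columns, in that order.

    Columns missing from df are created and filled with fill_value;
    columns that are not in feature_names are dropped.
    """
    return df.reindex(columns=list(feature_names), fill_value=fill_value)


# Hypothetical stand-in for model.get_booster().feature_names
cols_when_model_builds = ["age", "income", "city_london", "city_paris"]

# New input: different order, one training column absent, one extra column
new_input = pd.DataFrame(
    {"income": [52000.0], "age": [31], "city_paris": [1], "unused": [9]}
)

aligned = align_to_training_columns(new_input, cols_when_model_builds)
print(aligned)
```

The aligned frame can then be passed to predict() in place of the original input.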
Pawpaw answered 4/6, 2019 at 9:2 Comment(3)
I tried to check the feature names of my inference data, which is a numpy array and I got none.Psychiatrist
@Psychiatrist If at training time you fit your model with a pandas.DataFrame, then column names are retained in your serialized model (pkl). If you fit your model with a numpy array, then there are no column names for xgboost to use.Jazminejazz
Just a note in case it helps others: if you're passing eval_set to fit, all datasets given in that list must also have columns in the same order. You can call X.sort_index(axis=1) on each to ensure this is true.Abrasion
D
18

Try converting the data into an ndarray before passing it to fit/predict. For example, if your training data is train_df and your test data is test_df, use the code below:

train_x = train_df.values
test_x = test_df.values

Now fit the model:

xgb.fit(train_x,train_y)

Finally, predict:

pred = xgb.predict(test_x)

Hope this helps!
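One caveat, sketched here with made-up column names: an ndarray carries no names, so the mismatch check disappears, but a test frame whose columns are in a different order is then consumed positionally without any warning. Reordering before calling .values avoids that.

```python
import pandas as pd

train_df = pd.DataFrame({"a": [1, 2], "b": [10, 20]})
test_df = pd.DataFrame({"b": [30], "a": [3]})  # same features, different order

# Converting directly drops the names, so the values are silently swapped
print(test_df.values.tolist())  # [[30, 3]]

# Reordering first keeps each value in the position used for training
test_x = test_df[train_df.columns].values
print(test_x.tolist())  # [[3, 30]]
```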

Dygal answered 31/10, 2018 at 9:2 Comment(1)
Thanks. In my case, the error reproduced only with xgboost regressor, with other regressions worked fine.Eversion
A
9

From what I could find, the predict function does not take the DataFrame (or a sparse matrix) as input. It is one of the bugs which can be found here https://github.com/dmlc/xgboost/issues/1238

To get around this issue, convert the input first: use as_matrix() for a DataFrame (as_matrix() was removed in pandas 1.0; .values or .to_numpy() is the current equivalent) or toarray() for a sparse matrix.

This is the only workaround till the bug is fixed or the feature is implemented in a different manner.

Avis answered 20/2, 2017 at 7:47 Comment(0)
B
9

I also had this problem when I used a pandas DataFrame (non-sparse representation).

I converted the training and testing data into numpy ndarrays:

X_train = X_train.as_matrix()
X_test = X_test.as_matrix()

This is how I got rid of that error!

Bestir answered 21/3, 2018 at 4:27 Comment(1)
The as_matrix() method now seems deprecated. The suggestion is to use .values, which didn't work for me, but the docs are here.Manriquez
S
6

I came across the same problem and solved it by applying the training DataFrame's column names, in order, to the test DataFrame with the following code:

test_df = test_df[train_df.columns]
Shiprigged answered 1/3, 2018 at 0:10 Comment(0)
E
3

Check the exception. What you should see are two arrays. One is the column names of the dataframe you're passing in and the other is the XGBoost feature names. They should be the same length. If you put them side by side in an Excel spreadsheet you will see that they are not in the same order. My guess is that the XGBoost names were written to a dictionary, so it would be a coincidence if the names in the two arrays were in the same order.

The fix is easy. Just reorder your dataframe columns to match the XGBoost names:

f_names = model.feature_names
df = df[f_names]
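Instead of a spreadsheet, the two lists from the exception can be compared in code. This is only a sketch: the helper name and the sample names are hypothetical, and in practice the inputs would be the model's feature names and list(df.columns).

```python
def compare_feature_names(expected, actual):
    """Summarise how two feature-name lists differ."""
    expected, actual = list(expected), list(actual)
    return {
        # names the model expects but the input lacks
        "missing": [n for n in expected if n not in actual],
        # names in the input that the model never saw
        "unexpected": [n for n in actual if n not in expected],
        # True when both lists hold the same names in a different order
        "same_names_wrong_order": sorted(expected) == sorted(actual)
        and expected != actual,
    }


# Hypothetical names for illustration
model_names = ["age", "bmi", "glucose"]
frame_names = ["glucose", "age", "bmi"]

report = compare_feature_names(model_names, frame_names)
print(report)
```

If only same_names_wrong_order is set, reordering the columns as shown above is enough; non-empty missing or unexpected lists point to a preprocessing difference instead.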
Enlarge answered 11/6, 2018 at 19:55 Comment(2)
AttributeError: 'XGBRegressor' object has no attribute 'feature_names'Lomax
@Lomax You could try this :model.get_booster().feature_namesSwiger
C
2

I'm contributing an answer as I experienced this problem when putting a fitted XGBRegressor model into production. Thus, this is a solution for cases where you cannot select column names from a y training or testing DataFrame, though there may be cross-over which could be helpful.

The model had been fit on a Pandas DataFrame, and I was attempting to pass a single row of values as a np.array to the predict function. Processing the values of the array had already been performed (reverse label encoded, etc.), and the array was all numeric values.

I got the familiar error:

ValueError: feature_names mismatch followed by a list of the features, followed by a list of the same length: ['f0', 'f1' ....]

While there are no doubt more direct solutions, I had little time and this fixed the problem:

  1. Make the input vector a Pandas Dataframe:
series = {'feature1': [value],
          'feature2': [value],
          'feature3': [value],
          'feature4': [value],
          'feature5': [value],
          'feature6': [value],
          'feature7': [value],
          'feature8': [value],
          'feature9': [value],
          'feature10': [value]
           }

vector = pd.DataFrame(series)
  2. Get the feature names that the trained model knows:

names = model.get_booster().feature_names

  3. Select those features from the input vector DataFrame (defined above), and perform iloc indexing:

result = model.predict(vector[names].iloc[[-1]])
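The same three steps can be folded into one small helper. This is a sketch under assumptions: the function name and the sample feature names are made up, and names would come from model.get_booster().feature_names as in step 2. Raising on a missing feature is a design choice so that a bad request fails loudly rather than producing a misaligned row.

```python
import pandas as pd


def make_input_row(values, feature_names):
    """Build a one-row DataFrame whose columns follow feature_names exactly."""
    missing = [n for n in feature_names if n not in values]
    if missing:
        raise KeyError(f"missing features: {missing}")
    row = [[values[n] for n in feature_names]]
    return pd.DataFrame(row, columns=list(feature_names))


# Hypothetical stand-in for model.get_booster().feature_names
names = ["feature1", "feature2", "feature3"]

# Incoming values in arbitrary order
incoming = {"feature3": 0.5, "feature1": 12, "feature2": 7}

vector = make_input_row(incoming, names)
print(vector)
```

The resulting vector can be passed straight to model.predict().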


The iloc transformation I found here.

Selecting the feature names with get_booster().feature_names, which is needed because models in the Scikit-Learn implementation do not have a feature_names attribute, I found in @Athar's post above.

Check out the documentation to learn more.

Hope this helps.

Coster answered 17/8, 2019 at 12:0 Comment(0)
A
1

Do this while creating the DMatrix for XGB:

dtrain = xgb.DMatrix(np.asmatrix(X_train), label=y_train)
dtest = xgb.DMatrix(np.asmatrix(X_test), label=y_test)

Do not pass X_train and X_test directly.

Autolysin answered 25/10, 2018 at 10:13 Comment(0)
G
1

XGBRegressor needs the columns (features) to be in the same order.

Try

DataFrame = DataFrame.reindex(sorted(DataFrame.columns), axis=1)

Apply it on both train and test feature datasets.

Gyve answered 8/3, 2021 at 21:54 Comment(0)
N
1

I was also facing the same issue and tried all the techniques above; all of them failed. I was using the Pima diabetes dataset. model.fit() worked, but manual testing with predict() threw errors about missing feature names. Then I tried something that works for me.

test1=[[6,148,72,35,0,33,0.8,54]]
test2= pd.DataFrame(test1,columns= 
['Pregnancies','Glucose','BloodPressure','SkinThickness',
'Insulin','BMI','DiabetesPedigreeFunction','Age'],dtype=float)
p=classifier.predict(test2)
print("Diabetes [0 - No Yes - 1] :\n Result : ",p[0])

The Columns are basically the independent variables columns in my dataset.

You may ask whether you need this more complex method every time you want to predict something. The answer is no: after you pickle the model, you can simply call model.predict([[test]]) without any problem.

You can see the complete code here.

Nonconcurrence answered 27/5, 2021 at 8:15 Comment(0)
W
0

Instead of xgb.predict(12,34,344), try:

z = [[12, 34, 344]]
y = pd.DataFrame(z)
xgb.predict(y)
Whiteness answered 30/6, 2022 at 12:33 Comment(0)
A
0

One line solution:

XGBoost expects the test set to have the same columns, in the same order, as the training set used to fit the model. This can be done in a single line.

Example:

X_train, X_test = X_train.align(X_test, join='left', axis=1)

This line of code uses the align() function from the pandas library to align two DataFrames, X_train and X_test, along their columns (since axis=1). The join='left' parameter means that both resulting DataFrames will have exactly the columns of the left DataFrame (X_train), in its order.

The align() function doesn't combine two dataframes, rather it aligns them so that the two dataframes have the same row and/or column configuration.
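A small illustration with made-up frames may help show what align() does when the test set is reordered, lacks a training column and carries an extra one. Columns absent from X_test come back as NaN, which XGBoost treats as missing by default.

```python
import pandas as pd

X_train = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})
# Reordered, lacks "b", and has an extra column "d"
X_test = pd.DataFrame({"c": [7], "a": [8], "d": [9]})

X_train_aligned, X_test_aligned = X_train.align(X_test, join="left", axis=1)

print(list(X_test_aligned.columns))  # ['a', 'b', 'c']
print(X_test_aligned)
```

Because only the columns are aligned (axis=1), the rows of both frames are left unchanged.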

Aerial answered 26/7, 2023 at 11:5 Comment(0)
G
0

Had the same issue. The problem was that one variable appeared twice in the list of features, so I got this error.
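A quick way to check for this case, sketched with hypothetical feature names, is to ask pandas which column labels are repeated and, if appropriate, keep only the first occurrence of each:

```python
import pandas as pd

features = ["age", "bmi", "glucose", "bmi"]  # "bmi" listed twice by mistake
df = pd.DataFrame([[50, 33.6, 148, 33.6]], columns=features)

# Labels that occur more than once (every occurrence after the first is flagged)
duplicated = df.columns[df.columns.duplicated()].tolist()
print(duplicated)  # ['bmi']

# Keep only the first occurrence of each column
deduped = df.loc[:, ~df.columns.duplicated()]
print(list(deduped.columns))  # ['age', 'bmi', 'glucose']
```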

Gerrygerrymander answered 1/12, 2023 at 1:26 Comment(0)
