Getting a weird error when trying to run xgboost.predict or xgboost.score
I'm trying to run an xgboost regressor model on a dataset without any missing data.

# Run GBM on training dataset
# Create xgboost object
pts_xgb = xgb.XGBRegressor(objective="reg:squarederror", missing=None, seed=42)

# Fit xgboost onto data
pts_xgb.fit(X_train,
            y_train,
            verbose=True,
            early_stopping_rounds=10,
            eval_metric='rmse',
            eval_set=[(X_test, y_test)])

The model creation seems to work fine, and I confirmed that X_train and y_train have no null values, using the following:

print(X_train.isnull().values.sum()) # prints 0
print(y_train.isnull().values.sum()) # prints 0

But when I run the following code, I get the below error.

Code:

pts_xgb.score(X_train,y_train)

Error:

---------------------------------------------------------------------------
XGBoostError                              Traceback (most recent call last)
<ipython-input-37-39b223d418b2> in <module>
----> 1 pts_xgb.score(X_train_test,y_train_test)

/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/sklearn/base.py in score(self, X, y, sample_weight)
    551 
    552         from .metrics import r2_score
--> 553         y_pred = self.predict(X)
    554         return r2_score(y, y_pred, sample_weight=sample_weight)
    555 

/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/xgboost/sklearn.py in predict(self, X, output_margin, ntree_limit, validate_features, base_margin, iteration_range)
    818         if self._can_use_inplace_predict():
    819             try:
--> 820                 predts = self.get_booster().inplace_predict(
    821                     data=X,
    822                     iteration_range=iteration_range,

/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/xgboost/core.py in inplace_predict(self, data, iteration_range, predict_type, missing, validate_features, base_margin, strict_shape)
   1844             from .data import _maybe_np_slice
   1845             data = _maybe_np_slice(data, data.dtype)
-> 1846             _check_call(
   1847                 _LIB.XGBoosterPredictFromDense(
   1848                     self.handle,

/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/xgboost/core.py in _check_call(ret)
    208     """
    209     if ret != 0:
--> 210         raise XGBoostError(py_str(_LIB.XGBGetLastError()))
    211 
    212 

XGBoostError: [09:18:58] /Users/travis/build/dmlc/xgboost/src/c_api/c_api_utils.h:157: Invalid missing value: null
Stack trace:
  [bt] (0) 1   libxgboost.dylib                    0x000000011e4e7064 dmlc::LogMessageFatal::~LogMessageFatal() + 116
  [bt] (1) 2   libxgboost.dylib                    0x000000011e4d9afc xgboost::GetMissing(xgboost::Json const&) + 268
  [bt] (2) 3   libxgboost.dylib                    0x000000011e4e0a13 void InplacePredictImpl<xgboost::data::ArrayAdapter>(std::__1::shared_ptr<xgboost::data::ArrayAdapter>, std::__1::shared_ptr<xgboost::DMatrix>, char const*, xgboost::Learner*, unsigned long, unsigned long, unsigned long long const**, unsigned long long*, float const**) + 531
  [bt] (3) 4   libxgboost.dylib                    0x000000011e4e04d3 XGBoosterPredictFromDense + 339
  [bt] (4) 5   libffi.dylib                        0x00007fff2dc7f8e5 ffi_call_unix64 + 85

The same error occurs if I try to run pts_xgb.predict(X_train).

Edit: this is not an issue with missing/null values in either X_train or y_train. I got the same error when using the following dataset, which is much smaller than my actual dataset (see below):

X_train: 1

y_train: 2

Does anyone have any idea why this may be happening? I couldn't find any other forum that discusses the same issue.

Tiliaceous answered 24/4, 2021 at 16:40 Comment(7)
Instead of sum, try using count? If that also doesn't show nulls, try using NVL or coalesce to replace nulls with a string and count the instances of that string. – Runnerup
I tried a few different methods and everything is turning up 0 nulls/blank fields. I even exported to Excel (using X_train.to_excel(...)), as that's where I'm a bit more comfortable, and confirmed that there are no blank cells and that every cell is a number. – Tiliaceous
How did you confirm in Excel? – Runnerup
I did COUNT() on every column (which only counts number values) and COUNTBLANK() on every column to confirm that there are no blank cells. COUNT() returned the exact number of rows of my data for every column, and COUNTBLANK() returned 0 for every column. – Tiliaceous
OK, you can try one thing: apply a filter, open the drop-down, and look at the filter values, specifically those at the end of the list. Maybe it is converting nulls to ? or N/A or something else. Try it for a few sample columns. – Runnerup
You can also pick a column, remove duplicates, and sort, and look at the values at the top or the bottom, depending on whether you are sorting ascending or descending. – Runnerup
Thanks for the help, but missing data is not the issue. I've edited the post as proof. Any other ideas as to where this error may be coming from? – Tiliaceous
This IS a missing/null value problem.

Instead of xgb.XGBRegressor(objective="reg:squarederror", missing=None, seed=42)

try xgb.XGBRegressor(objective="reg:squarederror", missing=1, seed=42)

For the reason, see the answer to: How to use missing parameter of XGBRegressor of scikit-learn
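The failure mode can be sketched in plain Python, without XGBoost installed. This is a loose, hypothetical stand-in for the library's internal validation (the helper name check_missing is made up): the missing parameter is handed to the native library as a float, so a numeric value like 1 or NaN is fine, while None is rejected with an error much like the "Invalid missing value: null" above.

```python
import math

def check_missing(missing):
    # Hypothetical sketch of the validation: `missing` must be numeric,
    # because it is passed to the native library as a float. None (or a
    # string) cannot be converted and is rejected.
    if not isinstance(missing, (int, float)):
        raise ValueError(f"Invalid missing value: {missing!r}")
    return float(missing)

print(check_missing(1))             # 1.0  -- an int is fine, it converts
print(check_missing(float("nan")))  # nan  -- NaN is the default sentinel
try:
    check_missing(None)
except ValueError as e:
    print(e)                        # Invalid missing value: None
```

This is also why simply omitting the parameter works: the default is np.nan, which is already a valid float.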

Runnerup answered 24/4, 2021 at 22:43 Comment(3)
Ahh okay, awesome, thanks for the help. Why are you suggesting missing=1, though? Aren't I better off excluding the missing parameter from my XGB regressor in the first place, since I'm not actually missing any data? – Tiliaceous
Sure, you can do that. If you look at the documentation at the following link: xgboost.readthedocs.io/en/latest/python/… "missing (float, default np.nan) – Value in the data which needs to be present as a missing value." I think the problem is that when you specify None, it is treated as an invalid value. The missing value can be a float or NaN (if you exclude the parameter); it cannot be a string. – Runnerup
Yes, it's better to just remove the parameter if you don't have any missing values, because that conveys implicitly that there aren't any missing values in the dataset, hence no rows are being ignored while scoring or predicting. – Runnerup
I had this error too, for xgb.XGBClassifier, when I wanted to draw a confusion matrix. I removed the argument missing=None and it worked. I hope this helps you and anyone else who has the same issue.

Widespread answered 29/5, 2022 at 16:13 Comment(0)
I had the same issue when using missing='nan', and I fixed it by changing it to missing=float('nan').
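A quick check of why the string form fails while the float form works (plain Python, no XGBoost needed): 'nan' is a string even though it looks like NaN, whereas float('nan') is a real float, which is what the missing parameter expects.

```python
import math

m = float("nan")   # a real float NaN -- a valid value for `missing`
print(isinstance(m, float), math.isnan(m))  # True True

s = "nan"          # a string, even though it *looks* like NaN
print(isinstance(s, float))                 # False -- rejected as invalid
```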

Peccavi answered 31/1 at 21:37 Comment(0)
