How to measure xgboost regressor accuracy using accuracy_score (or other suggested function)

I'm writing code to solve a simple problem: predicting the probability that an item is missing from an inventory.

I'm using the XGBoost prediction model to do this.

I have the data split into two .csv files, one with the training data and the other with the test data.

Here is the code:

    import pandas as pd
    import numpy as np


    train = pd.read_csv('C:/Users/pedro/Documents/Pedro/UFMG/8o periodo/Python/Trabalho Final/train.csv', index_col='sku').fillna(-1)
    test = pd.read_csv('C:/Users/pedro/Documents/Pedro/UFMG/8o periodo/Python/Trabalho Final/test.csv', index_col='sku').fillna(-1)


    X_train, y_train = train.drop('isBackorder', axis=1), train['isBackorder']

    import xgboost as xgb
    xg_reg = xgb.XGBRegressor(objective ='reg:linear', colsample_bytree = 0.3, learning_rate = 0.1,
                    max_depth = 10, alpha = 10, n_estimators = 10)
    xg_reg.fit(X_train,y_train)


    y_pred = xg_reg.predict(test)

    # Create file for the competition submission
    test['isBackorder'] = y_pred
    pred = test['isBackorder'].reset_index()
    pred.to_csv('competitionsubmission.csv',index=False)

And here are the functions I use to try to measure the accuracy of the model (RMSE, the accuracy_score function, and KFold cross-validation):

    # RMSE
    from sklearn.metrics import mean_squared_error

    rmse = np.sqrt(mean_squared_error(y_train, y_pred))
    print("RMSE: %f" % (rmse))


    # Accuracy
    from sklearn.metrics import accuracy_score

    # make predictions for test data
    predictions = [round(value) for value in y_pred]

    # evaluate predictions
    accuracy = accuracy_score(y_test, y_pred)
    print("Accuracy: %.2f%%" % (accuracy * 100.0))


    # KFold
    from sklearn.model_selection import KFold
    from sklearn.model_selection import cross_val_score

    # CV model
    kfold = KFold(n_splits=10, random_state=7)
    results = cross_val_score(xg_reg, X_train, y_train, cv=kfold)
    print("Accuracy: %.2f%% (%.2f%%)" % (results.mean()*100, results.std()*100))

But I'm having some problems.

None of the accuracy tests above works.

When using the RMSE function and the accuracy function, the following error appears: ValueError: Found input variables with inconsistent numbers of samples: [1350955, 578982]

I guess the way I'm splitting the train and test data is not correct.

Since I don't have a y_test (and I don't know how to create one for my problem), I can't pass it to the functions above.

The KFold validation isn't working either.

Can someone help me, please?

Yahrzeit answered 3/12, 2019 at 22:25 Comment(0)

Your only issue is that you need validation data. You can't measure accuracy between predict(x_test) and a y_test that doesn't exist. Use sklearn.model_selection.train_test_split to carve a validation set out of your training data. You will then have a train, a validation, and a test set, and you can evaluate the performance of your model on the validation set.

    from sklearn.model_selection import train_test_split

    X_train, X_test, y_train, y_test = train_test_split(x, y)
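To make the split concrete, here is a minimal sketch on made-up arrays. The data, the names X_val and y_val, and the test_size and random_state values are assumptions for illustration only; in the real code, x and y would come from train.csv.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the features and target taken from train.csv.
x = np.arange(40).reshape(20, 2)
y = np.arange(20) % 2

# Hold out 25% of the labelled rows for validation.
# random_state only makes the shuffle reproducible.
X_train, X_val, y_train, y_val = train_test_split(
    x, y, test_size=0.25, random_state=7
)

print(X_train.shape, X_val.shape)
```

Because y_val holds the true labels for X_val, predictions on X_val can be scored, which is not possible for the unlabelled test.csv.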

Other remarks:

Accuracy makes no sense here because you're predicting continuous values. accuracy_score compares class labels, so only use it for categorical targets; for continuous predictions, use a regression metric such as RMSE.
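As a rough illustration of the difference, the sketch below scores the same continuous predictions with regression metrics and, after thresholding, with accuracy. The arrays are made up, and the 0.5 cutoff is an arbitrary assumption (the question's code rounds the predictions, which is similar in spirit).

```python
import numpy as np
from sklearn.metrics import accuracy_score, mean_absolute_error, mean_squared_error

# Hypothetical validation labels (0/1) and continuous regressor outputs.
y_val = np.array([0, 0, 1, 1, 0, 1])
y_val_pred = np.array([0.1, 0.4, 0.8, 0.6, 0.2, 0.3])

# Regression metrics work directly on the continuous predictions.
rmse = np.sqrt(mean_squared_error(y_val, y_val_pred))
mae = mean_absolute_error(y_val, y_val_pred)

# Accuracy needs class labels, so threshold the predictions first.
y_val_label = (y_val_pred >= 0.5).astype(int)
accuracy = accuracy_score(y_val, y_val_label)

print("RMSE: %.3f  MAE: %.3f  Accuracy: %.2f%%" % (rmse, mae, accuracy * 100.0))
```

Note that the labels, not the raw predictions, are what go into accuracy_score.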

At a minimum, this could work:

    import pandas as pd
    import numpy as np
    import xgboost as xgb
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import KFold, cross_val_score, train_test_split

    train = pd.read_csv('C:/Users/pedro/Documents/Pedro/UFMG/8o periodo/Python/Trabalho Final/train.csv', index_col='sku').fillna(-1)
    test_data = pd.read_csv('C:/Users/pedro/Documents/Pedro/UFMG/8o '
                            'periodo/Python/Trabalho Final/test.csv', index_col='sku').fillna(-1)

    # Split the labelled data into a training part and a held-out validation part
    x, y = train.drop('isBackorder', axis=1), train['isBackorder']
    X_train, X_test, y_train, y_test = train_test_split(x, y)

    # 'reg:squarederror' is the current name of the deprecated 'reg:linear' objective
    xg_reg = xgb.XGBRegressor(objective='reg:squarederror', colsample_bytree=0.3, learning_rate=0.1,
                              max_depth=10, alpha=10, n_estimators=10)

    xg_reg.fit(X_train, y_train)

    # Recent scikit-learn versions require shuffle=True when random_state is set
    kfold = KFold(n_splits=10, shuffle=True, random_state=7)
    # For a regressor, the default cross_val_score metric is R^2, not accuracy
    results = cross_val_score(xg_reg, X_train, y_train, cv=kfold)
    print("CV R^2: %.3f (%.3f)" % (results.mean(), results.std()))

    # RMSE on the held-out validation split
    y_test_pred = xg_reg.predict(X_test)
    rmse = np.sqrt(mean_squared_error(y_test, y_test_pred))
    print("RMSE: %f" % rmse)

    # Predict on the unlabelled competition data and keep the sku index
    y_pred = xg_reg.predict(test_data)
    submission = pd.DataFrame({'isBackorder': y_pred}, index=test_data.index)
    submission.reset_index().to_csv('competitionsubmission.csv', index=False)
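
If you would rather have cross-validated RMSE than the default score, something like the sketch below should work. It runs on synthetic data and uses scikit-learn's GradientBoostingRegressor purely as a stand-in for xgb.XGBRegressor so that it needs only scikit-learn; the data, the fold count, and the variable names are assumptions. It also shows that cross_val_score fits clones of the estimator, so the object you pass in is not modified.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data; the real code would use X_train and y_train.
rng = np.random.default_rng(7)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + 0.1 * rng.normal(size=200) > 0).astype(float)

# Stand-in for xgb.XGBRegressor (assumption: any scikit-learn-style regressor).
model = GradientBoostingRegressor(n_estimators=10, random_state=7)

kfold = KFold(n_splits=5, shuffle=True, random_state=7)
# scikit-learn scorers are "higher is better", so RMSE comes back negated.
scores = cross_val_score(
    model, X, y, cv=kfold, scoring="neg_root_mean_squared_error"
)
rmse_per_fold = -scores

print("RMSE per fold:", np.round(rmse_per_fold, 3))
print("Mean RMSE: %.3f (std %.3f)" % (rmse_per_fold.mean(), rmse_per_fold.std()))

# cross_val_score fits clones, so `model` itself is still unfitted here.
print("model fitted by cross_val_score:", hasattr(model, "estimators_"))
```

The per-fold scores are the only output of cross_val_score; to get a model you can predict with, call fit on the estimator yourself.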
Obsequent answered 3/12, 2019 at 22:48 Comment(4)
Hello Nicolas, thank you for the answer. I tried using the train_test_split function, but it didn't work. I guess I didn't use it the right way. I'm not sure how to separate my training set into the variables X and y to use them in train_test_split. Can you explain the right way to do this? One other question: since accuracy makes no sense for continuous values, what is the best way to measure the model's performance? Which function do you suggest I use? - Yahrzeit
See my edit. That's all I can do. It should work. If it doesn't, the errors will be minor. - Obsequent
Hi, could you explain a bit more? You generate 'results' using cross_val_score() but then don't appear to use results again. Does cross_val_score modify the xg_reg object? - Kimmi
@Kimmi see the code in the question; "results" is used in the printing. - Ekaterina
